<div dir="ltr"><br><div class="gmail_extra"><div class="gmail_quote">On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <span dir="ltr"><<a href="mailto:atalur@redhat.com" target="_blank">atalur@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Response inline.<br>
<br>
----- Original Message -----<br>
> From: "Krutika Dhananjay" <<a href="mailto:kdhananj@redhat.com">kdhananj@redhat.com</a>><br>
> To: "David Gossage" <<a href="mailto:dgossage@carouselchecks.com">dgossage@carouselchecks.com</a>><br>
> Cc: "<a href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a> List" <<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a>><br>
> Sent: Monday, August 29, 2016 3:55:04 PM<br>
> Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow<br>
><br>
> Could you attach both client and brick logs? Meanwhile I will try these steps<br>
> out on my machines and see if it is easily recreatable.<br>
><br>
> -Krutika<br>
><br>
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage < <a href="mailto:dgossage@carouselchecks.com">dgossage@carouselchecks.com</a><br>
> > wrote:<br>
><br>
><br>
><br>
> Centos 7 Gluster 3.8.3<br>
><br>
> Brick1: ccgl1.gl.local:/gluster1/<wbr>BRICK1/1<br>
> Brick2: ccgl2.gl.local:/gluster1/<wbr>BRICK1/1<br>
> Brick3: ccgl4.gl.local:/gluster1/<wbr>BRICK1/1<br>
> Options Reconfigured:<br>
> cluster.data-self-heal-<wbr>algorithm: full<br>
> cluster.self-heal-daemon: on<br>
> cluster.locking-scheme: granular<br>
> features.shard-block-size: 64MB<br>
> features.shard: on<br>
> performance.readdir-ahead: on<br>
> storage.owner-uid: 36<br>
> storage.owner-gid: 36<br>
> performance.quick-read: off<br>
> performance.read-ahead: off<br>
> performance.io-cache: off<br>
> performance.stat-prefetch: on<br>
> cluster.eager-lock: enable<br>
> network.remote-dio: enable<br>
> cluster.quorum-type: auto<br>
> cluster.server-quorum-type: server<br>
> server.allow-insecure: on<br>
> cluster.self-heal-window-size: 1024<br>
> cluster.background-self-heal-<wbr>count: 16<br>
> performance.strict-write-<wbr>ordering: off<br>
> nfs.disable: on<br>
> nfs.addr-namelookup: off<br>
> nfs.enable-ino32: off<br>
> cluster.granular-entry-heal: on<br>
><br>
> Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.<br>
> Following the steps detailed in previous recommendations, I began the process of<br>
> replacing and healing bricks one node at a time.<br>
><br>
> 1) kill pid of brick<br>
> 2) reconfigure brick from raid6 to raid10<br>
> 3) recreate directory of brick<br>
> 4) gluster volume start <> force<br>
> 5) gluster volume heal <> full<br>
Hi,<br>
<br>
I'd suggest not using full heal; there are a few bugs in it.<br>
Better safe than sorry ;)<br>
Instead I'd suggest the following steps:<br>
<br></blockquote><div>Currently I have brought the node down via systemctl stop glusterd, as I was getting sporadic I/O issues and a few VMs paused, so I'm hoping that will help. I may wait until around 4 PM, when most work is done, to do this, in case it drives load up.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
1) kill the pid of the brick<br>
2) do the reconfiguring of the brick that you need<br>
3) recreate the brick dir<br>
4) while the brick is still down, from the mount point:<br>
a) create a dummy non-existent dir under / of the mount.<br></blockquote><div><br></div><div>So if node 2 is the down brick, do I pick another node, for example node 3, and make a test dir under its brick directory that doesn't exist on node 2? Or should I be doing this over a gluster mount? </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
b) set a non-existent extended attribute on / of the mount.<br></blockquote><div><br></div><div>Could you give me an example of an attribute to set? I've read a tad on this and looked up attributes, but haven't set any yet myself.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Doing these steps will ensure that heal happens only from the updated bricks to the down brick.<br>
5) gluster v start <> force<br>
6) gluster v heal <><br></blockquote><div><br></div><div>Will it matter if the full heal command was run somewhere in gluster the other day? Not sure whether it eventually stops or times out. </div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
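</blockquote><div>Pulling the steps above together, a sketch of the whole sequence. This is an outline, not a tested script: the volume name GLUSTER1 and brick path are taken from the listing earlier in the thread, while the pid and mount point are placeholders.</div>

```shell
# 1) find and kill the pid of the dead node's brick (Pid column of status)
gluster volume status GLUSTER1
BRICK_PID=12345            # placeholder: use the pid shown by status
kill "$BRICK_PID"

# 2) reconfigure the underlying storage as needed (raid6 -> raid10 here)
# 3) recreate the brick directory on the new filesystem
mkdir -p /gluster1/BRICK1/1

# 4) while the brick is still down, from the FUSE mount point (placeholder):
mkdir /mnt/glusterfs/dummy-nonexistent-dir       # 4a) dummy dir
setfattr -n user.dummy-heal -v 1 /mnt/glusterfs  # 4b) dummy xattr

# 5) + 6) restart the volume and trigger an index (not full) heal
gluster volume start GLUSTER1 force
gluster volume heal GLUSTER1
```

<div>The dummy dir and xattr mark / of the volume as modified, which is what makes the heal flow only from the updated bricks to the recreated one.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">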
><br>
> 1st node worked as expected took 12 hours to heal 1TB data. Load was little<br>
> heavy but nothing shocking.<br>
><br>
> About an hour after node 1 finished, I began the same process on node 2. The heal<br>
> process kicked in as before, and the files in directories visible from the mount<br>
> and .glusterfs healed in short time. Then it began a crawl of .shard, adding<br>
> those files to the heal count, at which point the entire process basically ground<br>
> to a halt. After 48 hours, out of 19k shards it has added 5900 to the heal list.<br>
> Load on all 3 machines is negligible. It was suggested to change<br>
> cluster.data-self-heal-algorithm to full and restart the volume, which I did. No<br>
> effect. Tried relaunching the heal, no effect, regardless of which node I picked. I<br>
> started each VM and performed a stat of all files from within it, or a full<br>
> virus scan, and that seemed to cause short, small spikes in shards added, but<br>
> not by much. Logs show no real messages indicating anything is going<br>
> on. I occasionally get hits of null lookups in the brick log, making me think it's<br>
> not really crawling the .shard directory but waiting for a shard lookup to add<br>
> it. I'll get the following in the brick log, but not constantly, and sometimes<br>
> multiple entries for the same shard.<br>
><br>
> [2016-08-29 08:31:57.478125] W [MSGID: 115009]<br>
> [server-resolve.c:569:server_<wbr>resolve] 0-GLUSTER1-server: no resolution type<br>
> for (null) (LOOKUP)<br>
> [2016-08-29 08:31:57.478170] E [MSGID: 115050]<br>
> [server-rpc-fops.c:156:server_<wbr>lookup_cbk] 0-GLUSTER1-server: 12591783:<br>
> LOOKUP (null) (00000000-0000-0000-00<br>
> 00-000000000000/241a55ed-f0d5-<wbr>4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid<br>
> argument) [Invalid argument]<br>
><br>
> This one repeated about 30 times in a row, then nothing for 10 minutes, then one<br>
> hit for a different shard by itself.<br>
><br>
> How can I determine if the heal is actually running? How can I kill it or force a<br>
> restart? Does the node I start it from determine which directory gets crawled to<br>
> determine heals?<br>
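On the monitoring questions above, a hedged note (exact output varies across 3.8.x versions; GLUSTER1 is the volume name from the option listing):<br>

```shell
# Entries still pending heal, listed per brick
gluster volume heal GLUSTER1 info

# Just the per-brick counts (much quicker than the full listing)
gluster volume heal GLUSTER1 statistics heal-count

# Self-heal daemon crawl statistics: crawl start time, healed, failed
gluster volume heal GLUSTER1 statistics
```

The self-heal daemon on each node logs to /var/log/glusterfs/glustershd.log, and as far as I know each node's daemon crawls the indices of its own local bricks, so the node you issue the heal command from should not determine which bricks get crawled.<br>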
><br>
> David Gossage<br>
> Carousel Checks Inc. | System Administrator<br>
> Office 708.613.2284<br>
><br>
> ______________________________<wbr>_________________<br>
> Gluster-users mailing list<br>
> <a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
> <a href="http://www.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://www.gluster.org/<wbr>mailman/listinfo/gluster-users</a><br>
><br>
><br>
<span class="HOEnZb"><font color="#888888"><br>
--<br>
Thanks,<br>
Anuradha.<br>
</font></span></blockquote></div><br></div></div>