<div dir="ltr"><br clear="all"><div>Apologies for the format below; I&#39;ve been getting emails in digest format, so I had to copy and paste the email below:</div><div><br></div><div><br></div><div>We have been observing the same issue with glusterfs-3.7.8-1.el7.x86_64, but our investigation shows the following:</div><div><br></div><div>- Even after SHD (the self-heal daemon) is turned off, the auto-heal process still causes the same amount of CPU activity.</div><div>- This happens with NFS mounts as well; we&#39;ve tried both FUSE and NFS v3.</div><div>- It gets triggered when adding new nodes/bricks to replicated (or replicated+distributed) volumes.</div><div>- It is triggered especially when auto-heal (or SHD, for that matter) descends into a directory with 300K+ files in it.</div><div>- We&#39;ve used strace/sysdig and other methods; the culprit seems to be hard-coded SHD or auto-heal threads. We&#39;ve observed that SHD uses 4 threads per brick, and we couldn&#39;t find any configuration option to reduce the thread count.</div><div>- We&#39;ve tried CPU pinning and other measures; this didn&#39;t solve the root cause, but it helped reduce CPU load on our servers. Without CPU pinning, load goes as high as 100 on servers with 16 cores/32 CPUs (with HT).</div><div><br></div><div>Please let us know if this seems like a bug that needs to be filed.</div><div><br></div><div><br>Thank you</div><div><br></div><div><br></div><div><pre style="white-space:pre-wrap;color:rgb(0,0,0)">Hello,

On 03/23/2016 06:35 PM, Ravishankar N wrote:

&gt;<i> On 03/23/2016 09:53 PM, Marian Marinov wrote:

&gt;&gt;&gt; &gt;What version of gluster is this?

</i>&gt;&gt;<i> 3.7.6

</i>&gt;&gt;<i>

</i>&gt;&gt;&gt;<i> &gt;Do you observe the problem even when only the 4th &#39;non data&#39; server

&gt;&gt;&gt; comes up? In that case it is unlikely that self-heal is the issue.

</i>&gt;&gt;<i> No

</i>&gt;&gt;<i>

&gt;&gt;&gt; &gt;Are the clients using FUSE or NFS mounts?

</i>&gt;&gt;<i> FUSE

</i>&gt;&gt;<i>

</i>&gt;<i> 

</i>&gt;<i> Okay, when you say the cluster stalls, I&#39;m assuming the apps using

</i>&gt;<i> files via the fuse mount are stalled. Does the mount log contain

</i>&gt;<i> messages about completing selfheals on files when the mount eventually

</i>&gt;<i> becomes responsive? If yes, you could try setting

&gt; &#39;cluster.data-self-heal&#39; to off.

</i>

Yes we have many lines with similar entries in the logs:

[2016-03-22 11:10:23.398668] I [MSGID: 108026]

[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0:

Completed data selfheal on b18c2b05-7186-4c22-ab34-24858b1153e5.

source=0 sinks=2

[2016-03-23 13:11:54.110773] I [MSGID: 108026]

[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0:

Completed metadata selfheal on 591d2bee-b55c-4dd6-a1bc-8b7fc5571caa.

source=0 sinks=
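
A minimal Python sketch for tallying entries like these by heal type. It assumes each entry has first been joined back onto a single line (the entries above are wrapped by the mail client). The function name and the regular expression are illustrative and are not part of GlusterFS:

```python
import re
from collections import Counter

# Matches "Completed ... selfheal" entries (MSGID 108026) as quoted above,
# once each wrapped entry has been joined back into one line.
SELFHEAL_RE = re.compile(
    r"\[MSGID: 108026\] .*Completed (\w+) selfheal on ([0-9a-f-]{36})"
)

def count_selfheals(lines):
    """Return a Counter mapping heal type (e.g. data, metadata) to its count."""
    counts = Counter()
    for line in lines:
        match = SELFHEAL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The two entries from this message, joined into single lines.
SAMPLE = [
    "[2016-03-22 11:10:23.398668] I [MSGID: 108026] "
    "[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0: "
    "Completed data selfheal on b18c2b05-7186-4c22-ab34-24858b1153e5. "
    "source=0 sinks=2",
    "[2016-03-23 13:11:54.110773] I [MSGID: 108026] "
    "[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0: "
    "Completed metadata selfheal on 591d2bee-b55c-4dd6-a1bc-8b7fc5571caa. "
    "source=0 sinks=",
]

if __name__ == "__main__":
    for kind, total in sorted(count_selfheals(SAMPLE).items()):
        print(kind, total)
```

Running the same function over the full mount log would show whether data or metadata heals dominate around the time of a stall.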

We already tested setting cluster.self-heal-daemon off and we did not

experience the issue in this case. We stopped one node, disabled

self-heal-daemon, started the node and later enabled the

self-heal-daemon. There was no &quot;stalling&quot; in this case.

We will try the suggested setting too.
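
The sequence described above (disable the self-heal daemon, start the node, re-enable the daemon), together with the suggested cluster.data-self-heal setting, could be scripted roughly as follows. This is only a sketch: the volume name "share" is inferred from the "0-share-replicate-0" log prefix and may be wrong, the helper names are hypothetical, and nothing is executed unless dry_run is set to False.

```python
import subprocess

# Assumption: volume name inferred from the "0-share-replicate-0" log prefix.
VOLUME = "share"

def volume_set(volume, option, value):
    """Build the argv for: gluster volume set VOLUME OPTION VALUE."""
    if value not in ("on", "off"):
        raise ValueError("value must be 'on' or 'off'")
    return ["gluster", "volume", "set", volume, option, value]

def rejoin_plan(volume):
    """Group the commands by when they would be run around a node restart."""
    return {
        "before_node_start": [
            volume_set(volume, "cluster.self-heal-daemon", "off"),
        ],
        "after_node_start": [
            volume_set(volume, "cluster.self-heal-daemon", "on"),
        ],
        "suggested": [
            volume_set(volume, "cluster.data-self-heal", "off"),
        ],
    }

def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)

if __name__ == "__main__":
    # Dry run: only prints the command that would be issued.
    run(rejoin_plan(VOLUME)["before_node_start"])
```

Keeping the default as a dry run makes it easy to review the exact commands before touching a production volume.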

-- 

Dimitar Ianakiev

System Administrator

<a href="http://www.siteground.com">www.siteground.com</a></pre></div><div><br></div><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><div style="text-align:right"><span style="font-size:12.8px">Kayra Otaner</span></div><div style="text-align:right"><span style="font-size:12.8px">BilgiO A.Ş. -  SecOps Experts</span></div><div style="text-align:right"><span style="font-size:12.8px">PGP KeyID : A945251E </span><span style="font-size:12.8px">| Manager, Enterprise Linux Solutions</span></div></div><div style="text-align:right"><a href="http://www.bilgio.com" target="_blank">www.bilgio.com</a> |  TR +90 (532) 111-7240 x 1001 | US +1 (201) 206-2592</div></div></div></div></div></div></div></div></div>

</div>