<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<br>
<div class="moz-cite-prefix">On 01/25/2016 09:11 AM, David Robinson
wrote:<br>
</div>
<blockquote
cite="mid:em809bc756-d377-440b-8d2a-62cbd5ef7a55@dfrobins-vaio"
type="cite">
<style id="eMClientCss">blockquote.cite { margin-left: 5px; margin-right: 0px; padding-left: 10px; padding-right:0px; border-left: 1px solid #cccccc }
blockquote.cite2 {margin-left: 5px; margin-right: 0px; padding-left: 10px; padding-right:0px; border-left: 1px solid #cccccc; margin-top: 3px; padding-top: 0px; }
.plain pre, .plain tt { font-family: monospace; font-size: 100%; font-weight: normal; font-style: normal;}
a img { border: 0px; }body {font-family: Times New Roman;font-size: 12pt;}
.plain pre, .plain tt {font-family: Times New Roman;font-size: 12pt;}
</style>
<div>A lot more than 128 clients. Well over 1000. And I believe
we might have found the problem; it looks like you were headed
in the right direction, as it appears to be a problem with
one of the clients' FUSE mounts. </div>
<div> </div>
<div>When we couldn't resolve the issue, I started moving all of
my users off of the gluster storage system as it was no longer
responsive. After moving all of them off, I tried to kill all
of the clients that had homegfs mounted by doing a 'killall
glusterfs' on all of the machines connected to gluster. There
was one machine where even after killing all of the glusterfs
processes and checking to make sure no glusterfs was running,
'mount' still showed the FUSE mount. After I did a 'umount -lf
/homegfs' it finally went away. </div>
<div> </div>
<div>After I killed the client mounts and restarted all of them,
we haven't had any more issues with out-of-control loads on the
storage systems. We had seen this before with a runaway FUSE
mount, but that time we found the problem by looking at the load
on all of the clients: the one problem node had an extremely
high load that was far out of the norm, and resetting its FUSE
mount cleared the problem. In this case, there was no indication
of which client was causing the issue, and the only way to
figure it out was to take the storage system out of production
use. </div>
<div> </div>
<div>My understanding is that the FUSE client writes to both
bricks of the replica pair at the same time. Does it make sense
that it stopped writing to one of the bricks, and that
therefore everything written by that FUSE mount had to be
healed? In a normal scenario there shouldn't be any (or very
few) heals, right? </div>
<div> </div>
<div>Is there any better way to trace this issue in the
future? Is there a way to figure out which mount is not
connected properly, or which mount is causing all of the heals?
Or, alternatively, is there a way to force all of the clients to
remount without going to each client and killing its
glusterfs process? This obviously becomes difficult when you
have thousands of clients connected.</div>
</blockquote>
<br>
You are the only responsive user I know of with this kind of setup,
where a lot of mounts are connected to the volume. Most of the
corner-case bugs in the client-table expand logic (which is hit
when there are more than 128 clients) have been found by you since
Oct-2014, when I started assisting you :-). Your inputs are valuable
here. Please provide the log file of the bad mount so we can see
what it was doing. I will think a bit more about the enhancements
we need to make debugging easier in your case.<br>
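To David's question about spotting a stale mount without visiting every client by hand, one low-tech sketch (assuming Linux clients; the helper name is made up, not a gluster tool):

```shell
# Sketch: list gluster FUSE mount points from a mount table so that
# stale entries (mounts with no backing glusterfs process) stand out.
# The table argument exists only for testing; on a real client it
# defaults to /proc/mounts.
list_gluster_mounts() {
    awk '$3 == "fuse.glusterfs" { print $2 }' "${1:-/proc/mounts}"
}

# On each client: any mount listed here while 'pidof glusterfs' shows
# no process is a stale mount and can be cleared with 'umount -lf'.
if [ -r /proc/mounts ]; then
    list_gluster_mounts
fi
```

Run across all clients (e.g. via pdsh or a for loop over ssh), this would flag the one machine that still shows a FUSE mount after the processes are gone, as happened above.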
<br>
Pranith<br>
<blockquote
cite="mid:em809bc756-d377-440b-8d2a-62cbd5ef7a55@dfrobins-vaio"
type="cite">
<div> </div>
<div>David</div>
<div> </div>
<div> </div>
<div> </div>
<div> </div>
<div>------ Original Message ------</div>
<div>From: "Pranith Kumar Karampuri" <<a moz-do-not-send="true"
href="mailto:pkarampu@redhat.com">pkarampu@redhat.com</a>></div>
<div>To: "Glomski, Patrick" <<a moz-do-not-send="true"
href="mailto:patrick.glomski@corvidtec.com">patrick.glomski@corvidtec.com</a>></div>
<div>Cc: "David Robinson" <<a moz-do-not-send="true"
href="mailto:drobinson@corvidtec.com">drobinson@corvidtec.com</a>>;
<a class="moz-txt-link-rfc2396E" href="mailto:gluster-users@gluster.org">"gluster-users@gluster.org"</a> <<a moz-do-not-send="true"
href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a>>;
"Gluster Devel" <<a moz-do-not-send="true"
href="mailto:gluster-devel@gluster.org">gluster-devel@gluster.org</a>></div>
<div>Sent: 1/24/2016 10:22:04 PM</div>
<div>Subject: Re: [Gluster-users] [Gluster-devel] heal hanging</div>
<div> </div>
<div id="xb6f9a08511b04930b21397a929bbbabf" style="COLOR: #000000">
<blockquote class="cite2" cite="56A594DC.6030804@redhat.com"
type="cite">You guys use more than 128 clients, don't you? We
recently found a memory corruption in the client table, which is
used in locking; I wonder if it has some role to play here.<br>
<a moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://review.gluster.org/13241">http://review.gluster.org/13241</a>
is the fix. Could you check whether you still see this issue
after applying it?<br>
<br>
Pranith<br>
<div class="moz-cite-prefix">On 01/22/2016 08:36 AM, Glomski,
Patrick wrote:<br>
</div>
<blockquote class="cite"
cite="mid:CALkMjdCZRYOvhNGOrCFS9v6Y-vOhX2do0HA-N=CpMf1OBo4+dg@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>Pranith, attached are stack traces collected every
second for 20 seconds from the high-%cpu glusterfsd
process.<br>
<br>
</div>
Patrick<br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Jan 21, 2016 at 9:46 PM,
Glomski, Patrick <span dir="ltr"><<a
href="mailto:patrick.glomski@corvidtec.com"
moz-do-not-send="true">patrick.glomski@corvidtec.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="PADDING-LEFT:
1ex; BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px
0.8ex">
<div dir="ltr">
<div>Last entry for get_real_filename on any of the
bricks was when we turned off the samba gfapi vfs
plugin earlier today:<br>
<br>
/var/log/glusterfs/bricks/data-brick01a-homegfs.log:[2016-01-21
15:13:00.008239] E
[server-rpc-fops.c:768:server_getxattr_cbk]
0-homegfs-server: 105: GETXATTR /wks_backup
(40e582d6-b0c7-4099-ba88-9168a3c32ca6)
(glusterfs.get_real_filename:desktop.ini) ==>
(Permission denied)<br>
<br>
</div>
We'll get back to you with those traces when %cpu
spikes again. As with most sporadic problems, as
soon as you want something out of it, the issue
becomes harder to reproduce.<br>
<div>
<div><br>
</div>
</div>
</div>
<div class="HOEnZb">
<div class="h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Jan 21, 2016 at
9:21 PM, Pranith Kumar Karampuri <span
dir="ltr"><<a
href="mailto:pkarampu@redhat.com"
moz-do-not-send="true">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="PADDING-LEFT: 1ex; BORDER-LEFT: #ccc
1px solid; MARGIN: 0px 0px 0px 0.8ex">
<div text="#000000" bgcolor="#FFFFFF"><span><br>
<br>
<div>On 01/22/2016 07:25 AM, Glomski,
Patrick wrote:<br>
</div>
</span><span>
<blockquote class="cite" type="cite">
<div dir="ltr">Unfortunately, all
samba mounts to the gluster volume
through the gfapi vfs plugin have
been disabled for the last 6 hours
or so and frequency of %cpu spikes
is increased. We had switched to
sharing a fuse mount through samba,
but I just disabled that as well.
There are no samba shares of this
volume now. The spikes now happen
every thirty minutes or so. We've
resorted to just rebooting the
machine with high load for the
present.<br>
</div>
</blockquote>
<br>
</span>Could you check whether logs of the
following type have stopped appearing entirely?<br>
[2016-01-21 15:13:00.005736] E
[server-rpc-fops.c:768:server_getxattr_cbk]
0-homegfs-server: 110: GETXATTR
/wks_backup
(40e582d6-b0c7-4099-ba88-9168a3c<br>
32ca6)
(glusterfs.get_real_filename:desktop.ini)
==> (Permission denied)<br>
<br>
These are operations that failed.
Operations that succeed are the ones that
will scan the directory. But I don't have
a way to find them other than using
tcpdumps.<br>
<br>
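A quick way to check this, sketched against the brick-log path Patrick quoted earlier (the helper name is made up):

```shell
# Count how many get_real_filename entries a brick log contains; if
# the count stops growing, those operations have stopped arriving.
count_grf_errors() {
    grep -c 'glusterfs\.get_real_filename' "$1"
}

# Brick-log path as shown earlier in this thread.
log=/var/log/glusterfs/bricks/data-brick01a-homegfs.log
if [ -f "$log" ]; then
    count_grf_errors "$log"
fi
```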
At the moment I have 2 theories:<br>
1) these get_real_filename calls<br>
2) [2016-01-21 16:10:38.017828] E
[server-helpers.c:46:gid_resolve]
0-gid-cache: getpwuid_r(494) failed<br>
"<br>
<p class="MsoNormal"><span
style="FONT-SIZE: 11pt; FONT-FAMILY:
"Calibri","sans-serif";
COLOR: #1f497d">Yessir they are.
Normally, sssd would look to the local
cache file in /var/lib/sss/db/ first,
to get any group or userid
information, then go out to the domain
controller. I put the options that we
are using on our GFS volumes below…
Thanks for your help.</span></p>
<p class="MsoNormal"><span
style="FONT-SIZE: 11pt; FONT-FAMILY:
"Calibri","sans-serif";
COLOR: #1f497d"> </span></p>
<p class="MsoNormal"><span
style="FONT-SIZE: 11pt; FONT-FAMILY:
"Calibri","sans-serif";
COLOR: #1f497d">We had been running
sssd with sssd_nss and sssd_be
sub-processes on these systems for a
long time, under the GFS 3.5.2 code,
and not run into the problem that
David described with the high cpu
usage on sssd_nss.</span></p>
<b><span>"<br>
</span></b>That was Tom Young's email
from 1.5 years back, when we debugged it. But
the process that was consuming a lot of CPU
then was sssd_nss, so I am not sure it is the
same issue. Let us debug to confirm that '1)'
doesn't happen. The gstack traces I asked
for should also help.
<div>
<div><br>
<br>
Pranith<br>
<blockquote class="cite" type="cite">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu,
Jan 21, 2016 at 8:49 PM, Pranith
Kumar Karampuri <span dir="ltr"><<a
href="mailto:pkarampu@redhat.com" moz-do-not-send="true">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="PADDING-LEFT: 1ex;
BORDER-LEFT: #ccc 1px solid;
MARGIN: 0px 0px 0px 0.8ex">
<div text="#000000"
bgcolor="#FFFFFF"><span><br>
<br>
<div>On 01/22/2016 07:13
AM, Glomski, Patrick
wrote:<br>
</div>
<blockquote class="cite"
type="cite">
<div dir="ltr">We use
the samba glusterfs
virtual filesystem
(the current version
provided on <a
href="http://download.gluster.org/"
moz-do-not-send="true">download.gluster.org</a>), but no windows clients
connecting directly.<br>
</div>
</blockquote>
<br>
</span>Hmm.. Is there a way
to disable this and check
whether the CPU% still
increases? What a getxattr of
"glusterfs.get_real_filename
<filename>" does is scan the
entire directory, doing
strcasecmp(<filename>,
<scanned-filename>) on each
entry. If anything matches, it
returns that
<scanned-filename>.
The problem is that the scan
is costly, so I wonder if
this is the reason for the
CPU spikes.<span><font
color="#888888"><br>
<br>
Pranith</font></span>
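The cost Pranith describes can be pictured with a shell sketch of the same scan (illustrative only; the real work happens inside the brick process in C):

```shell
# Illustrative model of the glusterfs.get_real_filename lookup:
# compare the requested name case-insensitively against every entry
# in the directory, i.e. O(number of entries) work per getxattr.
get_real_filename_scan() {
    dir=$1 want=$2
    for entry in "$dir"/*; do
        name=${entry##*/}
        # strcasecmp-style comparison: lowercase both sides.
        if [ "$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')" = \
             "$(printf '%s' "$want" | tr '[:upper:]' '[:lower:]')" ]; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}
```

A large directory pays that full scan on every such lookup, which would be consistent with the CPU spikes tracking the samba gfapi traffic.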
<div>
<div><br>
<blockquote class="cite"
type="cite">
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Thu, Jan 21, 2016
at 8:37 PM,
Pranith Kumar
Karampuri <span
dir="ltr"><<a
href="mailto:pkarampu@redhat.com" moz-do-not-send="true">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="PADDING-LEFT:
1ex;
BORDER-LEFT:
#ccc 1px solid;
MARGIN: 0px 0px
0px 0.8ex">
<div
text="#000000"
bgcolor="#FFFFFF">Do you have any windows clients? I see a lot of
getxattr calls
for
"glusterfs.get_real_filename"
which lead to
full readdirs
of the
directories on
the brick.<span><font
color="#888888"><br>
<br>
Pranith</font></span><span><br>
<br>
<div>On
01/22/2016
12:51 AM,
Glomski,
Patrick wrote:<br>
</div>
</span>
<div>
<div>
<blockquote
class="cite"
type="cite">
<div dir="ltr">
<div>Pranith,
could this
kind of
behavior be
self-inflicted
by us deleting
files directly
from the
bricks? We
have done that
in the past to
clean up
issues where
gluster
wouldn't allow
us to delete
from the
mount.<br>
<br>
If so, is it
feasible to
clean them up
by running a
search on the
.glusterfs
directories
directly and
removing files
with a
reference
count of 1
that are
non-zero size
(or directly
checking the
xattrs to be
sure that it's
not a DHT
link). <br>
<br>
find
/data/brick01a/homegfs/.glusterfs
-type f -not
-empty -links
-2 -exec rm -f
"{}" \;<br>
<br>
</div>
Is there
anything I'm
inherently
missing with
that approach
that will
further
corrupt the
system?<br>
<div><br>
</div>
</div>
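Patrick's find could be made a little safer by printing candidates for review before deleting anything; a sketch, where the trusted.glusterfs.dht.linkto xattr name is an assumption about this gluster version and should be verified with getfattr on a known linkfile first:

```shell
# Review-first version of the cleanup idea above: list .glusterfs
# files whose only remaining hard link is the .glusterfs one (link
# count 1) and that are non-empty, skipping anything carrying the DHT
# link xattr. The xattr name is an assumption; verify it before use.
orphan_candidates() {
    find "$1" -type f ! -empty -links 1 -print0 |
    while IFS= read -r -d '' f; do
        if ! getfattr -n trusted.glusterfs.dht.linkto \
                --absolute-names "$f" >/dev/null 2>&1; then
            printf 'orphan candidate: %s\n' "$f"
        fi
    done
}

# Inspect the list before removing anything, e.g.:
# orphan_candidates /data/brick01a/homegfs/.glusterfs
```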
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Thu, Jan 21,
2016 at 1:02
PM, Glomski,
Patrick <span
dir="ltr"><<a
href="mailto:patrick.glomski@corvidtec.com" moz-do-not-send="true">patrick.glomski@corvidtec.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="PADDING-LEFT:
1ex;
BORDER-LEFT:
#ccc 1px
solid; MARGIN:
0px 0px 0px
0.8ex">
<div dir="ltr">
<div>
<div>Load
spiked again:
~1200%cpu on
gfs02a for
glusterfsd.
Crawl has been
running on one
of the bricks
on gfs02b for
25 min or so
and users
cannot access
the volume.<br>
<br>
I re-listed
the xattrop
directories as
well as a
'top' entry
and heal
statistics.
Then I
restarted the
gluster
services on
gfs02a. <br>
<br>
===================
top
===================<br>
PID USER
PR NI VIRT
RES SHR S
%CPU %MEM
TIME+
COMMAND
<br>
8969
root 20
0 2815m 204m
3588 S 1181.0
0.6 591:06.93
glusterfsd
<br>
<br>
===================
xattrop
===================<br>
/data/brick01a/homegfs/.glusterfs/indices/xattrop:<br>
xattrop-41f19453-91e4-437c-afa9-3b25614de210
xattrop-9b815879-2f4d-402b-867c-a6d65087788c<br>
<br>
/data/brick02a/homegfs/.glusterfs/indices/xattrop:<br>
xattrop-70131855-3cfb-49af-abce-9d23f57fb393
xattrop-dfb77848-a39d-4417-a725-9beca75d78c6<br>
<br>
/data/brick01b/homegfs/.glusterfs/indices/xattrop:<br>
e6e47ed9-309b-42a7-8c44-28c29b9a20f8
xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125<br>
xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0<br>
<br>
/data/brick02b/homegfs/.glusterfs/indices/xattrop:<br>
xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413<br>
<br>
/data/brick01a/homegfs/.glusterfs/indices/xattrop:<br>
xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531<br>
<br>
/data/brick02a/homegfs/.glusterfs/indices/xattrop:<br>
xattrop-7e20fdb1-5224-4b9a-be06-568708526d70<br>
<br>
/data/brick01b/homegfs/.glusterfs/indices/xattrop:<br>
8034bc06-92cd-4fa5-8aaf-09039e79d2c8
c9ce22ed-6d8b-471b-a111-b39e57f0b512<br>
94fa1d60-45ad-4341-b69c-315936b51e8d
xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7<br>
<br>
/data/brick02b/homegfs/.glusterfs/indices/xattrop:<br>
xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d<br>
<br>
<br>
===================
heal stats
===================<br>
<br>
homegfs
[b0-gfsib01a]
: Starting
time of
crawl :
Thu Jan 21
12:36:45 2016<br>
homegfs
[b0-gfsib01a]
: Ending time
of
crawl
: Thu Jan 21
12:36:45 2016<br>
homegfs
[b0-gfsib01a]
: Type of
crawl: INDEX<br>
homegfs
[b0-gfsib01a]
: No. of
entries
healed
: 0<br>
homegfs
[b0-gfsib01a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b0-gfsib01a]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b1-gfsib01b]
: Starting
time of
crawl :
Thu Jan 21
12:36:19 2016<br>
homegfs
[b1-gfsib01b]
: Ending time
of
crawl
: Thu Jan 21
12:36:19 2016<br>
homegfs
[b1-gfsib01b]
: Type of
crawl: INDEX<br>
homegfs
[b1-gfsib01b]
: No. of
entries
healed
: 0<br>
homegfs
[b1-gfsib01b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b1-gfsib01b]
: No. of heal
failed
entries : 1<br>
<br>
homegfs
[b2-gfsib01a]
: Starting
time of
crawl :
Thu Jan 21
12:36:48 2016<br>
homegfs
[b2-gfsib01a]
: Ending time
of
crawl
: Thu Jan 21
12:36:48 2016<br>
homegfs
[b2-gfsib01a]
: Type of
crawl: INDEX<br>
homegfs
[b2-gfsib01a]
: No. of
entries
healed
: 0<br>
homegfs
[b2-gfsib01a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b2-gfsib01a]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b3-gfsib01b]
: Starting
time of
crawl :
Thu Jan 21
12:36:47 2016<br>
homegfs
[b3-gfsib01b]
: Ending time
of
crawl
: Thu Jan 21
12:36:47 2016<br>
homegfs
[b3-gfsib01b]
: Type of
crawl: INDEX<br>
homegfs
[b3-gfsib01b]
: No. of
entries
healed
: 0<br>
homegfs
[b3-gfsib01b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b3-gfsib01b]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b4-gfsib02a]
: Starting
time of
crawl :
Thu Jan 21
12:36:06 2016<br>
homegfs
[b4-gfsib02a]
: Ending time
of
crawl
: Thu Jan 21
12:36:06 2016<br>
homegfs
[b4-gfsib02a]
: Type of
crawl: INDEX<br>
homegfs
[b4-gfsib02a]
: No. of
entries
healed
: 0<br>
homegfs
[b4-gfsib02a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b4-gfsib02a]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b5-gfsib02b]
: Starting
time of
crawl :
Thu Jan 21
12:13:40 2016<br>
homegfs
[b5-gfsib02b]
:
*** Crawl is
in progress
***<br>
homegfs
[b5-gfsib02b]
: Type of
crawl: INDEX<br>
homegfs
[b5-gfsib02b]
: No. of
entries
healed
: 0<br>
homegfs
[b5-gfsib02b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b5-gfsib02b]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b6-gfsib02a]
: Starting
time of
crawl :
Thu Jan 21
12:36:58 2016<br>
homegfs
[b6-gfsib02a]
: Ending time
of
crawl
: Thu Jan 21
12:36:58 2016<br>
homegfs
[b6-gfsib02a]
: Type of
crawl: INDEX<br>
homegfs
[b6-gfsib02a]
: No. of
entries
healed
: 0<br>
homegfs
[b6-gfsib02a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b6-gfsib02a]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b7-gfsib02b]
: Starting
time of
crawl :
Thu Jan 21
12:36:50 2016<br>
homegfs
[b7-gfsib02b]
: Ending time
of
crawl
: Thu Jan 21
12:36:50 2016<br>
homegfs
[b7-gfsib02b]
: Type of
crawl: INDEX<br>
homegfs
[b7-gfsib02b]
: No. of
entries
healed
: 0<br>
homegfs
[b7-gfsib02b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b7-gfsib02b]
: No. of heal
failed
entries : 0<br>
<br>
<br>
========================================================================================<br>
</div>
I waited a few
minutes for
the heals to
finish and ran
the heal
statistics and
info again.
one file is in
split-brain.
Aside from the
split-brain,
the load on
all systems is
down now and
they are
behaving
normally.
glustershd.log
is attached.
What is going
on??? <br>
<br>
Thu Jan 21
12:53:50 EST
2016<br>
<br>
===================
homegfs
===================<br>
<br>
homegfs
[b0-gfsib01a]
: Starting
time of
crawl :
Thu Jan 21
12:53:02 2016<br>
homegfs
[b0-gfsib01a]
: Ending time
of
crawl
: Thu Jan 21
12:53:02 2016<br>
homegfs
[b0-gfsib01a]
: Type of
crawl: INDEX<br>
homegfs
[b0-gfsib01a]
: No. of
entries
healed
: 0<br>
homegfs
[b0-gfsib01a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b0-gfsib01a]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b1-gfsib01b]
: Starting
time of
crawl :
Thu Jan 21
12:53:38 2016<br>
homegfs
[b1-gfsib01b]
: Ending time
of
crawl
: Thu Jan 21
12:53:38 2016<br>
homegfs
[b1-gfsib01b]
: Type of
crawl: INDEX<br>
homegfs
[b1-gfsib01b]
: No. of
entries
healed
: 0<br>
homegfs
[b1-gfsib01b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b1-gfsib01b]
: No. of heal
failed
entries : 1<br>
<br>
homegfs
[b2-gfsib01a]
: Starting
time of
crawl :
Thu Jan 21
12:53:04 2016<br>
homegfs
[b2-gfsib01a]
: Ending time
of
crawl
: Thu Jan 21
12:53:04 2016<br>
homegfs
[b2-gfsib01a]
: Type of
crawl: INDEX<br>
homegfs
[b2-gfsib01a]
: No. of
entries
healed
: 0<br>
homegfs
[b2-gfsib01a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b2-gfsib01a]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b3-gfsib01b]
: Starting
time of
crawl :
Thu Jan 21
12:53:04 2016<br>
homegfs
[b3-gfsib01b]
: Ending time
of
crawl
: Thu Jan 21
12:53:04 2016<br>
homegfs
[b3-gfsib01b]
: Type of
crawl: INDEX<br>
homegfs
[b3-gfsib01b]
: No. of
entries
healed
: 0<br>
homegfs
[b3-gfsib01b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b3-gfsib01b]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b4-gfsib02a]
: Starting
time of
crawl :
Thu Jan 21
12:53:33 2016<br>
homegfs
[b4-gfsib02a]
: Ending time
of
crawl
: Thu Jan 21
12:53:33 2016<br>
homegfs
[b4-gfsib02a]
: Type of
crawl: INDEX<br>
homegfs
[b4-gfsib02a]
: No. of
entries
healed
: 0<br>
homegfs
[b4-gfsib02a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b4-gfsib02a]
: No. of heal
failed
entries : 1<br>
<br>
homegfs
[b5-gfsib02b]
: Starting
time of
crawl :
Thu Jan 21
12:53:14 2016<br>
homegfs
[b5-gfsib02b]
: Ending time
of
crawl
: Thu Jan 21
12:53:15 2016<br>
homegfs
[b5-gfsib02b]
: Type of
crawl: INDEX<br>
homegfs
[b5-gfsib02b]
: No. of
entries
healed
: 0<br>
homegfs
[b5-gfsib02b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b5-gfsib02b]
: No. of heal
failed
entries : 3<br>
<br>
homegfs
[b6-gfsib02a]
: Starting
time of
crawl :
Thu Jan 21
12:53:04 2016<br>
homegfs
[b6-gfsib02a]
: Ending time
of
crawl
: Thu Jan 21
12:53:04 2016<br>
homegfs
[b6-gfsib02a]
: Type of
crawl: INDEX<br>
homegfs
[b6-gfsib02a]
: No. of
entries
healed
: 0<br>
homegfs
[b6-gfsib02a]
: No. of
entries in
split-brain: 0<br>
homegfs
[b6-gfsib02a]
: No. of heal
failed
entries : 0<br>
<br>
homegfs
[b7-gfsib02b]
: Starting
time of
crawl :
Thu Jan 21
12:53:09 2016<br>
homegfs
[b7-gfsib02b]
: Ending time
of
crawl
: Thu Jan 21
12:53:09 2016<br>
homegfs
[b7-gfsib02b]
: Type of
crawl: INDEX<br>
homegfs
[b7-gfsib02b]
: No. of
entries
healed
: 0<br>
homegfs
[b7-gfsib02b]
: No. of
entries in
split-brain: 0<br>
homegfs
[b7-gfsib02b]
: No. of heal
failed
entries : 0<br>
<br>
*** gluster
bug in
'gluster
volume heal
homegfs
statistics'
***<br>
*** Use
'gluster
volume heal
homegfs info'
until bug is
fixed ***<span><br>
<br>
Brick
gfs01a.corvidtec.com:/data/brick01a/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs01b.corvidtec.com:/data/brick01b/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs01a.corvidtec.com:/data/brick02a/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs01b.corvidtec.com:/data/brick02b/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs02a.corvidtec.com:/data/brick01a/homegfs/<br>
</span>/users/bangell/.gconfd
- Is in
split-brain<br>
<br>
Number of
entries: 1<br>
<br>
Brick
gfs02b.corvidtec.com:/data/brick01b/homegfs/<br>
/users/bangell/.gconfd
- Is in
split-brain<br>
<br>
/users/bangell/.gconfd/saved_state
<br>
Number of
entries: 2<span><br>
<br>
Brick
gfs02a.corvidtec.com:/data/brick02a/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs02b.corvidtec.com:/data/brick02b/homegfs/<br>
Number of
entries: 0<br>
<br>
</span></div>
<div><br>
<br>
</div>
</div>
<div>
<div>
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Thu, Jan 21,
2016 at 11:10
AM, Pranith
Kumar
Karampuri <span
dir="ltr"><<a
href="mailto:pkarampu@redhat.com" moz-do-not-send="true">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="PADDING-LEFT:
1ex;
BORDER-LEFT:
#ccc 1px
solid; MARGIN:
0px 0px 0px
0.8ex">
<div
text="#000000"
bgcolor="#FFFFFF"><span><br>
<br>
<div>On
01/21/2016
09:26 PM,
Glomski,
Patrick wrote:<br>
</div>
<blockquote
class="cite"
type="cite">
<div dir="ltr">
<div>I should
mention that
the problem is
not currently
occurring and
there are no
heals (output
appended). By
restarting the
gluster
services, we
can stop the
crawl, which
lowers the
load for a
while.
Subsequent
crawls seem to
finish
properly. For
what it's
worth,
files/folders
that show up
in the 'volume
info' output
during a hung
crawl don't
seem to be
anything out
of the
ordinary. <br>
<br>
Over the past
four days, the
typical time
before the
problem recurs
after
suppressing it
in this manner
is an hour.
Last night
when we
reached out to
you was the
last time it
happened and
the load has
been low since
(a relief).
David believes
that
recursively
listing the
files (ls -alR
or similar)
from a client
mount can
force the
issue to
happen, but
obviously I'd
rather not
unless we have
some precise
thing we're
looking for.
Let me know if
you'd like me
to attempt to
drive the
system
unstable like
that and what
I should look
for. As it's a
production
system, I'd
rather not
leave it in
this state for
long.<br>
</div>
</div>
</blockquote>
<br>
</span>Will it
be possible to
send the
glustershd and
mount logs of
the past 4
days? I would
like to see if
this is
because of
directory
self-heal
going wild
(Ravi is
working on a
throttling
feature for
3.8, which
will allow
putting brakes
on self-heal
traffic)<span><font
color="#888888"><br>
<br>
Pranith</font></span>
<div>
<div><br>
<blockquote
class="cite"
type="cite">
<div dir="ltr">
<div><br>
</div>
<div>[root@gfs01a
xattrop]#
gluster volume
heal homegfs
info<br>
Brick
gfs01a.corvidtec.com:/data/brick01a/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs01b.corvidtec.com:/data/brick01b/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs01a.corvidtec.com:/data/brick02a/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs01b.corvidtec.com:/data/brick02b/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs02a.corvidtec.com:/data/brick01a/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs02b.corvidtec.com:/data/brick01b/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs02a.corvidtec.com:/data/brick02a/homegfs/<br>
Number of
entries: 0<br>
<br>
Brick
gfs02b.corvidtec.com:/data/brick02b/homegfs/<br>
Number of
entries: 0<br>
<br>
<br>
<br>
</div>
</div>
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Thu, Jan 21,
2016 at 10:40
AM, Pranith
Kumar
Karampuri <span
dir="ltr"><<a
href="mailto:pkarampu@redhat.com" moz-do-not-send="true">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="PADDING-LEFT:
1ex;
BORDER-LEFT:
#ccc 1px
solid; MARGIN:
0px 0px 0px
0.8ex">
<div
text="#000000"
bgcolor="#FFFFFF"><span><br>
<br>
<div>On
01/21/2016
08:25 PM,
Glomski,
Patrick wrote:<br>
</div>
<blockquote
class="cite"
type="cite">
<div dir="ltr">
<div>Hello,
Pranith. The
typical
behavior is
that the %cpu
on a
glusterfsd
process jumps
to number of
processor
cores
available
(800% or
1200%,
depending on
the pair of
nodes
involved) and
the load
average on the
machine goes
very high
(~20). The
volume's heal
statistics
output shows
that it is
crawling one
of the bricks
and trying to
heal, but this
crawl hangs
and never
seems to
finish.<br>
</div>
</div>
</blockquote>
<blockquote
class="cite"
type="cite">
<div dir="ltr">
<div><br>
</div>
The number of
files in the
xattrop
directory
varies over
time, so I ran
a wc -l as you
requested
periodically
for some time
and then
started
including a
datestamped
list of the
files that
were in the
xattrops
directory on
each brick to
see which were
persistent.
All bricks had
files in the
xattrop
folder, so all
results are
attached.<br>
</div>
</blockquote>
</span>Thanks,
this info is
helpful. I
don't see a
lot of files.
Could you give
the output of
"gluster
volume heal
<volname>
info"? Is
there any
directory in
there which is
LARGE?<span><font
color="#888888"><br>
<br>
Pranith</font></span>
<div>
<div><br>
<blockquote
class="cite"
type="cite">
<div dir="ltr">
<div><br>
</div>
<div>Please
let me know if
there is
anything else
I can provide.<br>
</div>
<div><br>
</div>
<div>Patrick<br>
</div>
<div><br>
</div>
</div>
<div
class="gmail_extra"><br>
<div
class="gmail_quote">On
Thu, Jan 21,
2016 at 12:01
AM, Pranith
Kumar
Karampuri <span
dir="ltr"><<a
href="mailto:pkarampu@redhat.com" moz-do-not-send="true">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="PADDING-LEFT:
1ex;
BORDER-LEFT:
#ccc 1px
solid; MARGIN:
0px 0px 0px
0.8ex">
<div
text="#000000"
bgcolor="#FFFFFF">hey,<br>
Which
process is
consuming so
much cpu? I
went through
the logs you
gave me. I see
that the
following
files are in
gfid mismatch
state:<br>
<br>
<066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,<br>
<1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,<br>
<ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,<br>
<br>
Could you give
me the output
of "ls
<brick-path>/indices/xattrop
| wc -l"
output on all
the bricks
which are
acting this
way? This will
tell us the
number of
pending
self-heals on
the system.<br>
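The per-brick counts Pranith asks for can be gathered in one pass (a sketch; brick paths are the ones from 'gluster volume info homegfs' shown below, and each server hosts only its own subset):

```shell
# Count pending self-heal index entries under each brick's xattrop
# directory. The base path is a parameter so the same helper can be
# pointed at whichever server's brick root is being checked.
count_xattrop() {
    for brick in "$1"/brick*/homegfs; do
        idx=$brick/.glusterfs/indices/xattrop
        [ -d "$idx" ] || continue
        printf '%s %s\n' "$brick" "$(ls "$idx" | wc -l)"
    done
}

count_xattrop /data
```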
<br>
Pranith
<div>
<div><br>
<br>
<div>On
01/20/2016
09:26 PM,
David Robinson
wrote:<br>
</div>
</div>
</div>
<blockquote
class="cite"
type="cite">
<div>
<div>
<div>resending
with parsed
logs... </div>
<div> </div>
<div>
<blockquote
class="cite"
cite="http://em5ee26b0e-002a-4230-bdec-3020b98cff3c@dfrobins-vaio"
type="cite">
<div> </div>
<div> </div>
<div>
<blockquote
class="cite"
cite="http://eme3b2cb80-8be2-4fa5-9d08-4710955e237c@dfrobins-vaio"
type="cite">
<div>I am
having issues
with 3.6.6
where the load
will spike up
to 800% for
one of the
glusterfsd
processes and
the users can
no longer
access the
system. If I
reboot the
node, the heal
will finish
normally after
a few minutes
and the system
will be
responsive,
but a few
hours later
the issue will
start again.
It looks like it is hanging in a heal and spinning up the load on one of the bricks. The heal gets stuck, says it is crawling, and never returns.
After a few
minutes of the
heal saying it
is crawling,
the load
spikes up and
the mounts
become
unresponsive.</div>
<div> </div>
<div>Any suggestions on how to fix this? It has us stopped cold, as the users can no longer access the systems when the load spikes... Logs attached.</div>
<div> </div>
<div>System
setup info is:
</div>
<div> </div>
<div>[root@gfs01a
~]# gluster
volume info
homegfs<br>
<br>
Volume Name:
homegfs<br>
Type:
Distributed-Replicate<br>
Volume ID:
1e32672a-f1b7-4b58-ba94-58c085e59071<br>
Status:
Started<br>
Number of
Bricks: 4 x 2
= 8<br>
Transport-type:
tcp<br>
Bricks:<br>
Brick1:
gfsib01a.corvidtec.com:/data/brick01a/homegfs<br>
Brick2:
gfsib01b.corvidtec.com:/data/brick01b/homegfs<br>
Brick3:
gfsib01a.corvidtec.com:/data/brick02a/homegfs<br>
Brick4:
gfsib01b.corvidtec.com:/data/brick02b/homegfs<br>
Brick5:
gfsib02a.corvidtec.com:/data/brick01a/homegfs<br>
Brick6:
gfsib02b.corvidtec.com:/data/brick01b/homegfs<br>
Brick7:
gfsib02a.corvidtec.com:/data/brick02a/homegfs<br>
Brick8:
gfsib02b.corvidtec.com:/data/brick02b/homegfs<br>
Options
Reconfigured:<br>
performance.io-thread-count:
32<br>
performance.cache-size:
128MB<br>
performance.write-behind-window-size:
128MB<br>
server.allow-insecure:
on<br>
network.ping-timeout:
42<br>
storage.owner-gid:
100<br>
geo-replication.indexing:
off<br>
geo-replication.ignore-pid-check:
on<br>
changelog.changelog:
off<br>
changelog.fsync-interval:
3<br>
changelog.rollover-time:
15<br>
server.manage-gids:
on<br>
diagnostics.client-log-level:
WARNING</div>
<div> </div>
<div>[root@gfs01a
~]# rpm -qa |
grep gluster<br>
gluster-nagios-common-0.1.1-0.el6.noarch<br>
glusterfs-fuse-3.6.6-1.el6.x86_64<br>
glusterfs-debuginfo-3.6.6-1.el6.x86_64<br>
glusterfs-libs-3.6.6-1.el6.x86_64<br>
glusterfs-geo-replication-3.6.6-1.el6.x86_64<br>
glusterfs-api-3.6.6-1.el6.x86_64<br>
glusterfs-devel-3.6.6-1.el6.x86_64<br>
glusterfs-api-devel-3.6.6-1.el6.x86_64<br>
glusterfs-3.6.6-1.el6.x86_64<br>
glusterfs-cli-3.6.6-1.el6.x86_64<br>
glusterfs-rdma-3.6.6-1.el6.x86_64<br>
samba-vfs-glusterfs-4.1.11-2.el6.x86_64<br>
glusterfs-server-3.6.6-1.el6.x86_64<br>
glusterfs-extra-xlators-3.6.6-1.el6.x86_64<br>
</div>
<div> </div>
<div>
<div
style="FONT-SIZE:
12pt;
FONT-FAMILY:
Times
New
Roman"><span><span>
<div> </div>
</span></span></div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br>
<fieldset></fieldset>
<br>
</div>
</div>
<pre>_______________________________________________
Gluster-devel mailing list
<a href="mailto:Gluster-devel@gluster.org" moz-do-not-send="true">Gluster-devel@gluster.org</a>
<a href="http://www.gluster.org/mailman/listinfo/gluster-devel" moz-do-not-send="true">http://www.gluster.org/mailman/listinfo/gluster-devel</a></pre>
</blockquote>
<br>
</div>
<br>
_______________________________________________<br>
Gluster-users
mailing list<br>
<a
href="mailto:Gluster-users@gluster.org"
moz-do-not-send="true">Gluster-users@gluster.org</a><br>
<a
href="http://www.gluster.org/mailman/listinfo/gluster-users"
rel="noreferrer" moz-do-not-send="true">http://www.gluster.org/mailman/listinfo/gluster-users</a><br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>