You guys use more than 128 clients, don't you? We recently found a
memory corruption in client-table, which is used in locking, and I
wonder whether it plays a role here.
http://review.gluster.org/13241 is the fix. Could you check whether you
still see this issue with that fix applied?

Pranith
<div class="moz-cite-prefix">On 01/22/2016 08:36 AM, Glomski,
Patrick wrote:<br>
</div>
<blockquote
cite="mid:CALkMjdCZRYOvhNGOrCFS9v6Y-vOhX2do0HA-N=CpMf1OBo4+dg@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>Pranith, attached are stack traces collected every second
for 20 seconds from the high-%cpu glusterfsd process.<br>
<br>
</div>
Patrick<br>
</div>
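
(For reference, a minimal sketch of the kind of loop that could produce
such a capture; it assumes gstack from the gdb package is installed and
that a single glusterfsd is the busy one. The thread does not show the
exact commands Patrick used, and the output paths here are invented.)

    # Hypothetical sketch: sample a busy glusterfsd once per second for 20s.
    PID=$(pgrep -o glusterfsd)   # assumes one brick process; adjust if not
    for i in $(seq 1 20); do
        gstack "$PID" > "/tmp/glusterfsd.gstack.$i" 2>&1
        sleep 1
    done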
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Jan 21, 2016 at 9:46 PM,
Glomski, Patrick <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:patrick.glomski@corvidtec.com"
target="_blank">patrick.glomski@corvidtec.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>Last entry for get_real_filename on any of the bricks
was when we turned off the samba gfapi vfs plugin
earlier today:<br>
<br>
/var/log/glusterfs/bricks/data-brick01a-homegfs.log:[2016-01-21
15:13:00.008239] E
[server-rpc-fops.c:768:server_getxattr_cbk]
0-homegfs-server: 105: GETXATTR /wks_backup
(40e582d6-b0c7-4099-ba88-9168a3c32ca6)
(glusterfs.get_real_filename:desktop.ini) ==>
(Permission denied)<br>
<br>
</div>

We'll get back to you with those traces when %cpu spikes again. As with
most sporadic problems, as soon as you want something out of it, the
issue becomes harder to reproduce.
<div class="HOEnZb">
<div class="h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Jan 21, 2016 at 9:21
PM, Pranith Kumar Karampuri <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:pkarampu@redhat.com"
target="_blank">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span> <br>
<br>
<div>On 01/22/2016 07:25 AM, Glomski, Patrick
wrote:<br>
</div>
</span><span>
<blockquote type="cite">
<div dir="ltr">Unfortunately, all samba
mounts to the gluster volume through the
gfapi vfs plugin have been disabled for
the last 6 hours or so and frequency of
%cpu spikes is increased. We had switched
to sharing a fuse mount through samba, but
I just disabled that as well. There are no
samba shares of this volume now. The
spikes now happen every thirty minutes or
so. We've resorted to just rebooting the
machine with high load for the present.<br>
</div>
</blockquote>

Could you check whether log entries of the following type have stopped
appearing entirely?

[2016-01-21 15:13:00.005736] E [server-rpc-fops.c:768:server_getxattr_cbk] 0-homegfs-server: 110: GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c32ca6) (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)

These are the operations that failed. The operations that succeed are
the ones that scan the directory, but I don't have a way to find those
other than using tcpdumps.

At the moment I have two theories:
1) these get_real_filename calls
2) [2016-01-21 16:10:38.017828] E [server-helpers.c:46:gid_resolve] 0-gid-cache: getpwuid_r(494) failed
"<br>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Yessir
they are. Normally, sssd would look to the
local cache file in /var/lib/sss/db/ first,
to get any group or userid information, then
go out to the domain controller. I put the
options that we are using on our GFS volumes
below… Thanks for your help.</span></p>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"> </span></p>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">We
had been running sssd with sssd_nss and
sssd_be sub-processes on these systems for a
long time, under the GFS 3.5.2 code, and not
run into the problem that David described
with the high cpu usage on sssd_nss.</span></p>
<b><span>"<br>

That was Tom Young's email from 1.5 years back when we debugged this.
But the process consuming a lot of cpu then was sssd_nss, so I am not
sure it is the same issue. Let us first verify that theory 1) isn't
what's happening. The gstack traces I asked for should also help.

Pranith
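
(An aside on theory 2), not from the thread itself: a quick way to check
whether the failing uid resolves through sssd at all, and how quickly,
is getent. The uid 494 is taken from the gid_resolve log line above.)

    # Does uid 494 resolve, and does sssd answer quickly?
    time getent passwd 494
    # Compare against the local files database, bypassing sssd:
    getent -s files passwd 494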
<blockquote type="cite">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Jan 21,
2016 at 8:49 PM, Pranith Kumar
Karampuri <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:pkarampu@redhat.com"
target="_blank">pkarampu@redhat.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div bgcolor="#FFFFFF"
text="#000000"><span> <br>
<br>
<div>On 01/22/2016 07:13 AM,
Glomski, Patrick wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">We use the
samba glusterfs virtual
filesystem (the current
version provided on <a
moz-do-not-send="true"
href="http://download.gluster.org"
target="_blank">download.gluster.org</a>),
but no windows clients
connecting directly.<br>
</div>
</blockquote>

Hmm.. is there a way to disable this and check whether the CPU% still
increases? What a getxattr of "glusterfs.get_real_filename <filename>"
does is scan the entire directory, running
strcasecmp(<filename>, <scanned-filename>) against every entry; if
anything matches, it returns the <scanned-filename>. The problem is that
this scan is costly, so I wonder if it is the reason for the CPU spikes.

Pranith
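
(To make the cost concrete, the brick-side lookup is roughly equivalent
to this sketch; illustrative only, since the real implementation is C
code inside gluster, and the function name here is invented. Uses bash 4
syntax for the lowercase expansion.)

    # Illustrative only: one get_real_filename getxattr costs a full
    # scan of the directory, entry by entry.
    real_filename() {
        local dir=$1 wanted=$2 entry
        for entry in "$dir"/*; do
            entry=${entry##*/}
            # strcasecmp-style comparison against every scanned filename
            if [ "${entry,,}" = "${wanted,,}" ]; then
                printf '%s\n' "$entry"
                return 0
            fi
        done
        return 1   # not found, but the scan still touched every entry
    }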

On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri
<pkarampu@redhat.com> wrote:
<div bgcolor="#FFFFFF"
text="#000000"> Do
you have any windows
clients? I see a lot
of getxattr calls
for
"glusterfs.get_real_filename"
which lead to full
readdirs of the
directories on the
brick.<span><font
color="#888888"><br>
<br>
Pranith</font></span><span><br>
<br>
<div>On 01/22/2016
12:51 AM,
Glomski, Patrick
wrote:<br>
</div>
</span>
Pranith, could this kind of behavior be self-inflicted by us deleting
files directly from the bricks? We have done that in the past to clean
up an issue where gluster wouldn't allow us to delete from the mount.

If so, is it feasible to clean them up by running a search on the
.glusterfs directories directly and removing files with a link count of
1 that are non-zero size (or directly checking the xattrs to be sure
that a file is not a DHT link)?

find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 -exec rm -f "{}" \;

Is there anything I'm inherently missing with that approach that will
further corrupt the system?
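
(A hedged sketch of the xattr check mentioned above: list candidates and
skip anything carrying the DHT linkto xattr, rather than deleting
blindly. Verify on a test brick first; DHT link files are normally
zero-length, so the -not -empty filter should already exclude most of
them.)

    # List candidate orphans under .glusterfs, skipping DHT link files.
    find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 -print0 |
    while IFS= read -r -d '' f; do
        if getfattr -n trusted.glusterfs.dht.linkto --absolute-names "$f" >/dev/null 2>&1; then
            continue    # DHT link file: leave it alone
        fi
        printf '%s\n' "$f"   # review this list before removing anything
    done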

On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick
<patrick.glomski@corvidtec.com> wrote:
<div dir="ltr">
<div>
<div>Load
spiked again:
~1200%cpu on
gfs02a for
glusterfsd.
Crawl has been
running on one
of the bricks
on gfs02b for
25 min or so
and users
cannot access
the volume.<br>
<br>
I re-listed
the xattrop
directories as
well as a
'top' entry
and heal
statistics.
Then I
restarted the
gluster
services on
gfs02a. <br>
<br>
===================
top
===================<br>
PID USER
PR NI VIRT
RES SHR S
%CPU %MEM
TIME+
COMMAND
<br>
8969
root 20
0 2815m 204m
3588 S 1181.0
0.6 591:06.93
glusterfsd
<br>
<br>

=================== xattrop ===================
/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-41f19453-91e4-437c-afa9-3b25614de210  xattrop-9b815879-2f4d-402b-867c-a6d65087788c

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-70131855-3cfb-49af-abce-9d23f57fb393  xattrop-dfb77848-a39d-4417-a725-9beca75d78c6

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
e6e47ed9-309b-42a7-8c44-28c29b9a20f8          xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934  xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc  xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413

/data/brick01a/homegfs/.glusterfs/indices/xattrop:
xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531

/data/brick02a/homegfs/.glusterfs/indices/xattrop:
xattrop-7e20fdb1-5224-4b9a-be06-568708526d70

/data/brick01b/homegfs/.glusterfs/indices/xattrop:
8034bc06-92cd-4fa5-8aaf-09039e79d2c8          c9ce22ed-6d8b-471b-a111-b39e57f0b512
94fa1d60-45ad-4341-b69c-315936b51e8d          xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7

/data/brick02b/homegfs/.glusterfs/indices/xattrop:
xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d

=================== heal stats ===================

homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:36:45 2016
homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:36:45 2016
homegfs [b0-gfsib01a] : Type of crawl: INDEX
homegfs [b0-gfsib01a] : No. of entries healed        : 0
homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
homegfs [b0-gfsib01a] : No. of heal failed entries   : 0

homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:36:19 2016
homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:36:19 2016
homegfs [b1-gfsib01b] : Type of crawl: INDEX
homegfs [b1-gfsib01b] : No. of entries healed        : 0
homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
homegfs [b1-gfsib01b] : No. of heal failed entries   : 1

homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:36:48 2016
homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:36:48 2016
homegfs [b2-gfsib01a] : Type of crawl: INDEX
homegfs [b2-gfsib01a] : No. of entries healed        : 0
homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
homegfs [b2-gfsib01a] : No. of heal failed entries   : 0

homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:36:47 2016
homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:36:47 2016
homegfs [b3-gfsib01b] : Type of crawl: INDEX
homegfs [b3-gfsib01b] : No. of entries healed        : 0
homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
homegfs [b3-gfsib01b] : No. of heal failed entries   : 0

homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:36:06 2016
homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:36:06 2016
homegfs [b4-gfsib02a] : Type of crawl: INDEX
homegfs [b4-gfsib02a] : No. of entries healed        : 0
homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
homegfs [b4-gfsib02a] : No. of heal failed entries   : 0

homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:13:40 2016
homegfs [b5-gfsib02b] : *** Crawl is in progress ***
homegfs [b5-gfsib02b] : Type of crawl: INDEX
homegfs [b5-gfsib02b] : No. of entries healed        : 0
homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
homegfs [b5-gfsib02b] : No. of heal failed entries   : 0

homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:36:58 2016
homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:36:58 2016
homegfs [b6-gfsib02a] : Type of crawl: INDEX
homegfs [b6-gfsib02a] : No. of entries healed        : 0
homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
homegfs [b6-gfsib02a] : No. of heal failed entries   : 0

homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:36:50 2016
homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21 12:36:50 2016
homegfs [b7-gfsib02b] : Type of crawl: INDEX
homegfs [b7-gfsib02b] : No. of entries healed        : 0
homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
homegfs [b7-gfsib02b] : No. of heal failed entries   : 0

========================================================================================

I waited a few minutes for the heals to finish and ran the heal
statistics and info again. One file is in split-brain. Aside from the
split-brain, the load on all systems is down now and they are behaving
normally. glustershd.log is attached. What is going on???

Thu Jan 21 12:53:50 EST 2016

=================== homegfs ===================

homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:53:02 2016
homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:53:02 2016
homegfs [b0-gfsib01a] : Type of crawl: INDEX
homegfs [b0-gfsib01a] : No. of entries healed        : 0
homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
homegfs [b0-gfsib01a] : No. of heal failed entries   : 0

homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:53:38 2016
homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:53:38 2016
homegfs [b1-gfsib01b] : Type of crawl: INDEX
homegfs [b1-gfsib01b] : No. of entries healed        : 0
homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
homegfs [b1-gfsib01b] : No. of heal failed entries   : 1

homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:53:04 2016
homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:53:04 2016
homegfs [b2-gfsib01a] : Type of crawl: INDEX
homegfs [b2-gfsib01a] : No. of entries healed        : 0
homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
homegfs [b2-gfsib01a] : No. of heal failed entries   : 0

homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:53:04 2016
homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:53:04 2016
homegfs [b3-gfsib01b] : Type of crawl: INDEX
homegfs [b3-gfsib01b] : No. of entries healed        : 0
homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
homegfs [b3-gfsib01b] : No. of heal failed entries   : 0

homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:53:33 2016
homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:53:33 2016
homegfs [b4-gfsib02a] : Type of crawl: INDEX
homegfs [b4-gfsib02a] : No. of entries healed        : 0
homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
homegfs [b4-gfsib02a] : No. of heal failed entries   : 1

homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:53:14 2016
homegfs [b5-gfsib02b] : Ending time of crawl         : Thu Jan 21 12:53:15 2016
homegfs [b5-gfsib02b] : Type of crawl: INDEX
homegfs [b5-gfsib02b] : No. of entries healed        : 0
homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
homegfs [b5-gfsib02b] : No. of heal failed entries   : 3

homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:53:04 2016
homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:53:04 2016
homegfs [b6-gfsib02a] : Type of crawl: INDEX
homegfs [b6-gfsib02a] : No. of entries healed        : 0
homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
homegfs [b6-gfsib02a] : No. of heal failed entries   : 0

homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:53:09 2016
homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21 12:53:09 2016
homegfs [b7-gfsib02b] : Type of crawl: INDEX
homegfs [b7-gfsib02b] : No. of entries healed        : 0
homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
homegfs [b7-gfsib02b] : No. of heal failed entries   : 0

*** gluster bug in 'gluster volume heal homegfs statistics' ***
*** Use 'gluster volume heal homegfs info' until bug is fixed ***

Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
/users/bangell/.gconfd - Is in split-brain

Number of entries: 1

Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
/users/bangell/.gconfd - Is in split-brain

/users/bangell/.gconfd/saved_state
Number of entries: 2

Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri
<pkarampu@redhat.com> wrote:

On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
I should mention that the problem is not currently occurring and there
are no heals (output appended). By restarting the gluster services we
can stop the crawl, which lowers the load for a while, and subsequent
crawls seem to finish properly. For what it's worth, files/folders that
show up in the 'volume info' output during a hung crawl don't seem to be
anything out of the ordinary.

Over the past four days, the typical time before the problem recurs
after suppressing it in this manner is an hour. Last night when we
reached out to you was the last time it happened, and the load has been
low since (a relief). David believes that recursively listing the files
(ls -alR or similar) from a client mount can force the issue to happen,
but obviously I'd rather not unless we have some precise thing we're
looking for. Let me know if you'd like me to attempt to drive the system
unstable like that and what I should look for. As it's a production
system, I'd rather not leave it in this state for long.

Will it be possible to send the glustershd and mount logs for the past
four days? I would like to see whether this is caused by directory
self-heal going wild. (Ravi is working on a throttling feature for 3.8,
which will make it possible to put the brakes on self-heal traffic.)

Pranith

[root@gfs01a xattrop]# gluster volume heal homegfs info
Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
Number of entries: 0

Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
Number of entries: 0

Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
Number of entries: 0

On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri
<pkarampu@redhat.com> wrote:

On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
Hello, Pranith. The typical behavior is that the %cpu on a glusterfsd
process jumps to the number of processor cores available (800% or 1200%,
depending on the pair of nodes involved) and the load average on the
machine goes very high (~20). The volume's heal statistics output shows
that it is crawling one of the bricks and trying to heal, but this crawl
hangs and never seems to finish.

The number of files in the xattrop directory varies over time, so I ran
a wc -l as you requested periodically for some time, and then started
including a datestamped list of the files that were in the xattrop
directory on each brick to see which were persistent. All bricks had
files in the xattrop folder, so all results are attached.

Thanks, this info is helpful. I don't see a lot of files. Could you give
the output of "gluster volume heal <volname> info"? Is there any
directory in there which is LARGE?

Pranith

Please let me know if there is anything else I can provide.

Patrick

On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri
<pkarampu@redhat.com> wrote:
hey,
       Which process is consuming so much cpu? I went through the logs
you gave me and see that the following files are in gfid mismatch state:

<066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
<1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
<ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,

Could you give me the output of "ls <brick-path>/indices/xattrop | wc -l"
on all the bricks which are acting this way? This will tell us the
number of pending self-heals on the system.

Pranith
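
(On these hosts that could be scripted roughly as below; a sketch only.
The brick paths come from the volume info at the end of this thread, and
the index directory lives under <brick>/.glusterfs/indices/xattrop as
the listings earlier in the thread show. Adjust the list to the bricks
actually present on each host.)

    # Count pending self-heal index entries on every local brick.
    for b in /data/brick01a/homegfs /data/brick01b/homegfs \
             /data/brick02a/homegfs /data/brick02b/homegfs; do
        printf '%s: ' "$b"
        ls "$b/.glusterfs/indices/xattrop" | wc -l
    done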

On 01/20/2016 09:26 PM, David Robinson wrote:
resending with parsed logs...

I am having issues with 3.6.6 where the load will spike up to 800% for
one of the glusterfsd processes and the users can no longer access the
system. If I reboot the node, the heal will finish normally after a few
minutes and the system will be responsive, but a few hours later the
issue will start again. It looks like it is hanging in a heal and
spinning up the load on one of the bricks. The heal gets stuck and says
it is crawling and never returns. After a few minutes of the heal saying
it is crawling, the load spikes up and the mounts become unresponsive.

Any suggestions on how to fix this? It has us stopped cold, as the users
can no longer access the systems when the load spikes... Logs attached.

System setup info is:

[root@gfs01a ~]# gluster volume info homegfs

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 42
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: off
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on
diagnostics.client-log-level: WARNING

[root@gfs01a ~]# rpm -qa | grep gluster
gluster-nagios-common-0.1.1-0.el6.noarch
glusterfs-fuse-3.6.6-1.el6.x86_64
glusterfs-debuginfo-3.6.6-1.el6.x86_64
glusterfs-libs-3.6.6-1.el6.x86_64
glusterfs-geo-replication-3.6.6-1.el6.x86_64
glusterfs-api-3.6.6-1.el6.x86_64
glusterfs-devel-3.6.6-1.el6.x86_64
glusterfs-api-devel-3.6.6-1.el6.x86_64
glusterfs-3.6.6-1.el6.x86_64
glusterfs-cli-3.6.6-1.el6.x86_64
glusterfs-rdma-3.6.6-1.el6.x86_64
samba-vfs-glusterfs-4.1.11-2.el6.x86_64
glusterfs-server-3.6.6-1.el6.x86_64
glusterfs-extra-xlators-3.6.6-1.el6.x86_64