<div dir="ltr">Alright, it seems we're fine now!<br><br><br>We took two actions, and the network issue seems to be gone.<div><br></div><div>1. These servers are VMs at a cloud provider, so I don't really have access to details here. The assigned sysadmin reported that one of my Gluster VMs was on a crowded host, which could have been affecting both load (CPU/memory) and network performance. He moved this one VM to a new (and less crowded) host. The other VM in this Gluster setup was left where it was.</div><div><br></div><div>2. I set up a new internet-isolated subnet between these VMs, allowing me to take the firewall out of the way.</div><div><br></div><div>It seems #1 was the culprit, and #2 was a nice-to-have we picked up along the way.</div><div><br></div><div><br></div><div>Before:</div><div><br></div><div><div style="font-size:12.8000001907349px">root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png</div><div style="font-size:12.8000001907349px">Mon Jan 26 07:00:27 PST 2015</div><div style="font-size:12.8000001907349px">-rwx---r-- 1 mhmadmin mhmadmin 61K Jan 22 14:37 /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png</div><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px">real<span style="white-space:pre-wrap">        </span>0m<b>33.651s</b></div><div style="font-size:12.8000001907349px">user<span style="white-space:pre-wrap">        </span>0m0.001s</div><div style="font-size:12.8000001907349px">sys<span style="white-space:pre-wrap">        </span>0m0.004s</div></div><div><br></div><div><br></div><div>After:</div><div><br></div><div><div>root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1410560.png</div><div>Tue Feb 10 15:28:18 PST 2015</div><div>-rwx---r-- 1 mhmadmin mhmadmin 17K Feb 10 12:41 
/var/www/site-images/templates/assets/prod/temporary/13/user_1410560.png</div><div><br></div><div>real<span class="" style="white-space:pre">        </span>0m<b>0.031s</b></div><div>user<span class="" style="white-space:pre">        </span>0m0.001s</div><div>sys<span class="" style="white-space:pre">        </span>0m0.006s</div></div><div><br></div><div><br></div><div>The case seems closed. If you have any questions I can answer, please let me know.</div><div><br></div><div><br></div><div>Thanks Anirban, Joe and selected audience :)</div><div><br></div><div><br></div><div>-- <br><div class="gmail_signature"><div dir="ltr"><div dir="ltr"><font color="#444444"><b>Tiago Santos</b></font></div></div></div></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 28, 2015 at 2:45 PM, Tiago Santos <span dir="ltr"><<a href="mailto:tiago@musthavemenus.com" target="_blank">tiago@musthavemenus.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Since I stopped writing to the clients (so I could cleanly work on the split-brain), I've had no more entries in /var/log/gluster.log (this is the client log, right?)<div><br></div><div><br></div><div>While working with the diff command to fix the split-brain, I saw several entries like these:<div><br></div><div><div>diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558482: Transport endpoint is not connected</div><div>diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558483: Transport endpoint is not connected</div><div>diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558484: Transport endpoint is not connected</div></div><div><br></div><div>They happen a lot, then stop. 
Then they start again, and so on.</div><div><br></div><div>While the errors are showing, a ping from the system where I'm working on the split-brain to the system that is failing to connect (r2) shows this:</div><div><br></div><div><div>64 bytes from r2-server (r2-ip): icmp_seq=662 ttl=64 time=1.21 ms</div><div>64 bytes from r2-server (r2-ip): icmp_seq=663 ttl=64 time=0.990 ms</div><div>64 bytes from r2-server (r2-ip): icmp_seq=664 ttl=64 time=1.01 ms</div></div><div><br></div><div>I know this is a very basic network check that may not show me what I want to see, and I'm working on a more elaborate one. But I'm completely open to suggestions on how to do this properly, to verify whether this is an issue as far as Gluster is concerned.</div><div><br></div><div><br></div><div>So far, thank you so much, guys!</div><div> <br></div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div></div><div><br></div></div></div><div class="gmail_extra"><div class="h5"><br><div class="gmail_quote">On Mon, Jan 26, 2015 at 8:36 PM, Joe Julian <span dir="ltr"><<a href="mailto:joe@julianfamily.org" target="_blank">joe@julianfamily.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
Check your client logs. Perhaps the client isn't actually connecting
to both servers. <br><div>
<br>
<div>On 01/26/2015 02:12 PM, Tiago Santos
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">That's what I meant. Sorry for the confusion.<br>
<br>
I'm writing on Client1 (same server as Brick1). Client2 (which mounts
Brick2, on server2) has nothing writing to it (so far).
<div><br>
</div>
<div>What I'm wondering is how I ended up with a split-brain if
I'm only writing on one client.<br>
<div><br>
</div>
<div><br>
</div>
<div><br>
<br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Mon, Jan 26, 2015 at 8:04 PM,
Joe Julian <span dir="ltr"><<a href="mailto:joe@julianfamily.org" target="_blank">joe@julianfamily.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Nothing but
GlusterFS should be writing to bricks. Mount a
client and write there.
<div>
<div><br>
<br>
<div>On 01/26/2015 01:38 PM, Tiago Santos wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Right.
<div><br>
</div>
<div>I have Brick1 being constantly written
to, but nothing writing to Brick2. It
just gets "healed" data from Brick1.</div>
<div><br>
</div>
<div>This setup is still not in production,
and there are no applications using that
data. I have rsyncs constantly updating
Brick1 (bringing data from the production
servers), and then Gluster updates Brick2.</div>
<div><br>
</div>
<div>Which makes me wonder how I may be
creating multiple replicas during a
split-brain.</div>
<div><br>
</div>
<div><br>
</div>
<div>Could it be that, during a
split-brain event, I'm updating
versions of the same file on Brick1
(only), and Gluster sees them as
different versions and things get confused?</div>
<div><br>
</div>
<div><br>
</div>
<div>Anyway, while we talk I'm going to run
Joe's precious split-brain recovery
procedure.</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Mon, Jan 26,
2015 at 7:23 PM, Joe Julian <span dir="ltr"><<a href="mailto:joe@julianfamily.org" target="_blank">joe@julianfamily.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Mismatched
GFIDs would happen if a file is created
on multiple replicas during a
split-brain event. The GFID is assigned
at file creation.
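A quick shell sketch of that check (the `same_gfid` helper is hypothetical; the `getfattr` command, brick path and GFID values are the ones reported elsewhere in this thread). Comparing the GFIDs read from each brick is enough to spot it:

```shell
# Hypothetical helper: compare the trusted.gfid values read from the two
# bricks; a mismatch means the file was created independently on each replica.
same_gfid() {
  if [ "$1" = "$2" ]; then echo "gfids match"; else echo "gfid split-brain"; fi
}

# On each server, read the GFID from the brick path (never the client mount):
#   getfattr -n trusted.gfid -e hex --only-values \
#     /export/images1-1/brick/templates/assets/prod/temporary/13/user_1339200.png
# Then compare the two values (these are the ones from this thread):
same_gfid 0x10e5894c474a4cb1898b71e872cdf527 0xd02f14fcb6724ceba4a330eb606910f3
# → gfid split-brain
```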
<div>
<div><br>
<br>
On 01/26/2015 01:04 PM, A Ghoshal
wrote:<br>
</div>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div>
<div> Yep, so it is indeed a
split-brain caused by a mismatch
of the trusted.gfid attribute.<br>
<br>
Sadly, I don't know precisely what
causes it. Communication loss
might be one of the triggers. I am
guessing the files with the
problem are dynamic, correct? In
our setup (also replica 2),
communication is never a problem,
but we do see this when one of the
servers takes a reboot. Maybe some
obscure and difficult-to-understand
race between background
self-heal and the self-heal
daemon...<br>
<br>
In any case, the normal procedure
for split-brain recovery will get
your files back in working order.
It's easy to find on Google. I use
the instructions on Joe Julian's
blog page myself.<br>
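In outline, that recovery amounts to picking one copy as bad, deleting it from that brick together with the hard link GlusterFS keeps under .glusterfs, and letting self-heal recreate it. A minimal sketch, reusing this thread's brick paths and GFIDs (the `gfid_path` helper is ours, not a Gluster tool):

```shell
# Hypothetical helper: map a trusted.gfid value to the hard link GlusterFS
# keeps for it under the brick's .glusterfs directory.
gfid_path() {
  g=${1#0x}
  echo ".glusterfs/${g:0:2}/${g:2:2}/${g:0:8}-${g:8:4}-${g:12:4}-${g:16:4}-${g:20:12}"
}

gfid_path 0xd02f14fcb6724ceba4a330eb606910f3
# → .glusterfs/d0/2f/d02f14fc-b672-4ceb-a4a3-30eb606910f3

# On the brick holding the bad copy, remove both the file and that hard link,
# then stat the file through a client mount so self-heal recreates it:
#   rm /export/images2-1/brick/templates/assets/prod/temporary/13/user_1339200.png
#   rm "/export/images2-1/brick/$(gfid_path 0xd02f14fcb6724ceba4a330eb606910f3)"
```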
<br>
<br>
-----Tiago Santos <<a href="mailto:tiago@musthavemenus.com" target="_blank">tiago@musthavemenus.com</a>>
wrote: -----<br>
<br>
=======================<br>
To: A Ghoshal <<a href="mailto:a.ghoshal@tcs.com" target="_blank">a.ghoshal@tcs.com</a>><br>
From: Tiago Santos <<a href="mailto:tiago@musthavemenus.com" target="_blank">tiago@musthavemenus.com</a>><br>
Date: 01/27/2015 02:11AM<br>
Cc: gluster-users <<a href="mailto:gluster-users@gluster.org" target="_blank">gluster-users@gluster.org</a>><br>
Subject: Re: [Gluster-users]
Pretty much any operation related
to Gluster mounted fs hangs for a
while<br>
=======================<br>
Oh, right!<br>
<br>
Follow the outputs:<br>
<br>
<br>
root@web3:/export/images1-1/brick#
time getfattr -m . -d -e hex<br>
templates/assets/prod/temporary/13/user_1339200.png<br>
# file:
templates/assets/prod/temporary/13/user_1339200.png<br>
trusted.afr.site-images-client-0=0x000000000000000400000000<br>
trusted.afr.site-images-client-1=0x000000020000000900000000<br>
trusted.gfid=0x10e5894c474a4cb1898b71e872cdf527<br>
<br>
real 0m0.024s<br>
user 0m0.001s<br>
sys 0m0.001s<br>
<br>
<br>
<br>
root@web4:/export/images2-1/brick#
time getfattr -m . -d -e hex<br>
templates/assets/prod/temporary/13/user_1339200.png<br>
# file:
templates/assets/prod/temporary/13/user_1339200.png<br>
trusted.afr.site-images-client-0=0x000000000000000000000000<br>
trusted.afr.site-images-client-1=0x000000000000000000000000<br>
trusted.gfid=0xd02f14fcb6724ceba4a330eb606910f3<br>
<br>
real 0m0.003s<br>
user 0m0.000s<br>
sys 0m0.006s<br>
<br>
<br>
I'm not sure exactly what that means.
I'm googling, and would appreciate it
if you<br>
guys can shed some light.<br>
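For reference, each trusted.afr.* value packs three 32-bit counters of pending operations, in the order data, metadata, entry. A small sketch to split one apart (the `decode_afr` helper is hypothetical):

```shell
# Hypothetical helper: split a trusted.afr.* changelog value into its three
# 32-bit pending-operation counters (data, metadata, entry).
decode_afr() {
  v=${1#0x}
  printf 'data=%d metadata=%d entry=%d\n' \
    "0x${v:0:8}" "0x${v:8:8}" "0x${v:16:8}"
}

decode_afr 0x000000020000000900000000   # web3's view of client-1
# → data=2 metadata=9 entry=0
```

Non-zero pending counters on web3, all zeros on web4, and two different trusted.gfid values together point to a GFID split-brain rather than a simple unfinished heal.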
<br>
Thanks!<br>
--<br>
Tiago<br>
<br>
<br>
<br>
<br>
On Mon, Jan 26, 2015 at 6:16 PM, A
Ghoshal <<a href="mailto:a.ghoshal@tcs.com" target="_blank">a.ghoshal@tcs.com</a>>
wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Actually, you ran getfattr on the
mounted volume - which is why the
requisite<br>
extended attributes never showed
up...<br>
<br>
Your bricks are mounted
elsewhere:<br>
/export/images1-1/brick and
/export/images2-1/brick<br>
<br>
By the way, what version of Linux do
you use? And are the files you
observe the<br>
input/output errors on
soft links?<br>
<br>
-----Tiago Santos <<a href="mailto:tiago@musthavemenus.com" target="_blank">tiago@musthavemenus.com</a>>
wrote: -----<br>
<br>
=======================<br>
To: A Ghoshal <<a href="mailto:a.ghoshal@tcs.com" target="_blank">a.ghoshal@tcs.com</a>><br>
From: Tiago Santos <<a href="mailto:tiago@musthavemenus.com" target="_blank">tiago@musthavemenus.com</a>><br>
Date: 01/27/2015 12:20AM<br>
Cc: gluster-users <<a href="mailto:gluster-users@gluster.org" target="_blank">gluster-users@gluster.org</a>><br>
Subject: Re: [Gluster-users]
Pretty much any operation
related to Gluster<br>
mounted fs hangs for a while<br>
=======================<br>
Thanks for your input,
Anirban.<br>
<br>
I ran the commands on both
servers, with the following
results:<br>
<br>
<br>
root@web3:/var/www/site-images#
time getfattr -m . -d -e hex<br>
templates/assets/prod/temporary/13/user_1339200.png<br>
<br>
real 0m34.524s<br>
user 0m0.004s<br>
sys 0m0.000s<br>
<br>
<br>
root@web4:/var/www/site-images#
time getfattr -m . -d -e hex<br>
templates/assets/prod/temporary/13/user_1339200.png<br>
getfattr:
templates/assets/prod/temporary/13/user_1339200.png:
Input/output<br>
error<br>
<br>
real 0m11.315s<br>
user 0m0.001s<br>
sys 0m0.003s<br>
root@web4:/var/www/site-images#
ls<br>
templates/assets/prod/temporary/13/user_1339200.png<br>
ls: cannot access
templates/assets/prod/temporary/13/user_1339200.png:<br>
Input/output error<br></blockquote></div></div></blockquote><div></div></blockquote></div></div></blockquote></div></div></div></blockquote></div></div></div></div></div></blockquote></div></div></blockquote></div></div></div></blockquote></div>
</div></div>