<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 8, 2016 at 8:02 PM, Jeff Darcy <span dir="ltr"><<a href="mailto:jdarcy@redhat.com" target="_blank">jdarcy@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> In either of these situations, one glusterfsd process on whatever peer the<br>
> client is currently talking to will skyrocket to *nproc* cpu usage (800%,<br>
> 1600%) and the storage cluster is essentially useless; all other clients<br>
> will eventually try to read or write data to the overloaded peer and, when<br>
> that happens, their connection will hang. Heals between peers hang because<br>
> the load on the peer is around 1.5x the number of cores or more. This occurs<br>
> in either gluster 3.6 or 3.7, is very repeatable, and happens much too<br>
> frequently.<br>
<br>
> I have some good news and some bad news.
>
> The good news is that features to address this are already planned for the
> 4.0 release. Primarily I'm referring to QoS enhancements, some parts of
> which were already implemented for the bitrot daemon. I'm still working
> out the exact requirements for this as a general facility, though. You
> can help! :) Also, some of the work on "brick multiplexing" (multiple
> bricks within one glusterfsd process) should help to prevent the thrashing
> that causes a complete freeze-up.
>
> Now for the bad news. Did I mention that these are 4.0 features? 4.0 is
> not near term, and not getting any nearer as other features and releases
> keep "jumping the queue" to absorb all of the resources we need for 4.0
> to happen. Not that I'm bitter or anything. ;) To address your more
> immediate concerns, I think we need to consider more modest changes that
> can be completed in more modest time. For example:
>
> * The load should *never* get to 1.5x the number of cores. Perhaps we
>   could tweak the thread-scaling code in io-threads and epoll to check
>   system load and not scale up (or even scale down) if system load is
>   already high.
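
For the load check here, something as simple as gating scale-up on
getloadavg() might go a long way. A rough sketch of what I mean;
iot_can_scale_up() and the 0.9 threshold are made up for illustration,
not the actual io-threads code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical gate for io-threads scale-up: refuse to spawn another
 * worker once the 1-minute load average gets close to the core count,
 * so an already overloaded brick is never fed more runnable threads. */
static int
iot_can_scale_up (int curr_threads, int max_threads)
{
        double load;
        long   ncores = sysconf (_SC_NPROCESSORS_ONLN);

        if (curr_threads >= max_threads)
                return 0;

        if (getloadavg (&load, 1) != 1)
                return 1;   /* no load info: keep today's behaviour */

        /* Stay well below the ~1.5x-cores point where the reported
         * cluster froze; the 0.9 factor is only a guess. */
        return load < 0.9 * (double)ncores;
}

int
main (void)
{
        printf ("may scale up: %d\n", iot_can_scale_up (4, 16));
        return 0;
}
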
> * We might be able to tweak io-threads (which already runs on the
>   bricks and already has a global queue) to schedule requests in a
>   fairer way across clients. Right now it executes them in the
>   same order that they were read from the network. That tends to
>   be a bit "unfair" and that should be fixed in the network code,
>   but that's a much harder task.

This sounds like an easier fix. We can make io-threads factor in another
input, namely the client a request came in through (essentially
frame->root->client), before scheduling it. That should at least make the
problem bearable rather than crippling. As to which algorithm to use, I
think we can consider the leaky-bucket implementation from bit-rot, or
dmclock. I haven't thought very deeply about the algorithm part yet; if
the approach sounds OK, we can discuss the algorithms in more detail.
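
To make the idea concrete, here is a toy model of per-client queues with
round-robin dequeue. All names are made up for illustration; in gluster
the per-client key would come from frame->root->client, and the
round-robin policy could later be swapped for leaky-bucket or dmclock
accounting:

#include <stdio.h>
#include <stdlib.h>

#define NCLIENTS 3

struct request {
        struct request *next;
        int             id;
};

struct client_q {
        struct request *head;
        struct request *tail;
};

static void
enqueue (struct client_q *q, int id)
{
        struct request *r = calloc (1, sizeof (*r));

        r->id = id;
        if (q->tail)
                q->tail->next = r;
        else
                q->head = r;
        q->tail = r;
}

/* Round-robin dequeue: start from the client after the one served
 * last, so one client with a deep backlog cannot starve the rest. */
static struct request *
dequeue_fair (struct client_q *qs, int *cursor)
{
        int i;

        for (i = 0; i < NCLIENTS; i++) {
                struct client_q *q = &qs[(*cursor + i) % NCLIENTS];

                if (q->head) {
                        struct request *r = q->head;

                        q->head = r->next;
                        if (!q->head)
                                q->tail = NULL;
                        *cursor = (*cursor + i + 1) % NCLIENTS;
                        return r;
                }
        }
        return NULL;
}

int
main (void)
{
        struct client_q qs[NCLIENTS] = { { 0 } };
        struct request *r;
        int             cursor = 0;

        /* Client 0 floods the brick; clients 1 and 2 send one request
         * each. With a single global FIFO they would wait behind all of
         * client 0's backlog; here they are served on the first pass
         * (order: 100, 200, 300, 101, 102). */
        enqueue (&qs[0], 100);
        enqueue (&qs[0], 101);
        enqueue (&qs[0], 102);
        enqueue (&qs[1], 200);
        enqueue (&qs[2], 300);

        while ((r = dequeue_fair (qs, &cursor)) != NULL) {
                printf ("executing request %d\n", r->id);
                free (r);
        }
        return 0;
}
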
>
> These are only weak approximations of what we really should be doing,
> and will be doing in the long term, but (without making any promises)
> they might be sufficient and achievable in the near term. Thoughts?

--
Raghavendra G