[Gluster-users] One client can effectively hang entire gluster array

Glomski, Patrick patrick.glomski at corvidtec.com
Fri Jul 8 13:29:44 UTC 2016


Hello, users and devs.

TL;DR: One gluster client can essentially cause denial of service /
availability loss to the entire gluster array. There's no way to stop it and
almost no way to find the bad client. All versions are probably affected
(at least 3.6 and 3.7 are).

We have two large replicate gluster arrays (3.6.6 and 3.7.11) that are used
in a high-performance computing environment. Two file-access patterns cause
severe issues with glusterfs: some of our scientific codes write hundreds of
files (~400-500) simultaneously (one or more files per processor core, so
lots of small or large writes), and others read thousands of files
(2000-3000) simultaneously to grab metadata from each one (lots of small
reads).
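
To give a concrete picture, the metadata-read case boils down to something
like the following (the path and exact parallelism are made up for
illustration; the real jobs are compiled codes, not scripts):

    # Stat a few thousand files on the gluster mount in parallel.
    # /gluster/scratch/run0001 and the counts here are hypothetical.
    find /gluster/scratch/run0001 -type f | head -3000 | \
        xargs -P 500 -n 1 stat > /dev/null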

In either of these situations, one glusterfsd process on whichever peer the
client is currently talking to will skyrocket to *nproc* cpu usage (800%,
1600%), and the storage cluster becomes essentially useless: all other
clients eventually try to read or write data through the overloaded peer
and, when that happens, their connections hang. Heals between peers hang as
well, because the load on the peer is around 1.5x the number of cores or
more. This occurs on both gluster 3.6 and 3.7, is very repeatable, and
happens much too frequently.
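
For reference, this is roughly how we watch it happen on the servers
(nothing gluster-specific, just standard tools):

    # Run on each gluster peer: the brick process pegs the CPUs
    # and the load average climbs past 1.5x the core count.
    ps -C glusterfsd -o pid,pcpu,etime,args --sort=-pcpu | head
    uptime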

Even worse, there seems to be no definitive way to diagnose which client is
causing the issue. 'volume status <> clients' doesn't help, because it
reports the total number of bytes read/written by each client: (a) the
metadata in question is tiny compared to the multi-gigabyte output files
being dealt with, and (b) the byte counts are cumulative, and since the
compute nodes are always up with the filesystems mounted, the transfer
counts are astronomical. The best workaround I've come up with is to
blackhole-route traffic from clients one at a time (effectively pushing
their traffic over to the other peer), wait a few minutes for the backlogged
traffic to dissipate (if it's going to), see if the load on glusterfsd
drops, and repeat until I find the client causing the issue. I would *love*
any ideas on a better way to find rogue clients.
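
In case it helps anyone else, the hunt looks roughly like this, run on the
overloaded peer (<client-ip> is one compute node at a time):

    # Crude rogue-client hunt: blackhole one client, wait, check load.
    ip route add blackhole <client-ip>/32   # push its traffic to the other peer
    sleep 300                               # let backlogged traffic drain, if it will
    top -b -n 1 | grep glusterfsd           # did glusterfsd's CPU usage drop?
    ip route del blackhole <client-ip>/32   # restore the client either way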

More importantly, though, there needs to be some mechanism enforced to stop
one user from being able to render the entire filesystem unavailable to all
other users. In the worst case, I would even settle for a gluster volume
option that simply disconnects clients exceeding some threshold of file-open
requests. That's WAY preferable to a complete availability loss reminiscent
of a DDoS attack...
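
For what it's worth, the closest existing knob I've found is
server.outstanding-rpc-limit, which (if I understand it correctly) only
throttles how many requests a single client can have in flight; it doesn't
disconnect anyone:

    # Throttles outstanding RPCs per client connection. This is NOT
    # the disconnect-on-abuse option I'm asking for above.
    gluster volume set <volname> server.outstanding-rpc-limit 64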

Apologies for the essay and looking forward to any help you can provide.

Thanks,
Patrick