[Gluster-users] Unexpected behaviour during replication heal

Darren Austin darren-lists at widgit.com
Tue Jun 28 22:31:26 UTC 2011


----- Original Message -----
> You can change the network.ping-timeout parameter from 46 sec to 5 or
> 10 seconds to reduce the "time stalls" on the client.

Thanks for the heads up on that - I'll give it a try.
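
For reference, a minimal sketch of what that change might look like with the gluster CLI. The volume name "data-volume" is only a guess based on the log prefix quoted further down, and the 10 second value is just the suggestion above, not something I have tested. The snippet only prints the command so it can be checked before being run on one of the servers.

```shell
# Assumption: the volume is called "data-volume"; substitute the real name.
VOLNAME="data-volume"
# Suggested value from the thread (seconds); untested.
TIMEOUT=10

# Build the command and print it for review instead of running it blindly.
CMD="gluster volume set $VOLNAME network.ping-timeout $TIMEOUT"
echo "$CMD"
```

Once the option has been set, it should show up under the reconfigured options in the output of "gluster volume info".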

I was more concerned with the fact it stalled at all really.  If one server is still online and able to receive data, why is the client not sending it?

> If you don't kill the process and wait until all nodes are synchronized,
> the whole system should return to ready.

Negative :)

I waited quite some time (30+ minutes) for things to come back to a usable state, and it didn't happen.

There is *no* synchronisation going on while the mount point is inaccessible - the servers are NOT trying to sync the data in order to heal.

As I said, this is 100% replicable - the client will hard lock, and will not recover.  At least it didn't recover in the 30 minutes I left it alone, and if it's not going to recover in 30 minutes, it's of very little use as a redundant cluster storage system IMO.

> To force a synchronization of the whole volume you can run this command on
> the client:
> find <gluster-mount> -noleaf -print0 | xargs --null stat >/dev/null

I can't run that command on the client *at all*, since the mount point is completely hard locked.
I can't even get an ls in the directory.

When I say it's hard locked, I mean it's hard locked.  It's not that the client has disconnected from the server (which would give the transport disconnected error); rather, any process trying to access the directory is put into the "uninterruptible I/O wait" state.
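
For anyone wanting to check for the same symptom, here is a rough sketch of one way to list processes in that state on a Linux client. It assumes a procps-style ps; the helper name filter_d_state is made up for this example. Processes in uninterruptible sleep show a state code beginning with "D".

```shell
# Hypothetical helper: keep only rows whose state column (2nd field)
# starts with D, i.e. uninterruptible sleep.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Columns: pid, state, kernel wait channel, command name.
# The trailing "=" on each column suppresses the header line.
ps -eo pid=,stat=,wchan=,comm= | filter_d_state
```

Any ls, stat or dd that touched the mount point and now appears in this list is blocked inside the kernel, which is why such processes typically ignore signals until the I/O completes or the mount is torn down.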

> ------------------------------------------------------
> that happens because Gluster's self heal is a blocking operation. We
> are working on a non-blocking self heal, we are hoping to ship it in
> early September.
> ------------------------------------------------------

I can live with a blocking self-heal operation.

But seriously, is a redundant storage cluster supposed to hard lock the client when a server comes back online during a write operation?

> You can verify that directly from your client log... you can read
> that: [2011-06-28 13:28:17.484646] I
> [client-lk.c:617:decrement_reopen_fd_count] 0-data-volume-client-0:
> last fd open'd/lock-self-heal'd - notifying CHILD-UP

Trying to start the self-heal is not the same as actually doing it :(

There is no sync between the two servers in the situation I outlined, and the client cannot trigger a self-heal as you suggest because the client is effectively dead in the water until it's forcibly killed and re-mounted.

Cheers,
Darren.

-- 
Darren Austin - Systems Administrator, Widgit Software.
Tel: +44 (0)1926 333680.    Web: http://www.widgit.com/
26 Queen Street, Cubbington, Warwickshire, CV32 7NA.



