[Gluster-devel] Serialization of fops acting on same dentry on server

Shyam srangana at redhat.com
Mon Aug 17 14:07:15 UTC 2015


On 08/17/2015 01:19 AM, Raghavendra Gowdappa wrote:
>
>
> ----- Original Message -----
>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> To: "Gluster Devel" <gluster-devel at gluster.org>
>> Cc: "Sakshi Bansal" <sabansal at redhat.com>
>> Sent: Monday, 17 August, 2015 10:39:38 AM
>> Subject: [Gluster-devel] Serialization of fops acting on same dentry on server
>>
>> All,
>>
>> Pranith and I were discussing the implementation of compound operations
>> like "create + lock", "mkdir + lock", "open + lock", etc. These operations
>> are useful in situations like:
>>
>> 1. To prevent locking on all subvols during directory creation as part of
>> self-heal in dht. Currently we follow the approach of locking _all_
>> subvols in both rmdir and lookup-heal [1].
>
> Correction. It should've been: "to prevent locking on all subvols during rmdir". The lookup self-heal should lock on all subvols (with a compound "mkdir + lookup" if the directory is not present on a subvol). With this, rmdir/rename can lock on just any one subvol, and directory creation by any parallel lookup-heal will be prevented.
>
>> 2. To lock a file in advance so that there is less of a performance hit
>> during transactions in afr.

I see multiple thoughts here, so I am splitting my response into these parts:

- Compound FOPs:
The whole idea of (and need for) compound FOPs is, I think, very useful.
Initially compounding FOP+Lock is a good idea, as this is mostly
internal to Gluster and does not change any interface exposed to any of
the consumers. Also, as Pranith is involved, we can iron out the AFR/EC
related possibilities in such compounding as well.

In compounding, I am only concerned about cases where part of the
compound operation succeeds on one replica but fails on the other. As
an example, if the mkdir succeeds on one replica (and so the subsequent
locking succeeds), but the mkdir fails on the other (because a competing
client's compound FOP raced this one), how can we handle such situations?
Do we need server-side AFR/EC with leader election, like in NSR, to
handle this? (Maybe the example is not a good/firm one for this case,
but nevertheless, can compounding create such problems?)

Another question would be: do we need to compound it as Lock+FOP rather
than FOP+Lock in some cases?

- Advance locking to reduce serial RPC requests that degrade performance:
This is again a good thing to do; part of such a concept is in eager
locking already (as I see it). What I would like to see in this regard
is eager leasing (piggybacked leases) of a file (and, loosely, of a
directory, as I need to think through that case more), so that we can
optimize the common case where a file is being operated on by a single
client, and degrade to fine-grained locking when multiple clients compete.

Assuming eager leasing, AFR transactions need only client-side in-memory
locking (to prevent two threads/consumers of the same client racing on
the same file/dir). Also, with leasing and lease breaking, we can get
better at cooperating with other clients than eager locking does now.

In short, I would like the advance locking or leasing to be part of the
client-side caching stack, so that multiple xlators on the client can
leverage the same. I would also prefer the leasing model over the
locking model, as it allows easier breaking than locks.

>>
>> While thinking about implementing such compound operations, it occurred to me
>> that one of the problems would be how we handle a racing mkdir/create and
>> a (named lookup - simply referred to as lookup from now on - followed by a
>> lock). This is because
>> 1. creation of the directory/file on the backend, and
>> 2. linking of the inode with the gfid corresponding to that file/directory
>>
>> are not atomic. The inode passed down during the mkdir/create call is not
>> guaranteed to be the one that survives in the inode table. Since the
>> posix-locks xlator maintains all the lock state in the inode, it would be a
>> problem if a different inode is linked into the inode table than the one
>> passed during mkdir/create. One way to solve this problem is to serialize
>> fops (like mkdir/create, lookup, rename, rmdir, unlink) that are happening
>> on a particular dentry. This serialization would also solve other bugs like:
>>
>> 1. issues solved by [2][3] and possibly many such issues.
>> 2. Stale dentries left behind in bricks' inode tables because of racing
>> lookup and dentry-modification ops (like rmdir, unlink, rename, etc.).
>>
>> The initial idea I have now is to maintain the set of fops in progress on a
>> dentry in the parent inode (maybe in the resolver code in protocol/server).
>> Based on this we can serialize the operations. Since we need to serialize
>> _only_ operations on a dentry (we don't serialize nameless lookups), it is
>> guaranteed that we always have a parent inode. Any comments/discussion on
>> this would be appreciated.

My initial comment on this would be to refer to the FS locking notes in
the Linux kernel, which have rules for locking during dentry operations
and such.

The next part is as follows:
- Why create the name (dentry) before creating the inode (gfid instance)
for a file or a directory?
   - A client cannot do a nameless lookup, and will fail a named lookup,
if the named entry is not created yet (a nameless lookup assumes that at
some point in the past a named lookup returned the inode/gfid for the
entry that is now being used to do the lookup).
   - So a mkdir/create could first create the gfid for the object that
is being operated on and then the name (hard link); would this not
resolve the problem of the race being discussed?

Also, on a local FS (say XFS or another), doesn't a similar problem
exist? That is, for a dentry to be created in the directory inode, it
needs the inode number and name, so the inode would first need to be
created and then linked into the directory with its name and inode
number. The problem is similar in our case as well: create the gfid
(which is the inode) and then link it into the directory, i.e.
hardlink/softlink the name.

I think for directories we create the softlink the other way around,
i.e. the gfid representation is a softlink to the real named directory,
so this may need some additional thought.

>>
>> [1] http://review.gluster.org/11725
>> [2] http://review.gluster.org/9913
>> [3] http://review.gluster.org/5240
>>
>> regards,
>> Raghavendra.
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
