Re: Serialization of fops acting on same dentry on server

On 08/17/2015 01:19 AM, Raghavendra Gowdappa wrote:


----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Cc: "Sakshi Bansal" <sabansal@xxxxxxxxxx>
Sent: Monday, 17 August, 2015 10:39:38 AM
Subject: Serialization of fops acting on same dentry on server

All,

Pranith and I were discussing the implementation of compound operations
like "create + lock", "mkdir + lock", "open + lock", etc. These operations
are useful in situations like:

1. To prevent locking on all subvols during directory creation as part of
self-heal in dht. Currently we follow the approach of locking _all_
subvols in both rmdir and lookup-heal [1].

Correction. It should've been "to prevent locking on all subvols during rmdir". The lookup self-heal should lock on all subvols (with a compound "mkdir + lookup" if the directory is not present on a subvol). With this, rmdir/rename can lock on just any one subvol, and that prevents a parallel lookup-heal from re-creating the directory being removed.

2. To lock a file in advance so that afr transactions incur a smaller
performance hit.

I see multiple thoughts here, so I am splitting my response into these parts:

- Compound FOPs:
The whole idea and need for compound FOPs I think is very useful. Initially compounding the FOP+Lock is a good idea as this is mostly internal to Gluster and does not change any interface to any of the consumers. Also, as Pranith is involved we can iron out AFR/EC related possibilities in such compounding as well.

Regarding compounding, I am mainly concerned about cases where part of the compound operation succeeds on one replica but fails on another. For example, if mkdir succeeds on one replica (so the subsequent lock succeeds there too) but mkdir fails on the other (because a competing client's compound FOP raced this one), how do we handle such situations? Do we need server-side AFR/EC with leader election, like in NSR, to handle this? (Maybe this is not the best example, but nevertheless, can compounding create such problems?)

Another question would be, we need to compound it as Lock+FOP rather than FOP+Lock in some cases, right?
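To make the partial-failure concern above concrete, here is a minimal sketch (all names are illustrative, not Gluster code) of a compound "mkdir + lock" executed independently against two replicas. One replica can fail mid-compound, leaving a split state that something like server-side AFR/EC would then have to resolve:

```python
# Hypothetical sketch: a compound op ["mkdir", "lock"] applied to each
# replica independently, with no server-side coordination between them.

def run_compound(replica, steps):
    """Run steps in order on one replica; stop at the first failure and
    undo the steps that already succeeded on this replica."""
    done = []
    for step in steps:
        if not replica.apply(step):
            for s in reversed(done):
                replica.undo(s)
            return False
        done.append(step)
    return True

class Replica:
    def __init__(self, fail_on=None):
        self.state = set()
        self.fail_on = fail_on or set()
    def apply(self, step):
        if step in self.fail_on:
            return False
        self.state.add(step)
        return True
    def undo(self, step):
        self.state.discard(step)

# mkdir succeeds on r0, but a racing client already won on r1, so mkdir
# (and hence the whole compound) fails there.
r0, r1 = Replica(), Replica(fail_on={"mkdir"})
results = [run_compound(r, ["mkdir", "lock"]) for r in (r0, r1)]
# results == [True, False]: the compound succeeded on only one replica,
# which is exactly the split state the question above is about.
```

The local rollback keeps each replica internally consistent, but nothing here reconciles r0 against r1; that is the gap the server-side AFR/EC question points at.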

- Advance locking to reduce serial RPC requests that degrade performance:
This is again a good thing to do; part of this concept already exists in eager locking (as I see it). What I would like to see in this regard is eager leasing (piggyback leases) of a file (and loosely of a directory, though I need to think that case through more), so that we can optimize the common case where a file is operated on by a single client and degrade to fine-grained locking when multiple clients compete.

Assuming eager leasing, AFR transactions need only client-side in-memory locking (to prevent two threads/consumers on the same client from racing on the same file/dir). Also, with leasing and lease breaking, we can cooperate with other clients better than eager locking does now.

In short, I would like the advance locking or leasing to be part of the client-side caching stack, so that multiple xlators on the client can leverage it; and I would prefer the leasing model over the locking model, as leases are easier to break than locks.
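As a rough illustration of the leasing idea (hypothetical names, not Gluster's actual lease machinery): while a lease is held, transactions take only a cheap in-process lock; once the lease is broken by a competing client, transactions degrade to taking a server-side lock.

```python
import threading

# Illustrative sketch of eager leasing. While this client holds a lease on
# a file, transactions only take an in-process lock (no network round
# trip); after a lease break, they fall back to fine-grained server locks.

class LeasedFile:
    def __init__(self):
        self.lease_held = True            # granted eagerly, e.g. on open
        self.local_lock = threading.Lock()
        self.server_locks_taken = 0

    def break_lease(self):
        """Called when another client starts operating on the file."""
        self.lease_held = False

    def transaction(self, op):
        if self.lease_held:
            with self.local_lock:         # cheap in-memory serialization
                return op()
        self.server_locks_taken += 1      # degrade to a server-side lock
        return op()

f = LeasedFile()
f.transaction(lambda: "write-1")          # in-memory lock only
f.break_lease()                           # a second client showed up
f.transaction(lambda: "write-2")          # now takes a server lock
```

The point of the sketch is the degradation path: the single-client fast path never touches the server, and the lease break is what switches the client to cooperative, fine-grained locking.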


While thinking about implementing such compound operations, it occurred to me
that one of the problems would be handling a racing mkdir/create against a
named lookup (simply referred to as lookup from now on) followed by a lock.
This is because,
1. creation of directory/file on backend
2. linking of the inode with the gfid corresponding to that file/directory

are not atomic. It is not guaranteed that the inode passed down during the
mkdir/create call is the one that survives in the inode table. Since the
posix-locks xlator maintains all of its lock state in the inode, it would be a
problem if a different inode gets linked into the inode table than the one
passed during mkdir/create. One way to solve this problem is to serialize fops
(like mkdir/create, lookup, rename, rmdir, unlink) happening on a
particular dentry. This serialization would also solve other bugs like:

1. issues solved by [2][3] and possibly many such issues.
2. Stale dentries left behind in the bricks' inode tables because of a lookup
racing with dentry-modification ops (like rmdir, unlink, rename, etc.).

The initial idea I have now is to maintain the set of fops in progress on a
dentry in the parent inode (maybe in the resolver code in protocol/server).
Based on this we can serialize the operations. Since we need to serialize
_only_ operations on a dentry (we don't serialize nameless lookups), we are
guaranteed to always have a parent inode. Any comments/discussion on this
would be appreciated.
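The per-dentry serialization idea above could look something like this sketch (illustrative only; in reality it would live in protocol/server's resolver and hang off the server-side inode table): the parent inode tracks which dentry names have a fop in flight, a second fop on the same name waits until the first finishes, and fops on different names proceed in parallel.

```python
import threading

# Sketch: serialize fops per (parent inode, dentry name). Names and the
# class layout are hypothetical, not the actual Gluster inode structures.

class ParentInode:
    def __init__(self):
        self._cv = threading.Condition()
        self._in_progress = set()   # dentry names with an active fop

    def begin_dentry_fop(self, name):
        """Block until no other fop is in flight on this dentry name."""
        with self._cv:
            while name in self._in_progress:
                self._cv.wait()
            self._in_progress.add(name)

    def end_dentry_fop(self, name):
        with self._cv:
            self._in_progress.discard(name)
            self._cv.notify_all()   # wake any fop waiting on this name

parent = ParentInode()
parent.begin_dentry_fop("dir1")   # e.g. an mkdir("dir1") in flight
parent.begin_dentry_fop("other")  # unrelated dentry: not serialized
parent.end_dentry_fop("dir1")     # a waiting lookup("dir1") may now run
```

Because nameless (gfid-only) lookups never enter this path, the scheme only ever needs a resolved parent inode, matching the observation above.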

My initial comment on this would be to refer to the filesystem locking notes in the Linux kernel documentation, which lay out the rules for locking during dentry operations and the like.

The next part is as follows:
- Why create the name (dentry) before creating the inode (gfid instance) for a file or a directory? A client cannot do a nameless lookup, and will fail a named lookup, if the named entry is not created yet (since a nameless lookup assumes that at some point in the past a named lookup returned the inode/gfid for the entry now being looked up). So mkdir/create could first create the gfid for the object being operated on, and then the name (hard link). Would this not resolve the problem of the race being discussed?

Also, doesn't a similar problem exist on a local FS (say XFS or others)? For a dentry to be created in a directory, it needs the inode number and the name, so first the inode is created and then it is linked into the directory with its name and inode number. The situation is similar for us: create the gfid (which is the inode) and then link it into the directory, i.e. hard-link/soft-link the name.
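The inode-first ordering can be demonstrated on any POSIX filesystem (the ".gfids" layout below is purely illustrative, not the actual .glusterfs backend layout): create the inode under a gfid-style path first, then hard-link the human-visible name to it.

```python
import os
import tempfile
import uuid

# Sketch of "create the gfid first, then the name": the inode exists and
# is addressable by gfid before the name is linked into the parent dir.

root = tempfile.mkdtemp()
gfid = str(uuid.uuid4())
gfid_dir = os.path.join(root, ".gfids")   # illustrative backend layout
os.makedirs(gfid_dir)

# Step 1: create the inode, addressed only by its gfid.
gfid_path = os.path.join(gfid_dir, gfid)
with open(gfid_path, "w") as fh:
    fh.write("data")

# Step 2: link the name into the parent directory.
name_path = os.path.join(root, "file.txt")
os.link(gfid_path, name_path)

# One inode, two names, created inode-first: a nameless (gfid) lookup can
# succeed as soon as step 1 completes, before the name exists.
same_inode = os.stat(gfid_path).st_ino == os.stat(name_path).st_ino
```

Between steps 1 and 2 a named lookup still fails (the dentry does not exist yet), which is exactly the window the ordering question above is probing.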

I think for directories we create the soft link the other way around, i.e. the gfid representation is a soft link to the real named directory, so this may need some additional thought.


[1] http://review.gluster.org/11725
[2] http://review.gluster.org/9913
[3] http://review.gluster.org/5240

regards,
Raghavendra.
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel



