Re: Serialization of fops acting on same dentry on server

On 08/17/2015 01:19 AM, Raghavendra Gowdappa wrote:


----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Cc: "Sakshi Bansal" <sabansal@xxxxxxxxxx>
Sent: Monday, 17 August, 2015 10:39:38 AM
Subject: Serialization of fops acting on same dentry on server

All,

Pranith and I were discussing the implementation of compound operations
like "create + lock", "mkdir + lock", "open + lock", etc. These operations
are useful in situations like:

1. To prevent locking on all subvols during directory creation as part of
self-heal in dht. Currently we follow the approach of locking _all_
subvols in both rmdir and lookup-heal [1].

Correction. It should've been "to prevent locking on all subvols during rmdir". The lookup self-heal should lock on all subvols (with a compound "mkdir + lookup" if the directory is not present on a subvol). With this, rmdir/rename can lock on just any one subvol, and that prevents a parallel lookup-heal from re-creating the directory being removed.

2. To lock a file in advance so that afr transactions incur a smaller
performance hit.

I see multiple thoughts here, so I am splitting my response into these parts:

- Compound FOPs:
The whole idea and need for compound FOPs I think is very useful. Initially compounding the FOP+Lock is a good idea as this is mostly internal to Gluster and does not change any interface to any of the consumers. Also, as Pranith is involved we can iron out AFR/EC related possibilities in such compounding as well.

Regarding compounding, I am mainly concerned about cases where part of the compound operation succeeds on one replica but fails on another. For example, if mkdir succeeds on one replica (so the subsequent lock succeeds there too) but mkdir fails on the other (because a competing client's compound FOP raced this one), how do we handle such situations? Do we need server-side AFR/EC with leader election, like in NSR, to handle this? (Maybe this is not the best example, but nevertheless, can compounding create such problems?)

Another question would be, we need to compound it as Lock+FOP rather than FOP+Lock in some cases, right?
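To make the partial-failure concern above concrete, here is a minimal sketch (all names are illustrative, not Gluster code) of a compound "mkdir + lock" executed independently against two replicas. One replica can fail mid-compound, leaving a split state that something like server-side AFR/EC would then have to resolve:

```python
# Hypothetical sketch: a compound op ["mkdir", "lock"] applied to each
# replica independently, with no server-side coordination between them.

def run_compound(replica, steps):
    """Run steps in order on one replica; stop at the first failure and
    undo the steps that already succeeded on this replica."""
    done = []
    for step in steps:
        if not replica.apply(step):
            for s in reversed(done):
                replica.undo(s)
            return False
        done.append(step)
    return True

class Replica:
    def __init__(self, fail_on=None):
        self.state = set()
        self.fail_on = fail_on or set()
    def apply(self, step):
        if step in self.fail_on:
            return False
        self.state.add(step)
        return True
    def undo(self, step):
        self.state.discard(step)

# mkdir succeeds on r0, but a racing client already won on r1, so mkdir
# (and hence the whole compound) fails there.
r0, r1 = Replica(), Replica(fail_on={"mkdir"})
results = [run_compound(r, ["mkdir", "lock"]) for r in (r0, r1)]
# results == [True, False]: the compound succeeded on only one replica,
# which is exactly the split state the question above is about.
```

The local rollback keeps each replica internally consistent, but nothing here reconciles r0 against r1; that is the gap the server-side AFR/EC question points at.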

- Advance locking to reduce serial RPC requests that degrade performance:
This is again a good thing to do; part of this concept already exists in eager locking (as I see it). What I would like to see in this regard is eager leasing (piggyback leases) of a file (and loosely of a directory, though I need to think that case through more), so that we can optimize the common case where a file is operated on by a single client and degrade to fine-grained locking when multiple clients compete.

Assuming eager leasing, AFR transactions need only client-side in-memory locking (to prevent two threads/consumers on the same client from racing on the same file/dir). Also, with leasing and lease breaking, we can cooperate with other clients better than eager locking does now.

In short, I would like the advance locking or leasing to be part of the client-side caching stack, so that multiple xlators on the client can leverage it; and I would prefer the leasing model over the locking model, as leases are easier to break than locks.
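As a rough illustration of the leasing idea (hypothetical names, not Gluster's actual lease machinery): while a lease is held, transactions take only a cheap in-process lock; once the lease is broken by a competing client, transactions degrade to taking a server-side lock.

```python
import threading

# Illustrative sketch of eager leasing. While this client holds a lease on
# a file, transactions only take an in-process lock (no network round
# trip); after a lease break, they fall back to fine-grained server locks.

class LeasedFile:
    def __init__(self):
        self.lease_held = True            # granted eagerly, e.g. on open
        self.local_lock = threading.Lock()
        self.server_locks_taken = 0

    def break_lease(self):
        """Called when another client starts operating on the file."""
        self.lease_held = False

    def transaction(self, op):
        if self.lease_held:
            with self.local_lock:         # cheap in-memory serialization
                return op()
        self.server_locks_taken += 1      # degrade to a server-side lock
        return op()

f = LeasedFile()
f.transaction(lambda: "write-1")          # in-memory lock only
f.break_lease()                           # a second client showed up
f.transaction(lambda: "write-2")          # now takes a server lock
```

The point of the sketch is the degradation path: the single-client fast path never touches the server, and the lease break is what switches the client to cooperative, fine-grained locking.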


While thinking about implementing such compound operations, it occurred to me
that one of the problems would be handling a racing mkdir/create against a
named lookup (simply referred to as lookup from now on) followed by a lock.
This is because,
1. creation of directory/file on backend
2. linking of the inode with the gfid corresponding to that file/directory

are not atomic. It is not guaranteed that the inode passed down during the
mkdir/create call is the one that survives in the inode table. Since the
posix-locks xlator maintains all of its lock state in the inode, it would be a
problem if a different inode gets linked into the inode table than the one
passed during mkdir/create. One way to solve this problem is to serialize fops
(like mkdir/create, lookup, rename, rmdir, unlink) happening on a
particular dentry. This serialization would also solve other bugs like:

1. issues solved by [2][3] and possibly many such issues.
2. Stale dentries left behind in the bricks' inode tables because of a lookup
racing with dentry-modification ops (like rmdir, unlink, rename, etc.).

The initial idea I have now is to maintain the set of fops in progress on a
dentry in the parent inode (maybe in the resolver code in protocol/server).
Based on this we can serialize the operations. Since we need to serialize
_only_ operations on a dentry (we don't serialize nameless lookups), we are
guaranteed to always have a parent inode. Any comments/discussion on this
would be appreciated.
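The per-dentry serialization idea above could look something like this sketch (illustrative only; in reality it would live in protocol/server's resolver and hang off the server-side inode table): the parent inode tracks which dentry names have a fop in flight, a second fop on the same name waits until the first finishes, and fops on different names proceed in parallel.

```python
import threading

# Sketch: serialize fops per (parent inode, dentry name). Names and the
# class layout are hypothetical, not the actual Gluster inode structures.

class ParentInode:
    def __init__(self):
        self._cv = threading.Condition()
        self._in_progress = set()   # dentry names with an active fop

    def begin_dentry_fop(self, name):
        """Block until no other fop is in flight on this dentry name."""
        with self._cv:
            while name in self._in_progress:
                self._cv.wait()
            self._in_progress.add(name)

    def end_dentry_fop(self, name):
        with self._cv:
            self._in_progress.discard(name)
            self._cv.notify_all()   # wake any fop waiting on this name

parent = ParentInode()
parent.begin_dentry_fop("dir1")   # e.g. an mkdir("dir1") in flight
parent.begin_dentry_fop("other")  # unrelated dentry: not serialized
parent.end_dentry_fop("dir1")     # a waiting lookup("dir1") may now run
```

Because nameless (gfid-only) lookups never enter this path, the scheme only ever needs a resolved parent inode, matching the observation above.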

My initial comment on this would be to refer to the filesystem locking notes in the Linux kernel documentation, which lay out the rules for locking during dentry operations and the like.

The next part is as follows:
- Why create the name (dentry) before creating the inode (gfid instance) for a file or a directory? A client cannot do a nameless lookup, and will fail a named lookup, if the named entry is not created yet (since a nameless lookup assumes that at some point in the past a named lookup returned the inode/gfid for the entry now being looked up). So mkdir/create could first create the gfid for the object being operated on, and then the name (hard link). Would this not resolve the problem of the race being discussed?

Also, doesn't a similar problem exist on a local FS (say XFS or others)? For a dentry to be created in a directory, it needs the inode number and the name, so first the inode is created and then it is linked into the directory with its name and inode number. The situation is similar for us: create the gfid (which is the inode) and then link it into the directory, i.e. hard-link/soft-link the name.
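The inode-first ordering can be demonstrated on any POSIX filesystem (the ".gfids" layout below is purely illustrative, not the actual .glusterfs backend layout): create the inode under a gfid-style path first, then hard-link the human-visible name to it.

```python
import os
import tempfile
import uuid

# Sketch of "create the gfid first, then the name": the inode exists and
# is addressable by gfid before the name is linked into the parent dir.

root = tempfile.mkdtemp()
gfid = str(uuid.uuid4())
gfid_dir = os.path.join(root, ".gfids")   # illustrative backend layout
os.makedirs(gfid_dir)

# Step 1: create the inode, addressed only by its gfid.
gfid_path = os.path.join(gfid_dir, gfid)
with open(gfid_path, "w") as fh:
    fh.write("data")

# Step 2: link the name into the parent directory.
name_path = os.path.join(root, "file.txt")
os.link(gfid_path, name_path)

# One inode, two names, created inode-first: a nameless (gfid) lookup can
# succeed as soon as step 1 completes, before the name exists.
same_inode = os.stat(gfid_path).st_ino == os.stat(name_path).st_ino
```

Between steps 1 and 2 a named lookup still fails (the dentry does not exist yet), which is exactly the window the ordering question above is probing.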

I think for directories we create the soft link the other way around, i.e. the gfid representation is a soft link to the real named directory, so this may need some additional thought.


[1] http://review.gluster.org/11725
[2] http://review.gluster.org/9913
[3] http://review.gluster.org/5240

regards,
Raghavendra.
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel



