----- Original Message ----- > From: "Krutika Dhananjay" <kdhananj@xxxxxxxxxx> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > Cc: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Dan Lambright" > <dlambrig@xxxxxxxxxx>, "Nithya Balachandran" <nbalacha@xxxxxxxxxx>, "Ben Turner" <bturner@xxxxxxxxxx>, "Ben England" > <bengland@xxxxxxxxxx>, "Manoj Pillai" <mpillai@xxxxxxxxxx>, "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, > "Ravishankar Narayanankutty" <ranaraya@xxxxxxxxxx>, xhernandez@xxxxxxxxxx > Sent: Thursday, August 13, 2015 9:53:41 AM > Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp > > ----- Original Message ----- > > > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > > To: "Krutika Dhananjay" <kdhananj@xxxxxxxxxx> > > Cc: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx>, "Gluster Devel" > > <gluster-devel@xxxxxxxxxxx>, "Dan Lambright" <dlambrig@xxxxxxxxxx>, "Nithya > > Balachandran" <nbalacha@xxxxxxxxxx>, "Ben Turner" <bturner@xxxxxxxxxx>, > > "Ben > > England" <bengland@xxxxxxxxxx>, "Manoj Pillai" <mpillai@xxxxxxxxxx>, > > "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Ravishankar > > Narayanankutty" <ranaraya@xxxxxxxxxx>, xhernandez@xxxxxxxxxx > > Sent: Thursday, August 13, 2015 9:06:37 AM > > Subject: Re: Inconsistent behavior due to lack of lookup on entry followed > > by > > readdirp > > > ----- Original Message ----- > > > From: "Krutika Dhananjay" <kdhananj@xxxxxxxxxx> > > > To: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx> > > > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Dan Lambright" > > > <dlambrig@xxxxxxxxxx>, "Nithya Balachandran" > > > <nbalacha@xxxxxxxxxx>, "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Ben > > > Turner" <bturner@xxxxxxxxxx>, "Ben > > > England" <bengland@xxxxxxxxxx>, "Manoj Pillai" <mpillai@xxxxxxxxxx>, > > > "Pranith Kumar Karampuri" > > > <pkarampu@xxxxxxxxxx>, "Ravishankar Narayanankutty" > > > <ranaraya@xxxxxxxxxx>, > > > xhernandez@xxxxxxxxxx > > > Sent: Wednesday, August 12, 2015 9:02:44 PM > > > Subject: Re: Inconsistent behavior due to lack of lookup on entry > > > followed > > > by readdirp > > > > > > I faced the same issue with the sharding translator. I fixed it by making > > > its > > > readdirp callback initialize individual entries' inode ctx, some of these > > > being xattr values, which are filled in entry->dict by the posix > > > translator. > > > Here is the patch that got merged recently: > > > http://review.gluster.org/11854 > > > Would that be as easy to do in DHT as well? > > > The problem is not just filling out state in the inode. The bigger problem > > is > > healing, which is supposed to maintain a directory/file to be in state > > consistent with our design before a successful reply to lookup. The > > operations can involve creating directories on missing subvols, setting > > appropriate layout, etc. Effectively for readdirp to replace lookup, it > > should be calling dht_lookup on each of the dentry it is passing back to > > application. > > OK. > > > > > > > As far as AFR is concerned, it indirectly forces LOOKUP on entries which > > > are > > > being retrieved for the first time through a READDIRP (and as a result do > > > not have their inode ctx etc initialised yet) by setting entry->inode to > > > NULL. See afr_readdir_transform_entries(). > > > Hmm. Then we already have "disabled" readdirp through code :). Without an > > inode corresponding to entry, readdirp will be effectively readdir > > stripping > > any performance benefits by having readdirp as a "batched" lookup (of all > > the dentries). > No. Not every single READDIRP will be transformed into a READDIR by AFR. AFR > resets the inode corresponding to an entry, before responding to its parent, > _only_ under the following two conditions: > 1) if this entry in question is being retrieved by this client for the first > time through a READDIRP. In other words, this client has not _yet_ performed > a LOOKUP on it. > 2) if that sub-volume of AFR on which the parent directory is being > READDIRP'd (remember AFR would only need to serve inode and directory reads > from one of the replicas) does _not_ contain a good copy of the entry. > In other words this entry needs to be healed on parent's read child. This is > because we do not want the caching translators or the application itself to > get incorrect entry attributes. Thanks Krutika. We'll be borrowing the idea of setting entry->inode to NULL, when dht determines that inode needs to be healed. Since afr is already doing that for all dentries during first READDIRP (barring any lookups on that inode before), I don't think doing this will have any further performance degradation (As most of the setups will be distributed-replicated). > > This means that more often than not, AFR _would_ be leaving the inode > corresponding to the entry as it is, and not setting it to NULL. > > > > This is the default behavior which is being made optional as part of > > > http://review.gluster.org/#/c/11846/ which is still under review (see BZ > > > 1250803, a performance bug :) ). > > > If it is made optional, when we enable setting entry->inode we still see > > consistency issues. Also, it seems to me that there is no point in having > > each individual xlator option controlling this behaviour. Instead we can > > make each xlator behave in compliance to global mount option > > "--use-readdirp=yes/no". Is there any specific reason to have an option to > > control this behaviour in afr? > Agreed that there will be consistency issues. > The reason to move away from letting 1) and 2) above be the default behavior > is performance. :) And I guess it is also partly because AFR-v1 does not > have these restrictions in READDIRP. But the patch is still under review. > > -Krutika > > > > > > > -Krutika > > > > > > ----- Original Message ----- > > > > > > > From: "Mohammed Rafi K C" <rkavunga@xxxxxxxxxx> > > > > To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > > > > Cc: "Dan Lambright" <dlambrig@xxxxxxxxxx>, "Nithya Balachandran" > > > > <nbalacha@xxxxxxxxxx>, "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, > > > > "Ben > > > > Turner" <bturner@xxxxxxxxxx>, "Ben England" <bengland@xxxxxxxxxx>, > > > > "Manoj > > > > Pillai" <mpillai@xxxxxxxxxx>, "Pranith Kumar Karampuri" > > > > <pkarampu@xxxxxxxxxx>, "Ravishankar Narayanankutty" > > > > <ranaraya@xxxxxxxxxx>, > > > > kdhananj@xxxxxxxxxx, xhernandez@xxxxxxxxxx > > > > Sent: Wednesday, August 12, 2015 7:29:48 PM > > > > Subject: Inconsistent behavior due to lack of lookup on entry followed > > > > by > > > > readdirp > > > > > > > Hi All, > > > > > > > We are facing some inconsistent behavior for fops like rename, unlink > > > > etc due to lack of lookup followed by a readdirp, more specifically if > > > > inodes/gfid are populated via readdirp call and this nodeid is shared > > > > with kernal, md-cache will cache this based on base-name. Then > > > > subsequent named lookup will be served from md-cache and it winds-back > > > > immediately. So there is a chance to have an FOP triggered with out > > > > having a lookup on an entry. DHT does lot of things like creating link > > > > files and populate inode_ctx etc, during lookup. In such scenario it is > > > > must to have at least one lookup to be happened on an entry. Since > > > > readdirp preventing the lookup, it has been very hard for fops to > > > > proceed without a first lookup on the entry. We are also suspecting > > > > some > > > > problems due to same with afr/ec self healing also. So If we remove > > > > readdirp from md-cache ([1], [2]) it causes, an additional hop for > > > > first > > > > lookup for every entry. I'm mostly concerned with this one extra > > > > network > > > > call, and the performance degradation caused by the same. > > > > > > > Now with this, the only advantage with readdirp is, it removes one > > > > context switch between kernal and userspace. Is it really worth to > > > > sacrifice this for consistency ? > > > > > > > What do you think about removing readdirp functionality? > > > > > > > Please provide your input/suggestion/ideas. > > > > > > > [1] : http://review.gluster.org/#/c/11892/ > > > > > > > [2] : http://review.gluster.org/#/c/11894/ > > > > > > > Thanks in Advance > > > > Rafi KC > > > > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel