.. also keep in mind we may want more than two DHT layers if we spin up the
data classification project in the future.

----- Original Message -----
> From: "Shyam" <srangana@xxxxxxxxxx>
> To: "Anand Avati" <avati@xxxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Soumya Koduri" <skoduri@xxxxxxxxxx>
> Sent: Tuesday, December 16, 2014 11:46:46 AM
> Subject: Re: Readdir d_off encoding
>
> On 12/15/2014 09:06 PM, Anand Avati wrote:
> > Replies inline
> >
> > On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana@xxxxxxxxxx> wrote:
> >
> >     With the changes present in [1] and [2],
> >
> >     A short explanation of the change would be, we encode the subvol ID in
> >     the d_off, losing 'n + 1' bits in case the high order n+1 bits of the
> >     underlying xlator returned d_off is not free. (Best to read the commit
> >     message for [1] :) )
> >
> >     Although not related to the latest patch, here is something to consider
> >     for the future:
> >
> >     We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol
> >     encoding in the returned readdir offset. Due to this, the loss in bits
> >     _may_ cause unwanted offset behavior, when used in the current scheme.
> >     As we would end up eating more bits than what we do at present.
> >
> >     Or IOW, we could be invalidating the assumption "both EXT4/XFS are
> >     tolerant in terms of the accuracy of the value presented
> >     back in seekdir().
> >
> >
> > XFS has not been a problem, since it always returns 32bit d_off. With
> > Ext4, it has been noted that it is tolerant to sacrificing the lower
> > bits in accuracy.
> >
> >     i.e, a seekdir(val) actually seeks to the entry which
> >     has the "closest" true offset."
> >
> >     Should we reconsider an in memory _cookie_ like approach that can help
> >     in this case?
> >
> >     It would invalidate (some or all based on the implementation) the
> >     following constraints that the current design resolves, (from, [1])
> >     - Nothing to "remember in memory" or evict "old entries".
> >     - Works fine across NFS server reboots and also NFS head failover.
> >     - Tolerant to seekdir() to arbitrary locations.
> >
> >     But, would provide a more reliable readdir offset for use (when valid
> >     and not evicted, say).
> >
> >     How would NFS adapt to this? Does Ganesha need a better scheme when
> >     doing multi-head NFS fail over?
> >
> >
> > Ganesha just offloads the responsibility to the FSAL layer to give
> > stable dir cookies (as it rightly should)
> >
> >
> >     Thoughts?
> >
> >
> > I think we need to analyze the actual assumption/problem here.
> > Remembering things in memory comes with the limitations you note above,
> > and may after all, still not be necessary. Let's look at the two
> > approaches taken:
> >
> > - Small backend offsets: like XFS, the offsets fit in 32bits, and we are
> > left with another 32bits of freedom to encode what we want. There is no
> > problem here until our nested encoding requirements cross 32bits of
> > space. So let's ignore this for now.
> >
> > - Large backend offsets: Ext4 being the primary target. Here we observe
> > that the backend filesystem is tolerant to sacrificing the accuracy of
> > lower bits. So we overwrite the lower bits with our subvolume encoding
> > information, and the number of bits used to encode is implicit in the
> > subvolume cardinality of that translator. While this works fine with a
> > single transformation, it is clearly a problem when the transformation
> > is nested with the same algorithm. The reason is quite simple: while the
> > lower bits were disposable when the cookie was taken fresh from Ext4,
> > once transformed the same lower bits are now "holy" and cannot be
> > overwritten carelessly, at least without dire consequences.
> > The higher level xlators need to take up the "next higher bits", past
> > the previous transformation boundary, to encode the next subvolume
> > information. Once the d_off transformation algorithms are fixed to give
> > such due "respect" to the lower layer's transformation and use a
> > different real estate, we might actually notice that the problem may not
> > need such a deep redesign after all.
>
> Agreed, my lack of understanding though is how many bits can be
> sacrificed for ext4? I do not have that data, any pointers there would
> help. (did go through https://lwn.net/Articles/544520/ but that does not
> have the tolerance information in it)
>
> Here is what I have as the current bits lost based on the following
> volume configuration,
> - 2 Tiers (DHT over DHT)
> - 128 subvols per DHT
> - Each DHT instance is either AFR or EC subvolumes, with 2 replicas and
>   say 6 bricks per EC instance
>
> So EC side of the subvol needs log2(6) (EC) + log2(128) (DHT) + log2(2)
> (Tier), rounded up per layer = 3 + 7 + 1, or 11 bits of the actual d_off
> used to encode the volume, +1 for the high order bit to denote the
> encoding. (AFR would have 1 bit less, so we can consider just the EC side
> of things for the maximum loss computation at present)
>
> Is 12 bits still a tolerable loss for ext4? Or, till how many bits can
> we still use the current scheme?
>
> If we move to 1000/10000 node gluster in 4.0, assuming everything
> remains the same except DHT, we need an additional 3-5 bits for the DHT
> subvol encoding. Would this still survive the ext4 encoding scheme for
> d_off?
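
The bit budget above, and the suggestion that each layer should claim the "next higher bits" rather than overwrite the previous layer's field, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the GlusterFS implementation: the names bits_for, encoding_bits, pack and unpack are hypothetical, each layer is assumed to need the smallest whole number of bits that can name all of its subvolumes, and the single high-order flag bit is counted in the budget but not modelled in pack().

```python
# Hypothetical sketch: per-layer bit budget and non-overlapping low-bit fields.
# Not GlusterFS code; layer names and sizes follow the example configuration.

def bits_for(subvol_count):
    """Smallest number of bits that can name subvol_count subvolumes."""
    return (subvol_count - 1).bit_length()

def encoding_bits(layers):
    """Bits taken from d_off: one field per layer plus one flag bit."""
    return sum(bits_for(count) for _, count in layers) + 1

def pack(backend_off, ids, layers):
    """Replace the low bits of backend_off with one field per layer.

    layers is ordered lowest xlator first; each field starts where the
    previous one ended, so no layer overwrites another layer's bits.
    """
    total = sum(bits_for(count) for _, count in layers)
    out = backend_off & ~((1 << total) - 1)  # give up the low 'total' bits
    shift = 0
    for (_, count), subvol_id in zip(layers, ids):
        out |= subvol_id << shift
        shift += bits_for(count)
    return out

def unpack(d_off, layers):
    """Recover the per-layer subvolume ids written by pack()."""
    ids = []
    shift = 0
    for _, count in layers:
        width = bits_for(count)
        ids.append((d_off >> shift) & ((1 << width) - 1))
        shift += width
    return ids

# Example configuration from the message: EC or AFR under DHT under Tier.
EC_STACK = [("EC", 6), ("DHT", 128), ("Tier", 2)]
AFR_STACK = [("AFR", 2), ("DHT", 128), ("Tier", 2)]
```

With these assumptions encoding_bits(EC_STACK) evaluates to 12, matching the figure in the message, so the EC side remains the worst case of the two stacks. Swapping 128 for 1000 or 10000 in the DHT entry shows how the budget grows with the DHT layer alone. Whether ext4 tolerates losing that many low bits is the open question in the thread and is not something this sketch can answer.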
>
> >
> > Hope that helps
> > Thanks
> >
> >     Shyam
> >     [1] http://review.gluster.org/#/c/4711/
> >     [2] http://review.gluster.org/#/c/8201/
> >     _______________________________________________
> >     Gluster-devel mailing list
> >     Gluster-devel@xxxxxxxxxxx
> >     http://supercolony.gluster.org/mailman/listinfo/gluster-devel