On Tue, Dec 16, 2014 at 04:22:20PM -0500, Shyam wrote:
> On 12/16/2014 03:21 PM, J. Bruce Fields wrote:
> >On Tue, Dec 16, 2014 at 11:46:46AM -0500, Shyam wrote:
> >>On 12/15/2014 09:06 PM, Anand Avati wrote:
> >>>Replies inline
> >>>
> >>>On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana@xxxxxxxxxx> wrote:
> >>>
> >>>    With the changes present in [1] and [2],
> >>>
> >>>    A short explanation of the change would be: we encode the subvol ID
> >>>    in the d_off, losing 'n + 1' bits in case the high order n+1 bits of
> >>>    the underlying xlator's returned d_off are not free. (Best to read
> >>>    the commit message for [1] :) )
> >>>
> >>>    Although not related to the latest patch, here is something to
> >>>    consider for the future:
> >>>
> >>>    We now have DHT, AFR, EC(?), and DHT over DHT (Tier), which need
> >>>    subvol encoding in the returned readdir offset. Due to this, the
> >>>    loss in bits _may_ cause unwanted offset behavior when used in the
> >>>    current scheme, as we would end up eating more bits than we do at
> >>>    present.
> >>>
> >>>    Or, IOW, we could be invalidating the assumption "both EXT4/XFS are
> >>>    tolerant in terms of the accuracy of the value presented
> >>>    back in seekdir().
> >>>
> >>>
> >>>XFS has not been a problem, since it always returns a 32-bit d_off. With
> >>>Ext4, it has been noted that it is tolerant to sacrificing the lower
> >>>bits in accuracy.
> >>>
> >>>    i.e., a seekdir(val) actually seeks to the entry which
> >>>    has the "closest" true offset."
> >>>
> >>>    Should we reconsider an in-memory _cookie_-like approach that can
> >>>    help in this case?
> >>>
> >>>    It would invalidate (some or all, based on the implementation) the
> >>>    following constraints that the current design resolves (from [1]):
> >>>    - Nothing to "remember in memory" or evict "old entries".
> >>>    - Works fine across NFS server reboots and also NFS head failover.
> >>>    - Tolerant to seekdir() to arbitrary locations.
> >>>
> >>>    But it would provide a more reliable readdir offset for use (when
> >>>    valid and not evicted, say).
> >>>
> >>>    How would NFS adapt to this? Does Ganesha need a better scheme when
> >>>    doing multi-head NFS failover?
> >>>
> >>>
> >>>Ganesha just offloads the responsibility to the FSAL layer to give
> >>>stable dir cookies (as it rightly should).
> >>>
> >>>
> >>>    Thoughts?
> >>>
> >>>
> >>>I think we need to analyze the actual assumption/problem here.
> >>>Remembering things in memory comes with the limitations you note above,
> >>>and may, after all, still not be necessary. Let's look at the two
> >>>approaches taken:
> >>>
> >>>- Small backend offsets: like XFS, the offsets fit in 32 bits, and we
> >>>are left with another 32 bits of freedom to encode what we want. There
> >>>is no problem here until our nested encoding requirements cross 32 bits
> >>>of space. So let's ignore this for now.
> >>>
> >>>- Large backend offsets: Ext4 being the primary target. Here we observe
> >>>that the backend filesystem is tolerant to sacrificing the accuracy of
> >>>the lower bits. So we overwrite the lower bits with our subvolume
> >>>encoding information, and the number of bits used to encode is implicit
> >>>in the subvolume cardinality of that translator. While this works fine
> >>>with a single transformation, it is clearly a problem when the
> >>>transformation is nested with the same algorithm. The reason is quite
> >>>simple: while the lower bits were disposable when the cookie was taken
> >>>fresh from Ext4, once transformed, the same lower bits are now "holy"
> >>>and cannot be overwritten carelessly, at least not without dire
> >>>consequences. The higher level xlators need to take up the "next higher
> >>>bits", past the previous transformation boundary, to encode the next
> >>>subvolume information. Once the d_off transformation algorithms are
> >>>fixed to give such due "respect" to the lower layer's transformation
> >>>and use a different real estate, we might actually notice that the
> >>>problem may not need such a deep redesign after all.
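For concreteness, here is a rough C sketch of the kind of layered
transform described above. It is illustrative only, not the actual
gluster code; the names subvol_bits, doff_encode, doff_decode and
bits_used are made up for the example, and the high-order "this d_off is
encoded" flag bit mentioned below is left out for brevity. The idea is
that each layer claims ceil(log2(nsubvols)) bits for its subvol index,
and a nested layer places its bits just above whatever its child already
consumed instead of overwriting them.

/* Illustrative sketch only, not the GlusterFS implementation. */
#include <stdint.h>

/* Number of bits needed to address nsubvols children. */
static inline int subvol_bits(unsigned int nsubvols)
{
        int bits = 0;

        while ((1u << bits) < nsubvols)
                bits++;
        return bits;
}

/*
 * Pack subvol_idx into d_off, just above the bits_used low-order bits
 * already claimed by lower layers.  A layer sitting directly on the
 * backend filesystem passes bits_used = 0 and overwrites the (disposable)
 * low bits of the Ext4 cookie; a layer stacked on another transforming
 * layer must pass the child's total so it does not clobber the child's
 * encoding.
 */
static inline uint64_t doff_encode(uint64_t d_off, unsigned int subvol_idx,
                                   unsigned int nsubvols, int bits_used)
{
        int      bits = subvol_bits(nsubvols);
        uint64_t mask = (((uint64_t)1 << bits) - 1) << bits_used;

        return (d_off & ~mask) | ((uint64_t)subvol_idx << bits_used);
}

/* Recover the subvol index packed by the matching doff_encode() call. */
static inline unsigned int doff_decode(uint64_t d_off, unsigned int nsubvols,
                                       int bits_used)
{
        int bits = subvol_bits(nsubvols);

        return (unsigned int)((d_off >> bits_used) &
                              (((uint64_t)1 << bits) - 1));
}

Under this sketch, the tiered configuration discussed below would compose
as EC with bits_used = 0 (3 bits), DHT above it with bits_used = 3
(7 bits), and Tier above that with bits_used = 10 (1 bit), which is where
the 11-12 bit budget in the next mail comes from.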
> >>Agreed; my lack of understanding, though, is how many bits can be
> >>sacrificed for ext4? I do not have that data, any pointers there would
> >>help. (I did go through https://lwn.net/Articles/544520/ but that does
> >>not have the tolerance information in it.)
> >>
> >>Here is what I have as the current bits lost, based on the following
> >>volume configuration:
> >>- 2 Tiers (DHT over DHT)
> >>- 128 subvols per DHT
> >>- Each DHT instance is either AFR or EC subvolumes, with 2 replicas
> >>and, say, 6 bricks per EC instance
> >>
> >>So the EC side of the volume needs log2(6) (EC) + log2(128) (DHT) +
> >>log2(2) (Tier) = 3 + 7 + 1, or 11 bits of the actual d_off used to
> >>encode the volume, +1 for the high order bit to denote the encoding.
> >>(AFR would have 1 bit less, so we can consider just the EC side of
> >>things for the maximum loss computation at present.)
> >>
> >>Is 12 bits still a tolerable loss for ext4? Or, up to how many bits
> >>can we still use the current scheme?
> >
> >To a first approximation, assume ext4's offsets are random numbers (it's
> >actually some kind of weak cryptographic hash). So in a directory with
> >n entries, you're basically picking n random numbers out of a hat and
> >hoping you never pick the same number twice.
> >
> >So, following preshing.com/20110504/hash-collision-probabilities, the
> >chance that two entries in an n-entry directory will have the same b-bit
> >cookie is (for small n/2^b) roughly
> >
> >    n^2 / 2^(b+1)
> >
> >I think 52 bits is still a lot. Even on million-entry directories I
> >only get a one-in-ten-thousand chance. But play with the numbers and
> >make sure I haven't messed up somewhere.
> 
> Well, I did the (different) math on the following premise, and here is
> what I have.
> 
> For an even distribution of hashes in a given range:
>   n - hash values generated (directory entries)
>   N - maximum value in the hash range
>   s - separation distance of 2 values, (N/n)
>   b - potential bits lost with separation still preserved, floor(log2(s))
> 
> Basically, as we are losing some bits to our encoding (the lower bits of
> the d_off), given a 'k' (as the d_off), finding the nearest hash that it
> could represent would mean losing less than 1/2 the bits of the distance
> separating 2 hash values (assuming an even distribution of hash values).
> Which is what 'b' is above.
> 
> With this, and N set to represent a 64-bit range, we would need ~16
> million entries before losing 20 bits of information starts mattering.

I don't follow the argument, but in any case:

> Considering this, it seems safe enough against ext4 (assuming the hash
> function that is used has the property of distributing values evenly
> across the range (at least to a point))

These are cryptographic hashes, so they behave a lot like random numbers,
and definitely won't be spaced in any convenient way.
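To put rough numbers on the estimate quoted above (a quick illustrative
check, not from the original mails): plugging a million entries and
b = 64 - 12 = 52 usable bits into the n^2 / 2^(b+1) approximation does
give roughly the one-in-ten-thousand figure. A minimal C sketch under
those assumptions:

/* Quick check of the n^2 / 2^(b+1) collision estimate; compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double n = 1e6;         /* million-entry directory                */
        int    b = 64 - 12;     /* 52 bits left after a 12-bit encoding   */
        double p = (n * n) / pow(2.0, b + 1);

        printf("b=%d bits, n=%.0f entries -> P(collision) ~= %.2e\n", b, n, p);
        /* prints roughly 1.1e-04, i.e. about one in ten thousand */
        return 0;
}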
The birthday paradox says that with a 44-bit hash we're more likely than
not to start seeing collisions somewhere around 2^22 directory entries.
That 16-million-entry directory would have a lot of collisions.

Once a directory has a collision, though, we'll only notice if a client
happens to attempt a readdir starting at the bad offset. If they're
reading through the whole directory, they're probably doing it in chunks
of some thousand entries at a time, so the probability of actually seeing
a bug is lower than my estimate.

--b.

> and assuming we even
> grow to 1/2 a million bricks in a gluster volume (~2^19).
> 
> Alert: Above math needs checking and verification.
> 
> >
> >Anyway, a longer-term approach might be to fix the NFS protocol's
> >reliance on stable directory cookies, but of course then you have to
> >wait till the NFS clients get upgraded.
> 
> Also, other file systems may choose to use the d_off cookie
> differently, but I assume that is handled when it appears on the
> horizon.
> 
> >
> >--b.
> 
> Thanks,
> Shyam
> 
> >>
> >>If we move to 1000/10000-node gluster in 4.0, assuming everything
> >>remains the same except DHT, we need an additional 3-5 bits for the
> >>DHT subvol encoding. Would this still survive the ext4 encoding
> >>scheme for d_off?
> >>
> >>>
> >>>Hope that helps
> >>>Thanks
> >>>
> >>>    Shyam
> >>>    [1] http://review.gluster.org/#/c/4711/
> >>>    [2] http://review.gluster.org/#/c/8201/
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
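As a closing cross-check of the numbers in this thread (the bit counts
come from the mails above; the arithmetic and code are illustrative
additions): the birthday bound puts the ~50% collision point near
sqrt(2 * ln 2) * 2^(b/2) entries for b usable cookie bits, which
reproduces the "around 2^22 entries at 44 bits" figure and shows what the
hypothetical gluster 4.0 case (3-5 more DHT bits lost) would look like.

/* Birthday-bound cross-check; bit counts from the thread, compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
        /* 52    : ~12-bit encoding loss on a 64-bit d_off (today's example)
         * 49/47 : the hypothetical 4.0 case with 3-5 extra DHT bits
         * 44    : the figure used in the 2^22 / 16-million-entry example  */
        int bits[] = { 52, 49, 47, 44 };

        for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
                double n50 = sqrt(2.0 * log(2.0)) * pow(2.0, bits[i] / 2.0);

                printf("b = %2d usable bits -> ~50%% collision chance near "
                       "%.1f million entries\n", bits[i], n50 / 1e6);
        }
        return 0;
}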