On Tue, Dec 16, 2014 at 04:22:20PM -0500, Shyam wrote:
> On 12/16/2014 03:21 PM, J. Bruce Fields wrote:
> >On Tue, Dec 16, 2014 at 11:46:46AM -0500, Shyam wrote:
> >>On 12/15/2014 09:06 PM, Anand Avati wrote:
> >>>Replies inline
> >>>
> >>>On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana@xxxxxxxxxx> wrote:
> >>>
> >>>    With the changes present in [1] and [2],
> >>>
> >>>    A short explanation of the change would be: we encode the subvol ID
> >>>    in the d_off, losing 'n + 1' bits in case the high order n+1 bits of
> >>>    the underlying xlator's returned d_off are not free. (Best to read
> >>>    the commit message for [1] :) )
> >>>
> >>>    Although not related to the latest patch, here is something to
> >>>    consider for the future:
> >>>
> >>>    We now have DHT, AFR, EC(?), and DHT over DHT (Tier), which need
> >>>    subvol encoding in the returned readdir offset. Due to this, the
> >>>    loss in bits _may_ cause unwanted offset behavior when used in the
> >>>    current scheme, as we would end up eating more bits than we do at
> >>>    present.
> >>>
> >>>    Or, IOW, we could be invalidating the assumption "both EXT4/XFS are
> >>>    tolerant in terms of the accuracy of the value presented
> >>>    back in seekdir().
> >>>
> >>>
> >>>XFS has not been a problem, since it always returns a 32-bit d_off. With
> >>>Ext4, it has been noted that it is tolerant to sacrificing the lower
> >>>bits in accuracy.
> >>>
> >>>    i.e., a seekdir(val) actually seeks to the entry which
> >>>    has the "closest" true offset."
> >>>
> >>>    Should we reconsider an in-memory _cookie_-like approach that can
> >>>    help in this case?
> >>>
> >>>    It would invalidate (some or all, based on the implementation) the
> >>>    following constraints that the current design resolves (from [1]):
> >>>    - Nothing to "remember in memory" or evict "old entries".
> >>>    - Works fine across NFS server reboots and also NFS head failover.
> >>>    - Tolerant to seekdir() to arbitrary locations.
> >>>
> >>>    But it would provide a more reliable readdir offset for use (when
> >>>    valid and not evicted, say).
> >>>
> >>>    How would NFS adapt to this? Does Ganesha need a better scheme when
> >>>    doing multi-head NFS failover?
> >>>
> >>>
> >>>Ganesha just offloads the responsibility to the FSAL layer to give
> >>>stable dir cookies (as it rightly should).
> >>>
> >>>
> >>>    Thoughts?
> >>>
> >>>
> >>>I think we need to analyze the actual assumption/problem here.
> >>>Remembering things in memory comes with the limitations you note above,
> >>>and may, after all, still not be necessary. Let's look at the two
> >>>approaches taken:
> >>>
> >>>- Small backend offsets: like XFS, the offsets fit in 32 bits, and we
> >>>are left with another 32 bits of freedom to encode what we want. There
> >>>is no problem here until our nested encoding requirements cross 32 bits
> >>>of space. So let's ignore this for now.
> >>>
> >>>- Large backend offsets: Ext4 being the primary target. Here we observe
> >>>that the backend filesystem is tolerant to sacrificing the accuracy of
> >>>the lower bits. So we overwrite the lower bits with our subvolume
> >>>encoding information, and the number of bits used to encode is implicit
> >>>in the subvolume cardinality of that translator. While this works fine
> >>>with a single transformation, it is clearly a problem when the
> >>>transformation is nested with the same algorithm. The reason is quite
> >>>simple: while the lower bits were disposable when the cookie was taken
> >>>fresh from Ext4, once transformed, the same lower bits are now "holy"
> >>>and cannot be overwritten carelessly, at least not without dire
> >>>consequences. The higher level xlators need to take up the "next higher
> >>>bits", past the previous transformation boundary, to encode the next
> >>>subvolume information. Once the d_off transformation algorithms are
> >>>fixed to give such due "respect" to the lower layer's transformation
> >>>and use a different real estate, we might actually notice that the
> >>>problem may not need such a deep redesign after all.
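For concreteness, here is a rough C sketch of the kind of layered
transform described above. It is illustrative only, not the actual
gluster code; the names subvol_bits, doff_encode, doff_decode and
bits_used are made up for the example, and the high-order "this d_off is
encoded" flag bit mentioned below is left out for brevity. The idea is
that each layer claims ceil(log2(nsubvols)) bits for its subvol index,
and a nested layer places its bits just above whatever its child already
consumed instead of overwriting them.

/* Illustrative sketch only, not the GlusterFS implementation. */
#include <stdint.h>

/* Number of bits needed to address nsubvols children. */
static inline int subvol_bits(unsigned int nsubvols)
{
        int bits = 0;

        while ((1u << bits) < nsubvols)
                bits++;
        return bits;
}

/*
 * Pack subvol_idx into d_off, just above the bits_used low-order bits
 * already claimed by lower layers.  A layer sitting directly on the
 * backend filesystem passes bits_used = 0 and overwrites the (disposable)
 * low bits of the Ext4 cookie; a layer stacked on another transforming
 * layer must pass the child's total so it does not clobber the child's
 * encoding.
 */
static inline uint64_t doff_encode(uint64_t d_off, unsigned int subvol_idx,
                                   unsigned int nsubvols, int bits_used)
{
        int      bits = subvol_bits(nsubvols);
        uint64_t mask = (((uint64_t)1 << bits) - 1) << bits_used;

        return (d_off & ~mask) | ((uint64_t)subvol_idx << bits_used);
}

/* Recover the subvol index packed by the matching doff_encode() call. */
static inline unsigned int doff_decode(uint64_t d_off, unsigned int nsubvols,
                                       int bits_used)
{
        int bits = subvol_bits(nsubvols);

        return (unsigned int)((d_off >> bits_used) &
                              (((uint64_t)1 << bits) - 1));
}

Under this sketch, the tiered configuration discussed below would compose
as EC with bits_used = 0 (3 bits), DHT above it with bits_used = 3
(7 bits), and Tier above that with bits_used = 10 (1 bit), which is where
the 11-12 bit budget in the next mail comes from.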
> >>Agreed; my lack of understanding, though, is how many bits can be
> >>sacrificed for ext4? I do not have that data, any pointers there would
> >>help. (I did go through https://lwn.net/Articles/544520/ but that does
> >>not have the tolerance information in it.)
> >>
> >>Here is what I have as the current bits lost, based on the following
> >>volume configuration:
> >>- 2 Tiers (DHT over DHT)
> >>- 128 subvols per DHT
> >>- Each DHT instance is either AFR or EC subvolumes, with 2 replicas
> >>and, say, 6 bricks per EC instance
> >>
> >>So the EC side of the volume needs log2(6) (EC) + log2(128) (DHT) +
> >>log2(2) (Tier) = 3 + 7 + 1, or 11 bits of the actual d_off used to
> >>encode the volume, +1 for the high order bit to denote the encoding.
> >>(AFR would have 1 bit less, so we can consider just the EC side of
> >>things for the maximum loss computation at present.)
> >>
> >>Is 12 bits still a tolerable loss for ext4? Or, up to how many bits
> >>can we still use the current scheme?
> >
> >To a first approximation, assume ext4's offsets are random numbers (it's
> >actually some kind of weak cryptographic hash). So in a directory with
> >n entries, you're basically picking n random numbers out of a hat and
> >hoping you never pick the same number twice.
> >
> >So, following preshing.com/20110504/hash-collision-probabilities, the
> >chance that two entries in an n-entry directory will have the same b-bit
> >cookie is (for small n/2^b) roughly
> >
> >    n^2 / 2^(b+1)
> >
> >I think 52 bits is still a lot. Even on million-entry directories I
> >only get a one-in-ten-thousand chance. But play with the numbers and
> >make sure I haven't messed up somewhere.
> 
> Well, I did the (different) math on the following premise, and here is
> what I have.
> 
> For an even distribution of hashes in a given range:
>   n - hash values generated (directory entries)
>   N - maximum value in the hash range
>   s - separation distance of 2 values, (N/n)
>   b - potential bits lost with separation still preserved, floor(log2(s))
> 
> Basically, as we are losing some bits to our encoding (the lower bits of
> the d_off), given a 'k' (as the d_off), finding the nearest hash that it
> could represent would mean losing less than 1/2 the bits of the distance
> separating 2 hash values (assuming an even distribution of hash values).
> Which is what 'b' is above.
> 
> With this, and N set to represent a 64-bit range, we would need ~16
> million entries before losing 20 bits of information starts mattering.

I don't follow the argument, but in any case:

> Considering this, it seems safe enough against ext4 (assuming the hash
> function that is used has the property of distributing values evenly
> across the range (at least to a point))

These are cryptographic hashes, so they behave a lot like random numbers,
and definitely won't be spaced in any convenient way.
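To put rough numbers on the estimate quoted above (a quick illustrative
check, not from the original mails): plugging a million entries and
b = 64 - 12 = 52 usable bits into the n^2 / 2^(b+1) approximation does
give roughly the one-in-ten-thousand figure. A minimal C sketch under
those assumptions:

/* Quick check of the n^2 / 2^(b+1) collision estimate; compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double n = 1e6;         /* million-entry directory                */
        int    b = 64 - 12;     /* 52 bits left after a 12-bit encoding   */
        double p = (n * n) / pow(2.0, b + 1);

        printf("b=%d bits, n=%.0f entries -> P(collision) ~= %.2e\n", b, n, p);
        /* prints roughly 1.1e-04, i.e. about one in ten thousand */
        return 0;
}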
The birthday paradox says that with a 44-bit hash we're more likely than
not to start seeing collisions somewhere around 2^22 directory entries.
That 16-million-entry directory would have a lot of collisions.

Once a directory has a collision, though, we'll only notice if a client
happens to attempt a readdir starting at the bad offset. If they're
reading through the whole directory, they're probably doing it in chunks
of some thousand entries at a time, so the probability of actually seeing
a bug is lower than my estimate.

--b.

> and assuming we even
> grow to 1/2 a million bricks in a gluster volume (~2^19).
> 
> Alert: Above math needs checking and verification.
> 
> >
> >Anyway, a longer-term approach might be to fix the NFS protocol's
> >reliance on stable directory cookies, but of course then you have to
> >wait till the NFS clients get upgraded.
> 
> Also, other file systems may choose to use the d_off cookie
> differently, but I assume that is handled when it appears on the
> horizon.
> 
> >
> >--b.
> 
> Thanks,
> Shyam
> 
> >>
> >>If we move to 1000/10000-node gluster in 4.0, assuming everything
> >>remains the same except DHT, we need an additional 3-5 bits for the
> >>DHT subvol encoding. Would this still survive the ext4 encoding
> >>scheme for d_off?
> >>
> >>>
> >>>Hope that helps
> >>>Thanks
> >>>
> >>>    Shyam
> >>>    [1] http://review.gluster.org/#/c/4711/
> >>>    [2] http://review.gluster.org/#/c/8201/
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
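As a closing cross-check of the numbers in this thread (the bit counts
come from the mails above; the arithmetic and code are illustrative
additions): the birthday bound puts the ~50% collision point near
sqrt(2 * ln 2) * 2^(b/2) entries for b usable cookie bits, which
reproduces the "around 2^22 entries at 44 bits" figure and shows what the
hypothetical gluster 4.0 case (3-5 more DHT bits lost) would look like.

/* Birthday-bound cross-check; bit counts from the thread, compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
        /* 52    : ~12-bit encoding loss on a 64-bit d_off (today's example)
         * 49/47 : the hypothetical 4.0 case with 3-5 extra DHT bits
         * 44    : the figure used in the 2^22 / 16-million-entry example  */
        int bits[] = { 52, 49, 47, 44 };

        for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
                double n50 = sqrt(2.0 * log(2.0)) * pow(2.0, bits[i] / 2.0);

                printf("b = %2d usable bits -> ~50%% collision chance near "
                       "%.1f million entries\n", bits[i], n50 / 1e6);
        }
        return 0;
}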