Re: Readdir d_off encoding

Shyam <srangana@xxxxxxxxxx> · Tue, 16 Dec 2014 11:46:46 -0500

On 12/15/2014 09:06 PM, Anand Avati wrote:
Replies inline

On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana@xxxxxxxxxx
<mailto:srangana@xxxxxxxxxx>> wrote:

    With the changes present in [1] and [2],

    A short explanation of the change would be, we encode the subvol ID in
    the d_off, losing 'n + 1' bits in case the high order n+1 bits of the
    underlying xlator returned d_off is not free. (Best to read the commit
    message for [1] :) )

    Although not related to the latest patch, here is something to consider
    for the future:

    We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol
    encoding in the returned readdir offset. Due to this, the loss in bits
    _may_ cause unwanted offset behavior, when used in the current scheme.
    As we would end up eating more bits than what we do at present.

    Or IOW, we could be invalidating the assumption "both EXT4/XFS are
    tolerant in terms of the accuracy of the value presented
    back in seekdir().

XFS has not been a problem, since it always returns 32bit d_off. With
Ext4, it has been noted that it is tolerant to sacrificing the lower
bits in accuracy.

    i.e, a seekdir(val) actually seeks to the entry which
    has the "closest" true offset."

    Should we reconsider an in memory _cookie_ like approach that can help
    in this case?

    It would invalidate (some or all based on the implementation) the
    following constraints that the current design resolves, (from, [1])
    - Nothing to "remember in memory" or evict "old entries".
    - Works fine across NFS server reboots and also NFS head failover.
    - Tolerant to seekdir() to arbitrary locations.

    But, would provide a more reliable readdir offset for use (when valid
    and not evicted, say).

    How would NFS adapt to this? Does Ganesha need a better scheme when
    doing multi-head NFS fail over?

Ganesha just offloads the responsibility to the FSAL layer to give
stable dir cookies (as it rightly should)

    Thoughts?

I think we need to analyze the actual assumption/problem here.
Remembering things in memory comes with the limitations you note above,
and may after all, still not be necessary. Let's look at the two
approaches taken:

- Small backend offsets: like XFS, the offsets fit in 32bits, and we are
left with another 32bits of freedom to encode what we want. There is no
problem here until our nested encoding requirements cross 32bits of
space. So let's ignore this for now.

- Large backend offsets: Ext4 being the primary target. Here we observe
that the backend filesystem is tolerant to sacrificing the accuracy of
lower bits. So we overwrite the lower bits with our subvolume encoding
information, and the number of bits used to encode is implicit in the
subvolume cardinality of that translator. While this works fine with a
single transformation, it is clearly a problem when the transformation
is nested with the same algorithm. The reason is quite simple: while the
lower bits were disposable when the cookie was taken fresh from Ext4,
once transformed the same lower bits are now "holy" and cannot be
overwritten carelessly, at least without dire consequences. The higher
level xlators need to take up the "next higher bits", past the previous
transformation boundary, to encode the next subvolume information. Once
the d_off transformation algorithms are fixed to give such due "respect"
to the lower layer's transformation and use a different real estate, we
might actually notice that the problem may not need such a deep redesign
after all.

Agreed, my lack of understanding though is how may bits can be 
sacrificed for ext4? I do not have that data, any pointers there would 
help. (did go through https://lwn.net/Articles/544520/ but that does not 
have the tolerance information in it)

Here is what I have as the current bits lost based on the following 
volume configuration,
- 2 Tiers (DHT over DHT)
- 128 subvols per DHT
- Each DHT instance is either AFR or EC subvolumes, with 2 replicas and 
say 6 bricks per EC instance

So EC side of the subvol needs log(2)6 (EC) + log(2)128 (DHT) + log(2)2 
(Tier) = 3 + 7 + 1, or 11 bits of the actual d_off used to encode the 
volume, +1 for the high order bit to denote the encoding. (AFR would 
have 1 bit less, so we can consider just the EC side of things for the 
maximum loss computation at present)

Is 12 bits still a tolerable loss for ext4? Or, till how many bits can 
we still use the current scheme?

If we move to 1000/10000 node gluster in 4.0, assuming everything 
remains the same except DHT, we need an additional 3-5 bits for the DHT 
subvol encoding. Would this still survive the ext4 encoding scheme for 
d_off?

Hope that helps
Thanks

    Shyam
    [1] http://review.gluster.org/#/c/__4711/
    <http://review.gluster.org/#/c/4711/>
    [2] http://review.gluster.org/#/c/__8201/
    <http://review.gluster.org/#/c/8201/>
    _________________________________________________
    Gluster-devel mailing list
    Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@xxxxxxxxxxx>
    http://supercolony.gluster.__org/mailman/listinfo/gluster-__devel
    <http://supercolony.gluster.org/mailman/listinfo/gluster-devel>

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel