Re: Readdir d_off encoding

Anand Avati <avati@xxxxxxxxxxx> · Tue, 16 Dec 2014 18:12:19 +0000

On Tue Dec 16 2014 at 8:46:48 AM Shyam <srangana@xxxxxxxxxx> wrote:
On 12/15/2014 09:06 PM, Anand Avati wrote:

> Replies inline

>

> On Mon Dec 15 2014 at 12:46:41 PM Shyam <srangana@xxxxxxxxxx> wrote:

>

>     With the changes present in [1] and [2],

>

>     A short explanation of the change would be, we encode the subvol ID in

>     the d_off, losing 'n + 1' bits in case the high order n+1 bits of the

>     underlying xlator returned d_off is not free. (Best to read the commit

>     message for [1] :) )

>

>     Although not related to the latest patch, here is something to consider

>     for the future:

>

>     We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol

>     encoding in the returned readdir offset. Due to this, the loss in bits

>     _may_ cause unwanted offset behavior, when used in the current scheme.

>     As we would end up eating more bits than what we do at present.

>

>     Or IOW, we could be invalidating the assumption "both EXT4/XFS are

>     tolerant in terms of the accuracy of the value presented

>     back in seekdir().

>

>

> XFS has not been a problem, since it always returns 32bit d_off. With

> Ext4, it has been noted that it is tolerant to sacrificing the lower

> bits in accuracy.

>

>     i.e, a seekdir(val) actually seeks to the entry which

>     has the "closest" true offset."

>

>     Should we reconsider an in memory _cookie_ like approach that can help

>     in this case?

>

>     It would invalidate (some or all based on the implementation) the

>     following constraints that the current design resolves, (from, [1])

>     - Nothing to "remember in memory" or evict "old entries".

>     - Works fine across NFS server reboots and also NFS head failover.

>     - Tolerant to seekdir() to arbitrary locations.

>

>     But, would provide a more reliable readdir offset for use (when valid

>     and not evicted, say).

>

>     How would NFS adapt to this? Does Ganesha need a better scheme when

>     doing multi-head NFS fail over?

>

>

> Ganesha just offloads the responsibility to the FSAL layer to give

> stable dir cookies (as it rightly should)

>

>

>     Thoughts?

>

>

> I think we need to analyze the actual assumption/problem here.

> Remembering things in memory comes with the limitations you note above,

> and may after all, still not be necessary. Let's look at the two

> approaches taken:

>

> - Small backend offsets: like XFS, the offsets fit in 32bits, and we are

> left with another 32bits of freedom to encode what we want. There is no

> problem here until our nested encoding requirements cross 32bits of

> space. So let's ignore this for now.

>

> - Large backend offsets: Ext4 being the primary target. Here we observe

> that the backend filesystem is tolerant to sacrificing the accuracy of

> lower bits. So we overwrite the lower bits with our subvolume encoding

> information, and the number of bits used to encode is implicit in the

> subvolume cardinality of that translator. While this works fine with a

> single transformation, it is clearly a problem when the transformation

> is nested with the same algorithm. The reason is quite simple: while the

> lower bits were disposable when the cookie was taken fresh from Ext4,

> once transformed the same lower bits are now "holy" and cannot be

> overwritten carelessly, at least without dire consequences. The higher

> level xlators need to take up the "next higher bits", past the previous

> transformation boundary, to encode the next subvolume information. Once

> the d_off transformation algorithms are fixed to give such due "respect"

> to the lower layer's transformation and use a different real estate, we

> might actually notice that the problem may not need such a deep redesign

> after all.
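
A minimal sketch of the nesting idea described above, not the actual DHT/AFR code. The names `encode` and `decode`, and the explicit `shift`/`width` parameters, are hypothetical; the only assumption taken from the text is that each layer writes its subvolume ID into its own bit range, starting just past the range used by the layer below it.

```python
MASK64 = (1 << 64) - 1  # d_off is treated as an unsigned 64-bit value


def encode(d_off, subvol_id, shift, width):
    """Overwrite `width` bits of d_off, starting at bit `shift`, with subvol_id."""
    assert 0 <= subvol_id < (1 << width), "subvol_id does not fit in width"
    field = ((1 << width) - 1) << shift
    return ((d_off & ~field) | (subvol_id << shift)) & MASK64


def decode(d_off, shift, width):
    """Read back the subvolume ID stored at bits [shift, shift + width)."""
    return (d_off >> shift) & ((1 << width) - 1)


# Hypothetical two-layer stack: an inner layer with 6 subvolumes (3 bits)
# and an outer layer with 128 subvolumes (7 bits).
backend = 0x123456789ABCDEF0          # made-up d_off from the backend
inner = encode(backend, 5, shift=0, width=3)

# Respectful nesting: the outer layer starts where the inner one ended.
nested = encode(inner, 100, shift=3, width=7)

# Careless nesting: the outer layer reuses the low bits and destroys
# the inner layer's subvolume ID.
clobbered = encode(inner, 100, shift=0, width=7)
```

In `nested` both IDs can be read back with `decode`, while in `clobbered` the inner layer's ID is lost, which is the failure mode described for nesting the same algorithm.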

Agreed, my lack of understanding though is how many bits can be

sacrificed for ext4? I do not have that data, any pointers there would

help. (did go through https://lwn.net/Articles/544520/ but that does not

have the tolerance information in it)

Here is what I have as the current bits lost based on the following

volume configuration,

- 2 Tiers (DHT over DHT)

- 128 subvols per DHT

- Each DHT instance is either AFR or EC subvolumes, with 2 replicas and

say 6 bricks per EC instance

So the EC side of the subvol needs ceil(log2(6)) (EC) + log2(128) (DHT) + log2(2)

(Tier) = 3 + 7 + 1, or 11 bits of the actual d_off used to encode the

volume, +1 for the high order bit to denote the encoding. (AFR would

have 1 bit less, so we can consider just the EC side of things for the

maximum loss computation at present)

Is 12 bits still a tolerable loss for ext4? Or, up to how many bits can

we still use the current scheme?

If we move to 1000/10000 node gluster in 4.0, assuming everything

remains the same except DHT, we need an additional 3-5 bits for the DHT

subvol encoding. Would this still survive the ext4 encoding scheme for

d_off?
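
The bit count above can be reproduced with a few lines. This is only a sketch: `bits_for` and `d_off_bits_lost` are hypothetical helper names, and the single flag bit stands for the high order bit that denotes the encoding.

```python
def bits_for(cardinality):
    """Bits needed to tell apart `cardinality` subvolumes (0 for a single one)."""
    if cardinality < 1:
        raise ValueError("cardinality must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (cardinality - 1).bit_length()


def d_off_bits_lost(layer_cardinalities, flag_bits=1):
    """Total d_off bits consumed by a stack of layers plus the flag bit(s)."""
    return sum(bits_for(n) for n in layer_cardinalities) + flag_bits
```

For the EC path of the configuration above, `d_off_bits_lost([6, 128, 2])` gives 3 + 7 + 1 + 1 = 12. Changing the DHT entry shows how the total grows with the number of DHT subvolumes.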

In theory, we need at least log_base2(#of bricks) bits for storing the information. If we are creative enough, in making the various layers co-operate, we could get away with just that minimum, independent of the number of xlator layers.

One example approach (not necessarily the best): Make every xlator know the total number of leaf xlators (protocol/clients), and also the number of leaf xlators under each of its subvolumes. This way, the protocol/client xlators (alone) do the encoding, each knowing its global brick# and the total # of bricks. The cluster xlators blindly forward the readdir_cbk without any further transformation of the d_offs, and route the next readdir(old_doff) request to the appropriate subvolume based on the weighted graph (of counts of protocol/clients in the subtrees) until it reaches the right protocol/client to resume the enumeration.
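
A rough sketch of that approach, under stated assumptions: the classes `Leaf` and `Cluster` and the functions `leaf_encode`, `leaf_decode` and `route` are hypothetical names, not Gluster APIs, and placing the brick# in the low bits of the d_off is an assumption borrowed from the current scheme.

```python
class Leaf:
    """Stands in for a protocol/client xlator."""
    def __init__(self, name):
        self.name = name
        self.leaves = 1


class Cluster:
    """Stands in for a cluster xlator (DHT, AFR, EC, tier)."""
    def __init__(self, name, children):
        self.name = name
        self.children = children
        # Weight of this subtree: number of protocol/clients below it.
        self.leaves = sum(child.leaves for child in children)


def leaf_encode(d_off, brick, total_bricks):
    """Done only at the leaf: put the global brick# in the low bits."""
    width = (total_bricks - 1).bit_length()
    return (d_off & ~((1 << width) - 1)) | brick


def leaf_decode(d_off, total_bricks):
    """Recover the global brick# from an encoded d_off."""
    width = (total_bricks - 1).bit_length()
    return d_off & ((1 << width) - 1)


def route(node, brick):
    """Walk down by subtree weights; return the xlator names on the path."""
    if not 0 <= brick < node.leaves:
        raise ValueError("brick index out of range")
    path = [node.name]
    while isinstance(node, Cluster):
        for child in node.children:
            if brick < child.leaves:
                node = child
                break
            brick -= child.leaves  # skip this subtree's bricks
        path.append(node.name)
    return path
```

The cluster layers never touch the d_off here: they only read the brick# and compare it with the leaf counts of their subvolumes, so the bits used stay at log2(# of bricks) regardless of how many layers are stacked.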

There may be better or even simpler approaches too (especially ones that do not need global awareness of xlator counts). Finding such a stateless solution that remains NFS friendly is well worth the effort IMO.

Thanks

>

> Hope that helps

> Thanks

>

>     Shyam

>     [1] http://review.gluster.org/#/c/4711/

>     [2] http://review.gluster.org/#/c/8201/

>     _________________________________________________

>     Gluster-devel mailing list

>     Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@gluster.org>

>     http://supercolony.gluster.org/mailman/listinfo/gluster-devel

>
