Re: Readdir d_off encoding

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Tue, 23 Dec 2014 11:20:31 +0100

On 12/22/2014 06:41 PM, Jeff Darcy wrote:
An alternative would be to convert directories into regular files from
the brick point of view.

The benefits of this would be:

* d_off would be controlled by gluster, so all bricks would have the
same d_off and order. No need to use any d_off mapping or transformation.

I don't think a full-out change from real directories to virtual ones is
in the cards, but a variant of this idea might be worth exploring further.
If we had a *server side* component to map between on-disk d_off values
and those we present to clients, then it might be able to do a better job
than the local FS of ensuring uniqueness within the bits (e.g. 48 of them)
that are left over after we subtract some for a brick ID.  This could be
enough to make the bit-stealing approach (on the client) viable.  There
are probably some issues with failing over between replicas, which should
have the same files but might not have assigned the same internal d_off
values, but those issues might be avoidable if the d_off values are
deterministic with respect to GFIDs.

Having a server-side xlator seems a better approximation, however I see 
some problems that need to be solved:

The mapper should work on the fly (i.e. it should do the mapping between 
the local d_off to the client d_off without having full knowledge of the 
directory contents). This is a good approach for really big directories 
because doesn't require to waste large amounts of memory, but it will be 
hard to find a way to avoid duplicates, specially if are limited to ~48 
bits.

Making it based on the GFID would be a good way to have common d_off 
between bricks, however maintaining order will be harder. It will also 
be hard to guarantee uniqueness if mapping is deterministic and 
directory is very big. Otherwise it would need to read full directory 
contents before returning mapped d_off's.

To minimize the collision problem, we need to solve the ordering 
problem. If we can guarantee that all bricks return directory entries in 
the same order and d_off, we don't need to reserve some bits in d_off.

I think the virtual directories solution should be the one to consider 
for 4.0. For earlier versions we can try to find an intermediate solution.

Following your idea of a server side component, could this be useful ?

* Keep all directories and its entries in a double linked list stored in 
xattr of each inode.

* Use this linked list to build the readdir answer.

* Use the first 64 (or 63) bits of gfid as the d_off.

* There will be two special offsets: 0 for '.' and 1 for '..'

Example (using shorter gfid's for simplicity):

Directory root with gfid 0001
Directory 'test1' inside root with gfid 1111
Directory 'test2' inside root with gfid 2222
Entry 'entry1' inside 'test1' with gfid 3333
Entry 'entry2' inside 'test1' with gfid 4444
Entry 'entry3' inside 'test2' with gfid 4444
Entry 'entry4' inside 'test2' with gfid 5555
Entry 'entry5' inside 'test2' with gfid 6666

/ (0001)
  test1/ (1111)
    entry1 (3333)
    entry2 (4444)
  test2/ (2222)
    entry3 (4444)
    entry4 (5555)
    entry5 (6666)

Note that entry2 and entry3 are hardlinks.

xattrs of root (0001):
   trusted.dirmap.0001.next = 1111
   trusted.dirmap.0001.prev = 2222

xattrs of 'test1' (1111):
   trusted.dirmap.0001.next = 2222
   trusted.dirmap.0001.prev = 0001
   trusted.dirmap.1111.next = 3333
   trusted.dirmap.1111.prev = 4444

xattrs of 'test2' (2222):
   trusted.dirmap.0001.next = 0001
   trusted.dirmap.0001.prev = 1111
   trusted.dirmap.2222.next = 4444
   trusted.dirmap.2222.prev = 6666

xattrs of 'entry1' (3333):
   trusted.dirmap.1111.next = 4444
   trusted.dirmap.1111.prev = 1111

xattrs of 'entry2'/'entry3' (4444):
   trusted.dirmap.1111.next = 1111
   trusted.dirmap.1111.prev = 3333
   trusted.dirmap.2222.next = 5555
   trusted.dirmap.2222.prev = 2222

xattrs of 'entry4' (5555):
   trusted.dirmap.2222.next = 6666
   trusted.dirmap.2222.prev = 4444

xattrs of 'entry5' (6666):
   trusted.dirmap.2222.next = 2222
   trusted.dirmap.2222.prev = 5555

It's easy to enumerate all entries from the beginning of a directory. 
Also, since we return extra information from each inode in a directory, 
accessing these new xattrs doesn't represent a big impact.

Given a random d_off, it's relatively easy to find a gfid that starts 
with d_off and belongs to the directory (using .glusterfs/xx/yy/...). It 
could also be easy to implement the "nearest" of d_off.

This mechanism guarantees the same order and d_off on all bricks 
(assuming that directory modification operations are done with some 
locks held), and all management is made locally to each brick, being 
transparent to the clients. It also has a very low probability of 
collisions (we can use 63 or 64 bits for d_off instead of 48) and does 
not require any transformation in dht/afr/ec.

However some special operation would need to be implemented to allow 
healing procedures to correctly add a directory entry in the correct 
place to maintain order between bricks when things are not ok.

It adds some complexity to ensure integrity, but since this job is done 
locally on each brick, it seems easier to maintain.

Xavi
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel