Re: Readdir d_off encoding

Using the GFID does not work for d_off. A GFID identifies an inode, while a d_off identifies a directory entry, so using the GFID as an alternative to d_off breaks down as soon as a directory contains more than one hardlink to the same inode.
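
As a plain POSIX illustration (standalone, not Gluster code), the sketch below lists a directory and prints each entry's inode number next to the position telldir() reports after reading it; hardlinked names show the same inode (and would share one GFID on a brick), yet each entry still needs its own resume offset:

/* Illustration only: hardlinks share an inode, but each directory
 * entry needs a distinct offset to resume readdir from. */
#include <stdio.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    DIR *dir = opendir(argc > 1 ? argv[1] : ".");
    struct dirent *ent;

    if (!dir) {
        perror("opendir");
        return 1;
    }
    while ((ent = readdir(dir)) != NULL)
        printf("ino=%lu resume-off=%ld name=%s\n",
               (unsigned long)ent->d_ino, telldir(dir), ent->d_name);
    closedir(dir);
    return 0;
}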

On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 12/22/2014 06:41 PM, Jeff Darcy wrote:
>> An alternative would be to convert directories into regular files from
>> the brick point of view.
>>
>> The benefits of this would be:
>>
>> * d_off would be controlled by gluster, so all bricks would have the
>> same d_off and order. No need to use any d_off mapping or transformation.
>
> I don't think a full-out change from real directories to virtual ones is
> in the cards, but a variant of this idea might be worth exploring further.
> If we had a *server side* component to map between on-disk d_off values
> and those we present to clients, then it might be able to do a better job
> than the local FS of ensuring uniqueness within the bits (e.g. 48 of them)
> that are left over after we subtract some for a brick ID.  This could be
> enough to make the bit-stealing approach (on the client) viable.  There
> are probably some issues with failing over between replicas, which should
> have the same files but might not have assigned the same internal d_off
> values, but those issues might be avoidable if the d_off values are
> deterministic with respect to GFIDs.

Having a server-side xlator seems a better approach; however, I see
some problems that need to be solved:

The mapper should work on the fly (i.e. it should map the local d_off
to the client d_off without having full knowledge of the directory
contents). This is a good approach for really big directories because
it doesn't require wasting large amounts of memory, but it will be hard
to find a way to avoid duplicates, especially if we are limited to ~48
bits.
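
To make the bit counting concrete, here is a rough sketch of the kind of client-visible transform being discussed, assuming (purely for illustration) a 16-bit brick id in the high bits; the names and the split are hypothetical, not the actual dht code:

#include <stdint.h>

#define BRICK_BITS  16                      /* hypothetical split     */
#define LOCAL_BITS  (64 - BRICK_BITS)       /* ~48 bits for local off */
#define LOCAL_MASK  ((UINT64_C(1) << LOCAL_BITS) - 1)

/* Client-visible d_off = brick id in the high bits + truncated local d_off. */
static inline uint64_t doff_encode(uint16_t brick_id, uint64_t local_doff)
{
    return ((uint64_t)brick_id << LOCAL_BITS) | (local_doff & LOCAL_MASK);
}

/* Recover the brick to resume on and the (possibly lossy) local offset. */
static inline void doff_decode(uint64_t doff, uint16_t *brick_id,
                               uint64_t *local_doff)
{
    *brick_id   = (uint16_t)(doff >> LOCAL_BITS);
    *local_doff = doff & LOCAL_MASK;
    /* If the local filesystem hands out d_off values wider than
     * LOCAL_BITS, the masking above loses information and two distinct
     * entries can collide: exactly the uniqueness problem above. */
}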

Making it based on the GFID would be a good way to have a common d_off
between bricks, but maintaining order will be harder. It will also be
hard to guarantee uniqueness if the mapping is deterministic and the
directory is very big; otherwise it would need to read the full
directory contents before returning mapped d_off's.

To minimize the collision problem, we need to solve the ordering
problem: if we can guarantee that all bricks return directory entries
in the same order and with the same d_off, we don't need to reserve any
bits in d_off.

I think the virtual directories solution should be the one to consider
for 4.0. For earlier versions we can try to find an intermediate solution.

Following your idea of a server-side component, could this be useful?

* Keep each directory and its entries in a doubly linked list stored in
xattrs of each inode.

* Use this linked list to build the readdir answer.

* Use the first 64 (or 63) bits of gfid as the d_off.

* There will be two special offsets: 0 for '.' and 1 for '..'

Example (using shortened gfids for simplicity):

Directory root with gfid 0001
Directory 'test1' inside root with gfid 1111
Directory 'test2' inside root with gfid 2222
Entry 'entry1' inside 'test1' with gfid 3333
Entry 'entry2' inside 'test1' with gfid 4444
Entry 'entry3' inside 'test2' with gfid 4444
Entry 'entry4' inside 'test2' with gfid 5555
Entry 'entry5' inside 'test2' with gfid 6666

/ (0001)
   test1/ (1111)
     entry1 (3333)
     entry2 (4444)
   test2/ (2222)
     entry3 (4444)
     entry4 (5555)
     entry5 (6666)

Note that entry2 and entry3 are hardlinks.

xattrs of root (0001):
    trusted.dirmap.0001.next = 1111
    trusted.dirmap.0001.prev = 2222

xattrs of 'test1' (1111):
    trusted.dirmap.0001.next = 2222
    trusted.dirmap.0001.prev = 0001
    trusted.dirmap.1111.next = 3333
    trusted.dirmap.1111.prev = 4444

xattrs of 'test2' (2222):
    trusted.dirmap.0001.next = 0001
    trusted.dirmap.0001.prev = 1111
    trusted.dirmap.2222.next = 4444
    trusted.dirmap.2222.prev = 6666

xattrs of 'entry1' (3333):
    trusted.dirmap.1111.next = 4444
    trusted.dirmap.1111.prev = 1111

xattrs of 'entry2'/'entry3' (4444):
    trusted.dirmap.1111.next = 1111
    trusted.dirmap.1111.prev = 3333
    trusted.dirmap.2222.next = 5555
    trusted.dirmap.2222.prev = 2222

xattrs of 'entry4' (5555):
    trusted.dirmap.2222.next = 6666
    trusted.dirmap.2222.prev = 4444

xattrs of 'entry5' (6666):
    trusted.dirmap.2222.next = 2222
    trusted.dirmap.2222.prev = 5555

It's easy to enumerate all entries from the beginning of a directory.
Also, since we already return extra information for each inode in a
directory, accessing these new xattrs does not represent a big extra
cost.
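
A rough sketch of how a brick-side xlator could walk this list to build a readdir reply; gfid_t, dirmap_next() and emit_entry() are hypothetical helpers (dirmap_next() would read the trusted.dirmap.<parent-gfid>.next xattr of an entry), not existing Gluster interfaces:

#include <stdint.h>
#include <string.h>

typedef struct { unsigned char id[16]; } gfid_t;

/* d_off = first 64 bits of the gfid, packed big-endian so that numeric
 * d_off order matches gfid byte order. Offsets 0 and 1 stay reserved
 * for '.' and '..'. */
static uint64_t gfid_to_doff(const gfid_t *gfid)
{
    uint64_t doff = 0;
    for (int i = 0; i < 8; i++)
        doff = (doff << 8) | gfid->id[i];
    return doff;
}

/* Hypothetical helper: read trusted.dirmap.<parent>.next from 'cur'. */
int dirmap_next(const gfid_t *parent, const gfid_t *cur, gfid_t *next);

/* Hypothetical helper: append one entry to the readdir reply; returns
 * non-zero when the reply buffer is full. */
int emit_entry(const gfid_t *gfid, uint64_t doff, void *reply);

/* Follow the per-directory list starting at the parent (the list head)
 * until it wraps back to the parent. */
int dirmap_readdir(const gfid_t *parent, void *reply)
{
    gfid_t cur = *parent, next;

    while (dirmap_next(parent, &cur, &next) == 0 &&
           memcmp(next.id, parent->id, sizeof(next.id)) != 0) {
        if (emit_entry(&next, gfid_to_doff(&next), reply) != 0)
            break;          /* caller resumes later from the last d_off */
        cur = next;
    }
    return 0;
}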

Given a random d_off, it's relatively easy to find a gfid that starts
with that d_off and belongs to the directory (using
.glusterfs/xx/yy/...). It would also be easy to implement a "nearest
d_off" lookup.
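
A hedged sketch of that lookup: with the big-endian packing used in the sketch above, the two most significant bytes of the d_off are the first two bytes of the gfid, which is exactly the .glusterfs/xx/yy/ bucket to scan; doff_to_bucket() is hypothetical:

#include <stdint.h>
#include <stdio.h>

/* Map a resume d_off to the .glusterfs bucket that can contain the
 * matching gfid links. */
static void doff_to_bucket(const char *brick_path, uint64_t doff,
                           char *bucket, size_t len)
{
    unsigned xx = (unsigned)(doff >> 56) & 0xff;
    unsigned yy = (unsigned)(doff >> 48) & 0xff;

    snprintf(bucket, len, "%s/.glusterfs/%02x/%02x", brick_path, xx, yy);
    /* The brick would then scan this bucket for the smallest gfid whose
     * first 64 bits are >= doff and which carries a
     * trusted.dirmap.<parent-gfid>.* xattr, i.e. the "nearest" entry
     * belonging to the requested directory. */
}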

This mechanism guarantees the same order and d_off on all bricks
(assuming that directory modification operations are done with some
locks held), and all management is done locally on each brick,
transparently to the clients. It also has a very low probability of
collisions (we can use 63 or 64 bits for d_off instead of 48) and does
not require any transformation in dht/afr/ec.

However, some special operation would need to be implemented so that
healing procedures can add a directory entry in the correct place and
keep the order consistent between bricks when they are not in sync.
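
For instance, the heal-side insert could boil down to relinking four xattrs while the directory lock is held; dirmap_set() is a hypothetical helper, not an existing fop:

typedef struct { unsigned char id[16]; } gfid_t;

/* Hypothetical helper: set trusted.dirmap.<parent>.<field> ("next" or
 * "prev") on 'entry' so that it points to 'target'. */
int dirmap_set(const gfid_t *parent, const gfid_t *entry,
               const char *field, const gfid_t *target);

/* Link 'entry' between 'prev' and 'next' in the parent's list, e.g.
 * while healing a brick so that its order matches the other bricks.
 * The caller is expected to hold the directory lock. */
int dirmap_link(const gfid_t *parent, const gfid_t *prev,
                const gfid_t *entry, const gfid_t *next)
{
    int ret = 0;

    ret |= dirmap_set(parent, entry, "prev", prev);
    ret |= dirmap_set(parent, entry, "next", next);
    ret |= dirmap_set(parent, prev,  "next", entry);
    ret |= dirmap_set(parent, next,  "prev", entry);

    return ret ? -1 : 0;
}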

It adds some complexity to ensure integrity, but since this job is done
locally on each brick, it seems easier to maintain.

Xavi
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
