Re: Readdir d_off encoding

On 12/18/2014 05:11 PM, Shyam wrote:
On 12/17/2014 05:04 AM, Xavier Hernandez wrote:
Just to consider all possibilities...

Current architecture needs to create all directory structure on all
bricks, and has the big problem that each directory in each brick will
store the files in different order and with different d_off values.

I gather that this is when EC or AFR is in place, as for DHT a file is
on one brick only.

Files are only on one brick, but directories are on all bricks. This is independent of having ec or afr in place.

This makes directory access quite complex in some cases. For example, if a readdir is made on one brick and that brick dies, the listing cannot be continued on another brick, at least not without some complex handling. This is the consequence of having a copy of the directory on each brick as if it were replicated, even though these copies are not exactly equal.
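To make this concrete, here is a minimal sketch (not the actual DHT code; the helper names and the exact bit split are illustrative) of the kind of d_off transformation a cluster translator can apply so that a continuation offset also identifies the brick it came from:

#include <stdint.h>

#define SUBVOL_BITS 8   /* upper bits reserved for the subvolume index */

/* Encode: keep the brick's native d_off in the low bits and stamp the
 * subvolume index into the high bits before returning it to the client. */
static inline uint64_t
doff_encode(uint64_t brick_doff, unsigned subvol_idx)
{
        return (brick_doff & (~0ULL >> SUBVOL_BITS)) |
               ((uint64_t)subvol_idx << (64 - SUBVOL_BITS));
}

/* Decode: recover the subvolume index and the brick-local d_off so the
 * next readdir request can be routed back to the same brick. */
static inline void
doff_decode(uint64_t doff, unsigned *subvol_idx, uint64_t *brick_doff)
{
        *subvol_idx = (unsigned)(doff >> (64 - SUBVOL_BITS));
        *brick_doff = doff & (~0ULL >> SUBVOL_BITS);
}

Even with an encoding like this, the low bits are still whatever the brick's backend filesystem returned, so a replica cannot take over a half-finished listing: its backend may hand out the same entries in a different order with different offsets.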

This architecture also forces ec to keep directories replicated, which adds complexity.



This is a serious scalability issue and causes many inconveniences when
trying to heal or detect inconsistencies between bricks (basically we
would need to read the full directory contents of each brick to compare
them).

I am not quite familiar with EC so pardon the ignorance.
Why/How does d_off play a role in this healing/crawling?

This problem is also present in afr. There are two easy-to-see problems:

* If multiple readdir requests are needed to get the full contents of a directory and the brick to which the requests are being sent dies, the next readdir request cannot be sent to any other brick, because the d_off field won't make sense on the other brick. This doesn't have an easy solution, so an error is returned instead of completing the directory listing. This is odd because in theory we have the directory replicated, so this shouldn't happen (the same scenario when reading from a file is handled transparently to the client).

* If you need to detect the differences between the directory contents on different bricks (for example when you want to heal a directory), you will need to read the full contents of the directory from each brick into memory, sort each list, and then begin the comparison. If that directory contains, for example, one million entries, that needs a huge amount of memory for an operation that ought to be simpler. If all bricks returned directory entries in the same order and with the same d_off values, this procedure would need far less memory and would be more efficient.
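As a rough illustration of the memory argument (a hypothetical helper, not gluster code): once both bricks return entries in the same order, the comparison becomes a streaming merge that needs almost no extra memory.

#include <stdio.h>
#include <string.h>

/* Walk two name lists that already share the same (sorted) order and
 * report entries present on only one side. */
static void
compare_listings(const char **a, size_t na, const char **b, size_t nb)
{
        size_t i = 0, j = 0;

        while (i < na && j < nb) {
                int cmp = strcmp(a[i], b[j]);
                if (cmp == 0) {
                        i++; j++;                 /* present on both bricks */
                } else if (cmp < 0) {
                        printf("only on brick A: %s\n", a[i++]);
                } else {
                        printf("only on brick B: %s\n", b[j++]);
                }
        }
        while (i < na)
                printf("only on brick A: %s\n", a[i++]);
        while (j < nb)
                printf("only on brick B: %s\n", b[j++]);
}

With per-brick orderings, the sort step alone forces the whole listing into memory before the first entry can be compared; with a common order, the two readdir streams could be compared chunk by chunk.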



An alternative would be to convert directories into regular files from
the brick's point of view.

The benefits of this would be:

* d_off would be controlled by gluster, so all bricks would return the
same d_off values in the same order. There would be no need for any
d_off mapping or transformation.

* Directories could take advantage of the replication and disperse
self-heal procedures. They could be treated as files and healed more
easily. A corrupted brick would not produce invalid directory contents,
and file duplication in directory listings would be avoided.

* Many of the complexities in DHT, AFR and EC to manage directories
would be removed.

The main issue could be the need for an upper-level xlator that would
transform directory requests into file modifications and would be
responsible for managing all d_off assignment and directory manipulation
(renames, links, unlinks, ...).
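As a sketch only (this is not a proposed on-disk format, and the struct name is made up), such an xlator could append records like the following to the backing file; the d_off handed to clients would simply be the record's offset inside that file, identical by construction on every brick holding a copy:

#include <stdint.h>

struct gf_dirent_rec {
        uint64_t d_off;      /* offset of the next record in the file */
        uint64_t d_ino;      /* inode number derived from the gfid    */
        uint32_t d_type;     /* DT_REG, DT_DIR, ...                   */
        uint32_t d_namelen;  /* length of d_name, excluding the NUL   */
        char     d_name[];   /* entry name, NUL terminated            */
};

The xlator would turn creates, renames and unlinks into ordinary writes on this file, so AFR or EC could heal a directory in exactly the same way they heal any other file.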

This is tending towards some thoughts for Gluster 4.0, and specifically
DHT in 4.0. I am going to wait for the same/similar comments as we
discuss those specifics (hopefully published before Christmas 2014).

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel



