Re: readdir() scalability (was Re: [RFC ] dictionary optimizations)

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Thu, 12 Sep 2013 12:08:31 +0200

Al 09/09/13 17:25, En/na Vijay Bellur ha escrit:
On 09/09/2013 02:18 PM, Xavier Hernandez wrote:
Al 06/09/13 20:43, En/na Anand Avati ha escrit:

On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez
<xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>> wrote:

    Al 04/09/13 18:10, En/na Anand Avati ha escrit:
    On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez
    <xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>> wrote:

        Al 04/09/13 14:05, En/na Jeff Darcy ha escrit:

            On 09/04/2013 04:27 AM, Xavier Hernandez wrote:

                I would also like to note that each node can store
                multiple elements.
                Current implementation creates a node for each byte
                in the key. In my
                implementation I only create a node if there is a
                prefix coincidence between
                2 or more keys. This reduces the number of nodes and
                the number of
                indirections.

            Whatever we do, we should try to make sure that the
            changes are profiled
            against real usage.  When I was making my own dict
            optimizations back in March
            of last year, I started by looking at how they're
            actually used. At that time,
            a significant majority of dictionaries contained just one
            item. That's why I
            only implemented a simple mechanism to pre-allocate the
            first data_pair instead
            of doing something more ambitious.  Even then, the
            difference in actual
            performance or CPU usage was barely measurable. Dict
            usage has certainly
            changed since then, but I think you'd still be hard
            pressed to find a case
            where a single dict contains more than a handful of
            entries, and approaches
            that are optimized for dozens to hundreds might well
            perform worse than simple
            ones (e.g. because of cache aliasing or branch
            misprediction).

            If you're looking for other optimization opportunities
            that might provide even
            bigger "bang for the buck" then I suggest that
            stack-frame or frame->local
            allocations are a good place to start.  Or string copying
            in places like
            loc_copy.  Or the entire fd_ctx/inode_ctx subsystem.  Let
            me know and I'll come
            up with a few more.  To put a bit of a positive spin on
            things, the GlusterFS
            code offers many opportunities for improvement in terms
            of CPU and memory
            efficiency (though it's surprisingly still way better
            than Ceph in that regard).

        Yes. The optimizations on dictionary structures are not a big
        improvement in the overall performance of GlusterFS. I tried
        it on a real situation and the benefit was only marginal.
        However I didn't test new features like an atomic lookup and
        remove if found (because I would have had to review all the
        code). I think this kind of functionalities could improve a
        bit more the results I obtained.

        However this is not the only reason to do these changes.
        While I've been writing code I've found that it's tedious to
        do some things just because there isn't such functions in
        dict_t. Some actions require multiple calls, having to check
        multiple errors and adding complexity and limiting
        readability of the code. Many of these situations could be
        solved using functions similar to what I proposed.

        On the other side, if dict_t must be truly considered a
        concurrent structure, there are a lot of race conditions that
        might appear when doing some operations. It would require a
        great effort to take care of all these possibilities
        everywhere. It would be better to pack most of these
        situations into functions inside the dict_t itself where it
        is easier to combine some operations.

        By the way, I've made some tests with multiple bricks and it
        seems that there is a clear speed loss on directory listings
        as the number of bricks increases. Since bricks should be
        independent and they can work in parallel, I didn't expected
        such a big performance degradation.

    The likely reason is that, even though bricks are parallel for
    IO, readdir is essentially a sequential operation and DHT has a
    limitation that a readdir reply batch does not cross server
    boundaries. So if you have 10 files and 1 server, all 10 entries
    are returned in one call to the app/libc. If you have 10 files
    and 10 servers evenly distributed, the app/libc has to perform 10
    calls and keeps getting one file at a time. This problem goes
    away when each server has enough files to fill up a readdir
    batch. It's only when you have too few files and too many servers
    that this "dilution" problem shows up. However, this is just a
    theory and your problem may be something else too..

    I didn't know that DHT was doing a sequential brick scan on
    readdir(p) (my fault). Why is that ? Why it cannot return entries
    crossing a server boundary ? is it due to a technical reason or is
    it only due to the current implementation ?

    I've made a test using only directories (50 directories with 50
    subdirectories each). I started with one brick and I measured the
    time to do a recursive 'ls'. Then I sequentially added an
    additional brick, up to 6 (all of them physically independent),
    and repeated the ls. The time increases linearly as the number of
    bricks augments. As more bricks were added, the rebalancing time
    was also growing linearly.

    I think this is a big problem for scalability. It can be partially
    hidden by using some caching or preloading mechanisms, but it will
    be there and it will hit sooner or later.

    Note that Brian Foster's readdir-ahead patch should address this
    problem to a large extent. When loaded on top of DHT, the
    prefiller effectively collapses the smaller chunks returned by
    DHT into a larger chunk requested by the app/libc.

    I've seen it, however I think it will only partially mitigate and
    hide an existing problem. Imagine you have some hundreds or a
    thousand of bricks. I doubt readdir-ahead or anything else can
    hide the enormous latency that the sequential DHT scan will
    generate in that case.

    The main problem I see is that the full directory structure is
    read many times sequentially. I think it would be better to do the
    readdir(p) calls in parallel and combine them (possibly in
    background). This way the time to scan the directory structure
    would be almost constant, independently of the number of bricks.

The design of the directory entries in DHT makes this essentially a
sequential operation because entries from servers are appended, not
striped. What I mean is, the logical ordering of

All entries in a directory = All files and dirs in 0th server + All
files (no dirs) in 1st server + All files (no dirs) in 2nd server + ..
+ All files (no dirs) in N'th server.

in a sequential manner. If we read the entries of 2nd server along
with entries of 1st server, we cannot "use" it till we finish reading
all entries of 1st server and get EOD from it - which is why
readdir-ahead is a more natural solution than reading in parallel for
the above design.

As I understand it, what the read-ahead translator does is to collect
one or more answers from the DHT translator and combine them to return a
single answer as big as possible. If that is correct, it will certainly
reduce the number of readdir calls from application, however I think it
will still have a considerable latency when used on big clusters. Anyway
I don't have any measurement or valid argument to support this, so lets
see how readdir-ahead works in real environments before discussing 
about it.

Also, this is a problem only if each server has fewer entries than
what can be returned in a single readdir() request by the application.
As long as the server has more than this "minimum threshold" of number
of files, the number of batched readdir() made by the client is going
to be fixed, and those various requests will be spread across various
servers (as opposed to, sending them all to the same server).

I've seen customers with large amounts of empty, or almost empty,
directories. Don't ask me why, I don't understand it either...

So yes, as you add servers for a given small set of files the
scalability drops, but that is only till you create more files, when
the # of servers stop mattering again.

Can you share the actual numbers from the tests you ran?

I've made the tests in 6 physical servers (Quad Atom D525 1.8 GHz. These
are the only servers I can use regularly to do tests) connected through
a dedicated 1 Gbit switch. Bricks are stored in 1TB SATA disks with ZFS.
One of the servers was also used as a client to do the tests.

Initially I created a volume with a single brick. I initialized the
volume with 50 directories with 50 subdirectories each (a total of 2500
directories). No files.

Have you tried turning on "cluster.readdir-optimize"? This could help 
improve readdir performance for the directory hierarchy that you 
describe.

I repeated the tests with this option enabled and it really improved 
readdir performance, however it still shows a linear speed loss as the 
number of bricks increases. Will the readdir-ahead translator be able to 
hide this linear effect when the number of bricks is very high ?

Results of the tests with cluser.readdir-optimize active:

1 brick: 11.8 seconds
2 bricks: 15.4 seconds
3 bricks: 17.9 seconds
4 bricks: 20.6 seconds
5 bricks: 22.9 seconds
6 bricks: 25.4 seconds
12 bricks: 41.8 seconds

Rebalance also improved:

From 1 to 2 bricks: 77 seconds
From 2 to 3 bricks: 78 seconds
From 3 to 4 bricks: 81 seconds
From 4 to 5 bricks: 84 seconds
From 5 to 6 bricks: 87 seconds
From 6 to 12 bricks: 144 seconds

Xavi

-Vijay

After each test, I added a new brick and started a rebalance. Once the
rebalance was completed, I umounted and stopped the volume and restarted
it again.

The test consisted of 4 'time ls -lR /<testdir> | wc -l'. The first
result was discarded. The result shown below is the mean of the other 3
results.

1 brick: 11.8 seconds
2 bricks: 19.0 seconds
3 bricks: 23.8 seconds
4 bricks: 29.8 seconds
5 bricks: 34.6 seconds
6 bricks: 41.0 seconds
12 bricks (2 bricks on each server): 78.5 seconds

The rebalancing time also grew considerably (these times are the result
of a single rebalance. They might not be very accurate):

 From 1 to 2 bricks: 91 seconds
 From 2 to 3 bricks: 102 seconds
 From 3 to 4 bricks: 119 seconds
 From 4 to 5 bricks: 138 seconds
 From 5 to 6 bricks: 151 seconds
 From 6 to 12 bricks: 259 seconds

The number of disk IOPS didn't exceed 40 in any server in any case. The
network bandwidth didn't go beyond 6 Mbits/s between any pair of servers
and none of them reached 100% core usage.

Xavi

Avati

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel