On 06/09/13 20:43, Anand Avati wrote:
As I understand it, what the readdir-ahead translator does is collect one or more answers from the DHT translator and combine them to return a single answer as big as possible. If that is correct, it will certainly reduce the number of readdir calls from the application, but I think it will still have considerable latency when used on big clusters. Anyway, I don't have any measurements or solid arguments to support this, so let's see how readdir-ahead works in real environments before discussing it further.

I've seen customers with large amounts of empty, or almost empty, directories. Don't ask me why, I don't understand it either...

I ran the tests on 6 physical servers (Quad Atom D525 at 1.8 GHz; these are the only servers I can use regularly for testing) connected through a dedicated 1 Gbit switch. Bricks are stored on 1 TB SATA disks with ZFS. One of the servers was also used as the client running the tests.

Initially I created a volume with a single brick and populated it with 50 directories of 50 subdirectories each (a total of 2500 directories, no files). After each test, I added a new brick and started a rebalance. Once the rebalance was completed, I unmounted and stopped the volume and then started it again. Each test consisted of 4 runs of 'time ls -lR /<testdir> | wc -l'. The first result was discarded; the results shown below are the mean of the other 3 runs.

1 brick:  11.8 seconds
2 bricks: 19.0 seconds
3 bricks: 23.8 seconds
4 bricks: 29.8 seconds
5 bricks: 34.6 seconds
6 bricks: 41.0 seconds
12 bricks (2 bricks on each server): 78.5 seconds

The rebalancing time also grew considerably (these times come from a single rebalance each, so they might not be very accurate):

From 1 to 2 bricks:  91 seconds
From 2 to 3 bricks:  102 seconds
From 3 to 4 bricks:  119 seconds
From 4 to 5 bricks:  138 seconds
From 5 to 6 bricks:  151 seconds
From 6 to 12 bricks: 259 seconds

The number of disk IOPS didn't exceed 40 on any server in any of the tests. The network bandwidth didn't go beyond 6 Mbit/s between any pair of servers, and none of them reached 100% usage on any core.
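In case it helps to reproduce or interpret the numbers, here is roughly the procedure for one step of the test, written as a shell sketch. The volume name, hostnames and brick paths below are placeholders, not the exact ones I used, and the gluster commands are the standard CLI ones rather than a verbatim copy of my history:

  # Placeholder names: volume 'testvol', servers 'serverN', bricks under /bricks.
  # Create and mount the initial single-brick volume:
  gluster volume create testvol server1:/bricks/testvol
  gluster volume start testvol
  mount -t glusterfs server1:/testvol /mnt/testvol

  # Populate 50 directories with 50 subdirectories each (2500 directories, no files):
  for i in $(seq 1 50); do
      for j in $(seq 1 50); do
          mkdir -p /mnt/testvol/testdir/dir.$i/sub.$j
      done
  done

  # The test itself: 4 runs, the first discarded, the mean taken over the other 3:
  for run in 1 2 3 4; do
      time ls -lR /mnt/testvol/testdir | wc -l
  done

  # Before the next measurement: add a brick, rebalance, then unmount,
  # stop and start the volume so everything begins from a clean state.
  gluster volume add-brick testvol server2:/bricks/testvol
  gluster volume rebalance testvol start
  gluster volume rebalance testvol status     # wait until it reports completed
  umount /mnt/testvol
  gluster --mode=script volume stop testvol   # --mode=script skips the y/n prompt
  gluster volume start testvol
  mount -t glusterfs server1:/testvol /mnt/testvol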
Xavi