Re: [RFC ] dictionary optimizations

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Fri, 06 Sep 2013 10:46:23 +0200



    Al 04/09/13 18:10, En/na Anand Avati ha
      escrit:

    
      On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx>
        wrote:

        
            Al
              04/09/13 14:05, En/na Jeff Darcy ha escrit:
              
                
                    On 09/04/2013 04:27 AM, Xavier Hernandez wrote:

                    
                      I would also like to note that each node can store
                      multiple elements.

                      Current implementation creates a node for each
                      byte in the key. In my

                      implementation I only create a node if there is a
                      prefix coincidence between

                      2 or more keys. This reduces the number of nodes
                      and the number of

                      indirections.

                    
                    Whatever we do, we should try to make sure that the
                    changes are profiled

                    against real usage.  When I was making my own dict
                    optimizations back in March

                    of last year, I started by looking at how they're
                    actually used. At that time,

                    a significant majority of dictionaries contained
                    just one item. That's why I

                    only implemented a simple mechanism to pre-allocate
                    the first data_pair instead

                    of doing something more ambitious.  Even then, the
                    difference in actual

                    performance or CPU usage was barely measurable.
                     Dict usage has certainly

                    changed since then, but I think you'd still be hard
                    pressed to find a case

                    where a single dict contains more than a handful of
                    entries, and approaches

                    that are optimized for dozens to hundreds might well
                    perform worse than simple

                    ones (e.g. because of cache aliasing or branch
                    misprediction).

                    
                    If you're looking for other optimization
                    opportunities that might provide even

                    bigger "bang for the buck" then I suggest that
                    stack-frame or frame->local

                    allocations are a good place to start.  Or string
                    copying in places like

                    loc_copy.  Or the entire fd_ctx/inode_ctx subsystem.
                     Let me know and I'll come

                    up with a few more.  To put a bit of a positive spin
                    on things, the GlusterFS

                    code offers many opportunities for improvement in
                    terms of CPU and memory

                    efficiency (though it's surprisingly still way
                    better than Ceph in that regard).

                    
              Yes. The optimizations on dictionary structures are not a
              big improvement in the overall performance of GlusterFS. I
              tried it on a real situation and the benefit was only
              marginal. However I didn't test new features like an
              atomic lookup and remove if found (because I would have
              had to review all the code). I think this kind of
              functionalities could improve a bit more the results I
              obtained.

              
              However this is not the only reason to do these changes.
              While I've been writing code I've found that it's tedious
              to do some things just because there isn't such functions
              in dict_t. Some actions require multiple calls, having to
              check multiple errors and adding complexity and limiting
              readability of the code. Many of these situations could be
              solved using functions similar to what I proposed.

              
              On the other side, if dict_t must be truly considered a
              concurrent structure, there are a lot of race conditions
              that might appear when doing some operations. It would
              require a great effort to take care of all these
              possibilities everywhere. It would be better to pack most
              of these situations into functions inside the dict_t
              itself where it is easier to combine some operations.

              
              By the way, I've made some tests with multiple bricks and
              it seems that there is a clear speed loss on directory
              listings as the number of bricks increases. Since bricks
              should be independent and they can work in parallel, I
              didn't expected such a big performance degradation.
            

            The likely reason is that, even though bricks are
              parallel for IO, readdir is essentially a sequential
              operation and DHT has a limitation that a readdir reply
              batch does not cross server boundaries. So if you have 10
              files and 1 server, all 10 entries are returned in one
              call to the app/libc. If you have 10 files and 10 servers
              evenly distributed, the app/libc has to perform 10 calls
              and keeps getting one file at a time. This problem goes
              away when each server has enough files to fill up a
              readdir batch. It's only when you have too few files and
              too many servers that this "dilution" problem shows up.
              However, this is just a theory and your problem may be
              something else too..
            

    I didn't know that DHT was doing a sequential brick scan on
    readdir(p) (my fault). Why is that ? Why it cannot return entries
    crossing a server boundary ? is it due to a technical reason or is
    it only due to the current implementation ?

    
    I've made a test using only directories (50 directories with 50
    subdirectories each). I started with one brick and I measured the
    time to do a recursive 'ls'. Then I sequentially added an additional
    brick, up to 6 (all of them physically independent), and repeated
    the ls. The time increases linearly as the number of bricks
    augments. As more bricks were added, the rebalancing time was also
    growing linearly.

    
    I think this is a big problem for scalability. It can be partially
    hidden by using some caching or preloading mechanisms, but it will
    be there and it will hit sooner or later.

    
            Note that Brian Foster's readdir-ahead patch should
              address this problem to a large extent. When loaded on top
              of DHT, the prefiller effectively collapses the smaller
              chunks returned by DHT into a larger chunk requested by
              the app/libc.
            

    I've seen it, however I think it will only partially mitigate and
    hide an existing problem. Imagine you have some hundreds or a
    thousand of bricks. I doubt readdir-ahead or anything else can hide
    the enormous latency that the sequential DHT scan will generate in
    that case.

    
    The main problem I see is that the full directory structure is read
    many times sequentially. I think it would be better to do the
    readdir(p) calls in parallel and combine them (possibly in
    background). This way the time to scan the directory structure would
    be almost constant, independently of the number of bricks.

    
    Xavi

    
            Avati
            

            However
              the tests have not been exhaustive nor made in best
              conditions so they might be misleading. Anyway it seems to
              me that there might be a problem with some mutexes that
              force too much serialization of requests (though I have no
              real proves it's only a feeling). Maybe some more
              "asynchronousity" on calls between translators could help.

              
              Only some thoughts...

              
              Best regards,

              
              Xavi
              
                
                    _______________________________________________

                    Gluster-devel mailing list

                    Gluster-devel@xxxxxxxxxx

                    https://lists.nongnu.org/mailman/listinfo/gluster-devel

                  
                  _______________________________________________

                  Gluster-devel mailing list

                  Gluster-devel@xxxxxxxxxx

                  https://lists.nongnu.org/mailman/listinfo/gluster-devel

                
      _______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel