Re: [RFC ] dictionary optimizations

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Wed, 04 Sep 2013 10:27:24 +0200



    I would also like to note that each
      node can store multiple elements. Current implementation creates a
      node for each byte in the key. In my implementation I only create
      a node if there is a prefix coincidence between 2 or more keys.
      This reduces the number of nodes and the number of indirections.

      
      Each node stores the key from the beginning to allow very fast
      lookup by a single memcmp().

      
      In normal situations, this implementation is very fast and very
      efficient in space.

      
      Greetings,

      
      Xavi

      
      Al 04/09/13 10:10, En/na Xavier Hernandez ha escrit:

    
      Al 04/09/13 02:55, En/na Anand Avati
        ha escrit:

      
            On Tue, Sep 3, 2013 at 1:42 AM,
              Xavier Hernandez <xhernandez@xxxxxxxxxx>
              wrote:

              
                  Al 03/09/13 09:33, En/na Anand Avati ha escrit:

                  
                      On Mon, Sep 2, 2013 at 7:24 AM,
                        Xavier Hernandez <xhernandez@xxxxxxxxxx>
                        wrote:

                        
                            Hi,

                              
                              dict_t structures are widely used in
                              glusterfs. I've some ideas that could
                              improve its performance.

                              
                              * On delete operations, return the current
                              value if it exists.

                              
                              This is very useful when we want to get a
                              value and remove it from the dictionary.
                              This way it can be done accessing and
                              locking the dict_t only once (and it is
                              atomic).

                            
                            Makes sense.
                            

                             * On add
                              operations, return the previous value if
                              it existed.

                              
                              This avoids to use a lookup and a
                              conditional add (and it is atomic).

                            
                            Do you mean dict_set()? If so, how do
                              you propose we differentiate between
                              "failure" and "previous value did not
                              exist"? Do you propose setting the
                              previous value into a pointer to pointer,
                              and retain the return value as is today?
                          
                        
                  Yes, I'm thinking to something similar to dict_set()
                  (by the way, I would remove the dict_add() function).
              
              
              dict_add() is used in unserialization
                routines where dict_set() for a big set of keys
                guaranteed not to repeat is very expensive
                (unserializing would otherwise have a quadratic function
                as its asymptote). What is the reason you intend to
                remove it?
            
          
      Yes, but it is used only inside dict.c itself. It should not be
      published in the dict.h. This is only a cosmetic change but I
      think it would be better. Any other use of dict_add() by a
      translator will be incorrect unless it is made with much care,
      which is not desirable for future maintenance.

      
                 What you propose
                  would be the simplest solution right now. However I
                  think it would be interesting to change the return
                  value to an error code (this would supply more
                  detailed information in case of failure and we could
                  use EEXIST to know if the value already existed. In
                  fact I think it would be interesting to progressively
                  change the -1 return code of many functions by an
                  error code). The pointer to pointer argument could be
                  NULL if the previous value is not needed.

                  
                  Of course this would change the function signature,
                  breaking a lot of existing code. Another possibility
                  could be to create a dict_replace() function, and
                  possibly make it to fail if the value didn't exist.
              
              
              It is best we do not change the meaning of
                existing APIs, and just add new APIs instead. The new
                API can be:
              

              int dict_replace (dict_t *dict, const char
                *key, data_t *newval, data_t **oldval);
              

              .. and leave dict_set() as is.
               
            
      That would be good. I would allow that dict_replace() sets *oldval
      to NULL to indicate that the key did not exist, and maintain the
      0/-1 return code to indicate error (this will maintain homogeneity
      with other APIs, though I still think that an error code would be
      more useful in general, but this would be another topic).

      
                             * Always
                              return the data_pair_t structure instead
                              of data_t or the data itself.

                              
                              This can be useful to avoid future lookups
                              or other operations on the same element.
                              Macros can be created to simplify writing
                              code to access the actual value.

                            
                            The use case is not clear. A more
                              concrete example will help..
                            

                  Having a data_pair_t could help to navigate from an
                  existing element (getting next or previous. This is
                  really interesting if dict where implemented using a
                  sorted structure like a trie since it would allow to
                  process a set of similar entries very fast, like the
                  trusted.afr.<brick> values for example) or
                  removing or replacing it without needing another
                  lookup (a more detailed analysis would be needed to
                  see how to handle race conditions).

                  
                  By the way, is really the dict_t structure used
                  concurrently ? I haven't analyzed all the code deeply,
                  but it seems to me that every dict_t is only accessed
                  from a single place at once.
              
              
              There have been instances of dict_t getting
                used concurrently, when used as xdata and in xattrop (by
                AFR). There have been bugs in the past with concurrent
                dict access.
               
            
      I missed that. Sorry.

      
                             * Use a trie
                              instead of a hash.

                              
                              A trie structure is a bit more complex
                              than a hash, but only processes the key
                              once and does not need to compute the
                              hash. A test implementation I made with a
                              trie shows a significant improvement in
                              dictionary operations.

                            
                            There is already an implementation of
                              trie in libglusterfs/src/trie.c. Though it
                              does not compact (collapse) single-child
                              nodes upwards into the parent. In any
                              case, let's avoid having two
                              implementations of tries.
                          
                        
                  I know. The current implementation wastes a lot of
                  memory because it uses an array of 256 pointers, and
                  in some places it needs to traverse the array. Not a
                  b¡g deal, but if it is made many times it could be
                  noticeable. In my test I used a trie with 4 child
                  pointers (with collapsing single-child nodes) that
                  runs a bit faster than the 256 implementation and uses
                  much less memory. I tried with 2, 4, 16 and 256 childs
                  per node, and 4 seems to be the best (at least for
                  dictionary structures) though there are very little
                  difference between 4 and 16 in terms of speed.

                
              The 256 child pointers give you constant
                time lookup for the next level child with just an offset
                indirection. With smaller fan-out, do you search through
                the list? Can you show an example of this? Collapsing
                single child node upwards is badly needed though.
            
          
      For the case of 4 childs I split each byte in 4 elements of 2
      bits. This makes it very easy to access the next child. It only
      needs some basic logic operations. This is the current
      implementation for the lookup function. It shows how it is
      accessed:

      
      /* TRIE_DIMENSION values:
 *
 *    1: 1 bit per element, 8 elements per byte
 *    2: 2 bits per element, 4 elements per byte
 *    3: 4 bits per element, 2 elements per byte
 *    4: 8 bits per element, 1 element per byte
 */

#define TRIE_DIMENSION      2
#define TRIE_INDEX_BITS     (4 - TRIE_DIMENSION)
#define TRIE_ELEM_BITS      (1 << (TRIE_DIMENSION - 1))
#define TRIE_CHILDS         (1 << TRIE_ELEM_BITS)
#define TRIE_ELEMS_PER_BYTE (1 << TRIE_INDEX_BITS)
#define TRIE_INDEX_MASK     (TRIE_ELEMS_PER_BYTE - 1)
#define TRIE_ELEM_MASK      (TRIE_CHILDS - 1)

#define sys_trie_value(_data, _offs) \
    (((_data) >> (((_offs) & TRIE_INDEX_MASK) * TRIE_ELEM_BITS)) & \
     TRIE_ELEM_MASK)

#define sys_trie_value_idx(_data, _offs) \
    ({ \
        off_t __tmp_offs = (_offs); \
        sys_trie_value(((_data)[(__tmp_offs) >> TRIE_INDEX_BITS]), \
                       __tmp_offs); \
    })

typedef struct _sys_trie_node
{
    struct _sys_trie_node * childs[TRIE_CHILDS];
    struct _sys_trie_node * parent;
    void *                  data;
    uint32_t                count;
    uint32_t                length;
    uint8_t                 key[0];
} sys_trie_node_t;                                                              

typedef struct _sys_trie
{
    sys_trie_node_t root;
} sys_trie_t;

sys_trie_node_t * sys_trie_lookup(sys_trie_t * trie,
                                  const uint8_t * key,
                                  size_t length)
{
    sys_trie_node_t * node;
    size_t len;

    len = length << TRIE_INDEX_BITS;
    for (node = trie->root.childs[sys_trie_value(*key, 0)];
         (node != NULL) && (node->length < len);
         node = node->childs[sys_trie_value_idx(key, node->length)]);

    if ((node != NULL) && (node->length == len) && (node->data != NULL) &&
        (memcmp(node->key, key, length) == 0))
    {
        return node;
    }
    return NULL;
}

      
      It's true that the collapsing feature is not really used for
      dictionaries, however I think it would be interesting to have it
      to use tries for other purposes that may require a long duration
      data structure that eventually adds and removes items.

      
                 I agree that it
                  is not good to maintain two implementations of the
                  same thing. Maybe we could change the trie
                  implementation. It should be transparent.
              
              
              Yes, I believe the current API can
                accommodate such internal changes.
               
              
                             * Implement
                              dict_foreach() as a macro (similar to
                              kernel's list_for_each()).

                              
                              This gives more control and avoids the
                              need of helper functions.

                            
                            This makes sense too, but there are
                              quite a few users of dict_foreach in the
                              existing style. Moving them all over might
                              be a pain.
                          
                        
                  Maybe we could create a differently named macro to
                  implement this feature and allow the developers to
                  slowly change it. The old implementation could be
                  flagged as deprecated and use the new one for new
                  code. Old code will have enough time to change it
                  until eventually the old implementation is removed.

                  
                  If we make important changes to the dict_t structure,
                  we could replace current functions by macros that use
                  the new implementation but simulates the old behavior.
              
              
              Sounds OK.
               
              
                              Additionally, I think it's possible to
                              redefine structures to reduce the number
                              of allocations and pointers used for each
                              element (actual data, data_t, data_pair_t
                              and key).

                            
                            This is highly desirable. There was
                              some effort from Amar in the past (http://review.gluster.org/3910)
                              but it has been in need of attention for
                              some time. It would be intersting to know
                              if you were thinking along similar lines?
                            

                  Yes, it is quite similar though I should analyze it
                  more deeply. I would also try to remove some
                  unused/unneeded fields that are used in very few
                  places, add complexity and can be replaced easily,
                  like extra_free and extra_stdfree in dict_t for
                  example.
                  

              Thanks,
              Avati
               
            
        _______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel

      
      _______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel