Re: Question regarding hash_resize

Akira Yokosawa <akiyks@xxxxxxxxx> · Wed, 9 Jan 2019 00:35:37 +0900

On 2019/01/09 0:28, Paul E. McKenney wrote:
> On Tue, Jan 08, 2019 at 09:56:57AM +0800, Junchang Wang wrote:
>> On Tue, Jan 8, 2019 at 7:06 AM Akira Yokosawa <akiyks@xxxxxxxxx> wrote:
>>> On 2019/01/08 07:54:16 +0900, Akira Yokosawa wrote:
>>>> Hi Paul,
>>>>
>>>> On 2019/01/07 10:33:17 -0800, Paul E. McKenney wrote:
>>>>> On Mon, Jan 07, 2019 at 09:49:19PM +0800, Junchang Wang wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I'm reading hash_resize recently, and have a few questions regarding
>>>>>> this algorithm. Please take a look if you have time. Any suggestions
>>>>>> are warmly welcomed.
>>>>>>
>>>>>> === Question 1 ===
>>>>>> In hash_resize.c : hashtab_lock_mod
>>>>>> 186         if (b > READ_ONCE(htp->ht_resize_cur)) {
>>>>>> 187                 lsp->hbp[1] = NULL;
>>>>>> 188                 return;
>>>>>> 189         }
>>>>>> 190         htp = rcu_dereference(htp->ht_new);
>>>>>>
>>>>>> It seems we are missing a barrier (e.g., smp_mb) in between lines 189
>>>>>> and 190, because neither READ_ONCE() nor rcu_dereference() can prevent
>>>>>> compilers and hardware from reordering the two unrelated variables,
>>>>>> ht_resize_cur and ht_new. Is my understanding correct?
>>>>>
>>>>> Ah, but hashtab_lock_mod() is invoked within an RCU read-side critical
>>>>> section
>>>>
>>>> You mean "rcu_read_lock() at the beginning of hashtab_lock_mod() starts
>>>> an RCU read-side critical section", don't you?
>>>>
>>>>>         and there is a synchronize_rcu() between the update to ->ht_new
>>>>> and the updates to ->ht_resize_cur.  For more details on how this works,
>>>>> please see https://lwn.net/Articles/573497/.
>>>>>
>>>>> Of course, if you find a code path in which a call to hashtab_lock_mod()
>>>>> is invoked outside of an RCU read-side critical section, that would be
>>>>> a bug.  (Can you tell me an exception to this rule, that is, a case
>>>>> where hashtab_lock_mod() could safely be invoked outside of an RCU
>>>>> read-side critical section?)
>>>>>
>>>>>> === Question 2 ===
>>>>>> In hash_resize.c, each time an updater wants to access a bucket, the
>>>>>> updater must first acquire the bucket's lock (htb_lock), preventing
>>>>>> other updaters accessing the same bucket concurrently. This approach
>>>>>> is OK if the linked list of a bucket is relatively short, but for a
>>>>>> larger system where linked lists are long enough and the
>>>>>> perftest_resize thread is running simultaneously, it could become a
>>>>>> potential performance bottleneck. One naive solution is to allow
>>>>>> multiple updaters to access the same bucket, only if they don't
>>>>>> operate on the same item of the list of this bucket. I wonder if there
>>>>>> are any existing works or discussions on this topic?
>>>>>
>>>>> One approach is to use a hashed array of locks, and to hash a given
>>>>> element's address to locate the lock to be used.  Please see
>>>>> Section 7.1.1.5 ("Conditional Locking") and Section 7.1.1.6 ("Acquire
>>>>> Needed Locks First"), including Quick Quiz 7.9, for additional details.
>>>>>
>>>>> Another approach is to use RCU to protect traversals, and locks within the
>>>>> linked-list elements themselves.  These locks are conditionally acquired
>>>>> (again, please see Section 7.1.1.5), and deadlock is avoided by acquiring
>>>>> them in list order, and the tricks in Quick Quiz 7.9.
>>>>>
>>>>> Non-blocking synchronization can also be used, but it is often quite a
>>>>> bit more complicated.  See for example the split-order list of Shalev
>>>>> and Shavit, along with Desnoyers's RCU-protected extension in the
>>>>> userspace RCU library.
>>>>>
>>>>> But it is usually -way- better to just choose a good hash function and
>>>>> to increase the number of buckets.  Which is of course one reason for
>>>>> having resizable hash tables.  ;-)
>>>>>
>>>>> But the other techniques can be useful in more complex linked data
>>>>> structures, such as graphs, where there is no reasonable way to
>>>>> partition the data.  Nevertheless, many people choose to do the
>>>>> partitioning anyway, especially on distributed systems.
>>>>>
>>>>>> === Question 3 ===
>>>>>> Chapter Data Structures also discusses other resizable hash tables,
>>>>>> namely "Resizable, scalable, concurrent hash tables via relativistic
>>>>>> programming" from Josh Triplett, which can save memory footprint by
>>>>>> using a single pair of pointers. But my understanding is that
>>>>>> perftest_resize.c is unique in that it allows you to rebuild the hash
>>>>>> table by utilizing a different hash function, which could be very
>>>>>> useful in practice (e.g., to prevent DDoS attack). Other solutions do
>>>>>> not share this property. Is my understanding correct? Did I miss any
>>>>>> discussions on this topic in perfbook?
>>>>>
>>>>> Indeed, to the best of my knowledge, Herbert Xu's pointer-pair approach
>>>>> (which I use in hash_resize.c) is the only one allowing arbitrary changes
>>>>> to hash functions.  I expect that this advantage will become increasingly
>>>>> important as security issues become more challenging.  Furthermore, I
>>>>> suspect that the pointer-pair approach is faster and more scalable.
>>>>> It is certainly simpler.
>>>>>
>>>>> On the other hand, one advantage of the other two approaches is decreased
>>>>> memory consumption.
>>>>>
>>>>> Another advantage of Josh Triplett's pointer-unzip approach is that
>>>>> concurrent updates are (in theory, anyway) not blocked for as long
>>>>> by resize operations.  The other edge of this sword is that resizing
>>>>> is much slower, given the need to wait for many RCU grace periods.
>>>>>
>>>>> Another advantage of Mathieu Desnoyers's RCUified variant of Shalev
>>>>> and Shavit's split-order list is that all operations are non-blocking,
>>>>> which can be important on massively overloaded systems, such as one
>>>>> might find in cloud computing.
>>>>>
>>>>>> === Question 4 ===
>>>>>> In the current implementation of hash_resize.c, the perftest_resize
>>>>>> could block an updater, and vice versa. It seems this is not what we
>>>>>> expected. Ideally, they should be allowed to run concurrently, or at
>>>>>> least the perftest_resize thread should have lower priority and
>>>>>> updaters should never be blocked by the perftest_resize thread. Is
>>>>>> that right? I'm very interested in helping improve. Please let me know
>>>>>> if you have any suggestions.
>>>>>
>>>>> In hash_resize.c, an updater is blocked only for the time required to
>>>>> redisposition a bucket.  This is a great improvement over blocking
>>>>> updaters for the full resize over all buckets.
>>>>>
>>>>> But yes, it is not hard to do better, for example, periodically dropping
>>>>> the old-table lock in hashtab_resize().  This requires a few careful
>>>>> adjustments, of course.  Can you tell me what these adjustments are?
>>>>>
>>>>> Hmmm...  I could simplify hashtab_lookup(), couldn't I?  After all,
>>>>> optimizing for the race with hashtab_resize() doesn't make a whole lot
>>>>> of sense.  Please see the patch below.  Thoughts?
>>>>>
>>>>>                                                      Thanx, Paul
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> commit 737646a9c868d841b32199b52f5569668975953e
>>>>> Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
>>>>> Date:   Mon Jan 7 10:29:14 2019 -0800
>>>>>
>>>>>     datastruct/hash: Simplify hashtab_lookup()
>>>>>
>>>>>     Because resizing leaves the old hash table intact, and because lookups
>>>>>     are carried out within RCU read-side critical sections (which prevent
>>>>>     a second resizing operation from starting), there is no need for a
>>>>>     lookup to search anywhere but in the old hash table.  And in the common
>>>>>     case, there is no resize, so there is no new hash table.  Therefore,
>>>>>     eliminating the check for resizing speeds things up in the common
>>>>>     case.  In addition, this simplifies the code.
>>>>>
>>>>>     This commit therefore eliminates the ht_get_bucket() function,
>>>>>     renames the ht_get_bucket_single() function to ht_get_bucket(),
>>>>>     and modifies callers appropriately.
>>>>>
>>>>>     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
>>>>>
>>>>> diff --git a/CodeSamples/datastruct/hash/hash_resize.c b/CodeSamples/datastruct/hash/hash_resize.c
>>>>> index 29e05f907200..be4157959b83 100644
>>>>> --- a/CodeSamples/datastruct/hash/hash_resize.c
>>>>> +++ b/CodeSamples/datastruct/hash/hash_resize.c
>>>>> @@ -124,8 +124,7 @@ void hashtab_free(struct hashtab *htp_master)
>>>>>  //\begin{snippet}[labelbase=ln:datastruct:hash_resize:get_bucket,commandchars=\\\@\$]
>>>>>  /* Get hash bucket corresponding to key, ignoring the possibility of resize. */
>>>>>  static struct ht_bucket *                           //\lnlbl{single:b}
>>>>> -ht_get_bucket_single(struct ht *htp, void *key, long *b,
>>>>> -                     unsigned long *h)
>>>>> +ht_get_bucket(struct ht *htp, void *key, long *b, unsigned long *h)
>>>>>  {
>>>>>      unsigned long hash = htp->ht_gethash(key);
>>>>>
>>>>> @@ -134,24 +133,6 @@ ht_get_bucket_single(struct ht *htp, void *key, long *b,
>>>>>              *h = hash;                              //\lnlbl{single:h}
>>>>>      return &htp->ht_bkt[*b];                        //\lnlbl{single:return}
>>>>>  }                                                   //\lnlbl{single:e}
>>>>> -
>>>>> -/* Get hash bucket correesponding to key, accounting for resize. */
>>>>> -static struct ht_bucket *                           //\lnlbl{b}
>>>>> -ht_get_bucket(struct ht **htp, void *key, long *b, int *i)
>>>>> -{
>>>>> -    struct ht_bucket *htbp;
>>>>> -
>>>>> -    htbp = ht_get_bucket_single(*htp, key, b, NULL); //\lnlbl{call_single}
>>>>> -                                                            //\fcvexclude
>>>>> -    if (*b <= READ_ONCE((*htp)->ht_resize_cur)) {   //\lnlbl{resized}
>>>>> -            smp_mb(); /* order ->ht_resize_cur before ->ht_new. */
>>>>
>>>> If we can remove this memory barrier, the counterpart smp_mb() in
>>>> hashtab_resize() becomes unnecessary, doesn't it?
>>>
>>> And the WRITE_ONCE() in the following line.
>>>
>>>         Thanks, Akira
>>>>
>>>>         Thanks, Akira
>>>>
>>>>> -            *htp = rcu_dereference((*htp)->ht_new); //\lnlbl{newtable}
>>>>> -            htbp = ht_get_bucket_single(*htp, key, b, NULL); //\lnlbl{newbucket}
>>>>> -    }
>>>>> -    if (i)                                          //\lnlbl{chk_i}
>>>>> -            *i = (*htp)->ht_idx;                    //\lnlbl{set_idx}
>>>>> -    return htbp;                                    //\lnlbl{return}
>>>>> -}                                                   //\lnlbl{e}
>>>>>  //\end{snippet}
>>>>>
>>>>>  /* Read-side lock/unlock functions. */
>>>>> @@ -178,7 +159,7 @@ hashtab_lock_mod(struct hashtab *htp_master, void *key,
>>>>>
>>>>>      rcu_read_lock();                                //\lnlbl{l:rcu_lock}
>>>>>      htp = rcu_dereference(htp_master->ht_cur);      //\lnlbl{l:refhashtbl}
>>>>> -    htbp = ht_get_bucket_single(htp, key, &b, &h);  //\lnlbl{l:refbucket}
>>>>> +    htbp = ht_get_bucket(htp, key, &b, &h);         //\lnlbl{l:refbucket}
>>>>>      spin_lock(&htbp->htb_lock);                     //\lnlbl{l:acq_bucket}
>>>>>      lsp->hbp[0] = htbp;                             //\lnlbl{l:lsp0b}
>>>>>      lsp->hls_idx[0] = htp->ht_idx;
>>>>> @@ -188,7 +169,7 @@ hashtab_lock_mod(struct hashtab *htp_master, void *key,
>>>>>              return;                                 //\lnlbl{l:fastret1}
>>>>>      }
>>>>>      htp = rcu_dereference(htp->ht_new);             //\lnlbl{l:new_hashtbl}
>>>>> -    htbp = ht_get_bucket_single(htp, key, &b, &h);  //\lnlbl{l:get_newbkt}
>>>>> +    htbp = ht_get_bucket(htp, key, &b, &h);         //\lnlbl{l:get_newbkt}
>>>>>      spin_lock(&htbp->htb_lock);                     //\lnlbl{l:acq_newbkt}
>>>>>      lsp->hbp[1] = htbp;                             //\lnlbl{l:lsp1b}
>>>>>      lsp->hls_idx[1] = htp->ht_idx;
>>>>> @@ -223,16 +204,15 @@ struct ht_elem *                                       //\lnlbl{lkp:b}
>>>>>  hashtab_lookup(struct hashtab *htp_master, void *key)
>>>>>  {
>>>>>      long b;
>>>>> -    int i;
>>>>>      struct ht *htp;
>>>>>      struct ht_elem *htep;
>>>>>      struct ht_bucket *htbp;
>>>>>
>>>>>      htp = rcu_dereference(htp_master->ht_cur);      //\lnlbl{lkp:get_curtbl}
>>>>> -    htbp = ht_get_bucket(&htp, key, &b, &i);        //\lnlbl{lkp:get_curbkt}
>>>>> +    htbp = ht_get_bucket(htp, key, &b, NULL);       //\lnlbl{lkp:get_curbkt}
>>>>>      cds_list_for_each_entry_rcu(htep,               //\lnlbl{lkp:loop:b}
>>>>>                                  &htbp->htb_head,
>>>>> -                                hte_next[i]) {
>>>>> +                                hte_next[htp->ht_idx]) {
>>>>>              if (htp->ht_cmp(htep, key))             //\lnlbl{lkp:match}
>>>>>                      return htep;                    //\lnlbl{lkp:ret_match}
>>>>>      }                                               //\lnlbl{lkp:loop:e}
>>>>> @@ -303,7 +283,7 @@ int hashtab_resize(struct hashtab *htp_master,
>>>>>              htbp = &htp->ht_bkt[i];                 //\lnlbl{get_oldcur}
>>>>>              spin_lock(&htbp->htb_lock);             //\lnlbl{acq_oldcur}
>>>>>              cds_list_for_each_entry(htep, &htbp->htb_head, hte_next[idx]) { //\lnlbl{loop_list:b}
>>>>> -                    htbp_new = ht_get_bucket_single(htp_new, htp_new->ht_getkey(htep), &b, NULL);
>>>>> +                    htbp_new = ht_get_bucket(htp_new, htp_new->ht_getkey(htep), &b, NULL);
>>>>>                      spin_lock(&htbp_new->htb_lock);
>>>>>                      cds_list_add_rcu(&htep->hte_next[!idx], &htbp_new->htb_head);
>>>>>                      spin_unlock(&htbp_new->htb_lock);
>>>>> diff --git a/datastruct/datastruct.tex b/datastruct/datastruct.tex
>>>>> index 5c61bf5e2389..0152437c274e 100644
>>>>> --- a/datastruct/datastruct.tex
>>>>> +++ b/datastruct/datastruct.tex
>>>>> @@ -966,10 +966,8 @@ the old table.
>>>>>  \begin{lineref}[ln:datastruct:hash_resize:get_bucket]
>>>>>  Bucket selection is shown in
>>>>>  Listing~\ref{lst:datastruct:Resizable Hash-Table Bucket Selection},
>>>>> -which shows \co{ht_get_bucket_single()} on
>>>>> -lines~\lnref{single:b}-\lnref{single:e} and
>>>>> -\co{ht_get_bucket()} on lines~\lnref{b}-\lnref{e}.
>>>>> -The \co{ht_get_bucket_single()} function returns a reference to the bucket
>>>>> +which shows \co{ht_get_bucket()}.
>>>>> +This function returns a reference to the bucket
>>>>>  corresponding to the specified key in the specified hash table, without
>>>>>  making any allowances for resizing.
>>>>>  It also stores the bucket index corresponding to the key into the location
>>>>> @@ -978,36 +976,6 @@ line~\lnref{single:gethash}, and the corresponding
>>>>>  hash value corresponding to the key into the location
>>>>>  referenced by parameter~\co{h} (if non-\co{NULL}) on line~\lnref{single:h}.
>>>>>  Line~\lnref{single:return} then returns a reference to the corresponding bucket.
>>>>> -
>>>>> -The \co{ht_get_bucket()} function handles hash-table selection, invoking
>>>>> -\co{ht_get_bucket_single()} on
>>>>> -line~\lnref{call_single} to select the bucket
>>>>> -corresponding to the hash in the current
>>>>> -hash table, storing the hash value through parameter~\co{b}.
>>>>> -If line~\lnref{resized} determines that the table is being resized and that
>>>>> -line~\lnref{call_single}'s bucket has already been distributed across the new hash
>>>>> -table, then line~\lnref{newtable} selects the new hash table and
>>>>> -line~\lnref{newbucket}
>>>>> -selects the bucket corresponding to the hash in the new hash table,
>>>>> -again storing the hash value through parameter~\co{b}.
>>>>> -\end{lineref}
>>>>> -
>>>>> -\QuickQuiz{}
>>>>> -    The code in
>>>>> -    Listing~\ref{lst:datastruct:Resizable Hash-Table Bucket Selection}
>>>>> -    computes the hash twice!
>>>>> -    Why this blatant inefficiency?
>>>>> -\QuickQuizAnswer{
>>>>> -    The reason is that the old and new hash tables might have
>>>>> -    completely different hash functions, so that a hash computed
>>>>> -    for the old table might be completely irrelevant to the
>>>>> -    new table.
>>>>> -} \QuickQuizEnd
>>>>> -
>>>>> -\begin{lineref}[ln:datastruct:hash_resize:get_bucket]
>>>>> -If line~\lnref{chk_i} finds that parameter~\co{i} is non-\co{NULL}, then
>>>>> -line~\lnref{set_idx} stores the pointer-set index for the selected hash table.
>>>>> -Finally, line~\lnref{return} returns a reference to the selected hash bucket.
>>>>>  \end{lineref}
>>>>>
>>>>>  \QuickQuiz{}
>>>>> @@ -1021,10 +989,8 @@ Finally, line~\lnref{return} returns a reference to the selected hash bucket.
>>>>>      functions described next.
>>>>>  } \QuickQuizEnd
>>>>>
>>>>> -This implementation of
>>>>> -\co{ht_get_bucket_single()} and \co{ht_get_bucket()}
>>>>> -permit lookups and modifications to run concurrently
>>>>> -with a resize operation.
>>>>> +This implementation of \co{ht_get_bucket()} permits lookups and
>>>>> +modifications to run concurrently with a resize operation.
>>>>>
>>>>>  \begin{listing}[tb]
>>>>>  \input{CodeSamples/datastruct/hash/hash_resize@lock_unlock_mod.fcv}
>>>>> @@ -1129,11 +1095,6 @@ hash lookups.
>>>>>  Line~\lnref{get_curtbl} fetches the current hash table and
>>>>>  line~\lnref{get_curbkt} obtains a reference
>>>>>  to the bucket corresponding to the specified key.
>>>>> -This bucket will be located in a new resized hash table when a
>>>>> -resize operation has progressed past the bucket in the old hash
>>>>> -table that contained the desired data element.
>>>>> -Note that line~\lnref{get_curbkt} also passes back the index that will be
>>>>> -used to select the correct set of pointers from the pair in each element.
>>>>>  The loop spanning lines~\lnref{loop:b}-\lnref{loop:e} searches the bucket,
>>>>>  so that if line~\lnref{match}
>>>>>  detects a match,
>>>>> @@ -1144,22 +1105,17 @@ failure.
>>>>>  \end{lineref}
>>>>>
>>>>>  \QuickQuiz{}
>>>>> -    In the \co{hashtab_lookup()} function in
>>>>> -    Listing~\ref{lst:datastruct:Resizable Hash-Table Access Functions},
>>>>> -    the code carefully finds the right bucket in the new hash table
>>>>> -    if the element to be looked up has already been distributed
>>>>> -    by a concurrent resize operation.
>>>>> -    This seems wasteful for RCU-protected lookups.
>>>>> -    Why not just stick with the old hash table in this case?
>>>>> +    \begin{lineref}[ln:datastruct:hash_resize:access:lkp]
>>>>> +    What if execution reaches line~\lnref{loop:b}
>>>>> +    of \co{hashtab_lookup()} in
>>>>> +    Listing~\ref{lst:datastruct:Resizable Hash-Table Access Functions}
>>>>> +    just after this bucket has been resized.
>>>>> +    Won't that result in lookup failures?
>>>>> +    \end{lineref}
>>>>>  \QuickQuizAnswer{
>>>>> -    Suppose that a resize operation begins and distributes half of
>>>>> -    the old table's buckets to the new table.
>>>>> -    Suppose further that a thread adds a new element that goes into
>>>>> -    one of the already-distributed buckets, and that this same thread
>>>>> -    now looks up this newly added element.
>>>>> -    If lookups unconditionally traversed only the old hash table,
>>>>> -    this thread would get a lookup failure for the element that it
>>>>> -    just added, which certainly sounds like a bug to me!
>>>>> +    No, it won't.
>>>>> +    Resizing into the new hash table leaves the old hash table
>>>>> +    intact, courtesy of the pointer pairs.
>>>>>  } \QuickQuizEnd
>>>>>
>>>>>  \begin{lineref}[ln:datastruct:hash_resize:access:add]
>>>>>
>>
>> Hi Paul and Akira,
>>
>> Thanks a lot for the comments, which I need some more time to look
>> into. For Paul's patch, I have a few concerns. Please take a look.
>>
>> My understanding is that with this path, during the time period when
>> the resizing thread is running, an updater may insert/delete an item
>> into/from the new hash table, while readers are still looking up data
>> in the old one, resulting the readers are unaware of
>> insertions/deletions happening simultaneously. For example, it seems
>> the following sequence could happen.
>>
>> 1. The resizing thread starts.
>> 2. The resizing thread successfully passes bucket *B* of the old hash table.
>> 3. An updater wants to insert a new item *I* which should be inserted
>> into bucket *B*.
>> 4. The updater will select the new hash table and insert the item *I*
>> into the new hash table.
>> 5. A read request comes in and wants to lookup item *I*. The lookup
>> request will check the old hash table and fail. Doesn't it?
>> 6. The resizing thread exits.
>> 7. Now subsequent read requests can successfully find item *I*.
> 
> Yes, this can happen.
> 
>> Is my understanding correct? Please let me know if I misunderstood
>> anything. Give the truth that this patch can accelerate the fast path,
>> I think it should be OK because resizing is typically happen rarely.
>> Just want to make sure I fully understand the algorithm.
> 
> It is a design choice, and some users would prefer not to fail to see
> new items during a resize.  One approach would be to revert back to
> the old-style checking, and another would be to provide a separate
> lookup interface that synchronizes with adds and deletes.
> 
> So, I could add a quick quiz with this information, I could revert the
> change, or I could add another lookup function that provided more timely
> information.  Left to myself, I would provide a quick quiz, but what
> do you guys think?

Hi, I was composing a message, but now I'm replying to this one.
I think adding a quick quiz would be a good idea.

        Thanks, Akira

> 
> 							Thanx, Paul
>