Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless

On 2023/7/4 11:45, Qi Zheng wrote:


On 2023/7/4 00:39, Paul E. McKenney wrote:
On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
On 6/22/23 10:53, Qi Zheng wrote:
@@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
      if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
          return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
-    if (!down_read_trylock(&shrinker_rwsem))
-        goto out;
-
-    list_for_each_entry(shrinker, &shrinker_list, list) {
+    rcu_read_lock();
+    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
          struct shrink_control sc = {
              .gfp_mask = gfp_mask,
              .nid = nid,
              .memcg = memcg,
          };
+        if (!shrinker_try_get(shrinker))
+            continue;
+        rcu_read_unlock();

I don't think you can do this unlock?

Sorry to be slow to respond here, this one fell through the cracks.
And thank you to Qi for reminding me!

If you do this unlock, you had jolly well better nail down the current
element (the one referenced by shrinker), for example, by acquiring an
explicit reference count on the object.  And presumably this is exactly
what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
at least assuming that shrinker->refcount is set to zero before the call
to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
start until *after* shrinker_put() calls complete().  Plus, as always,
the object must be removed from the list before the synchronize_rcu()
starts.  (On these parts of the puzzle, I defer to those more familiar
with this code path.  And I strongly suggest carefully commenting this
type of action-at-a-distance design pattern.)
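For reference, a minimal sketch of what such helpers could look like, based
only on the semantics described above (a reference count plus a completion
that is signalled on the final put). The includes and field names here are
illustrative for this sketch, not necessarily what the patch uses:

#include <linux/list.h>
#include <linux/refcount.h>
#include <linux/completion.h>

struct shrinker {
        struct list_head list;
        refcount_t refcount;       /* holds an initial reference while registered */
        struct completion done;    /* completed when the last reference is dropped */
        /* ... */
};

static bool shrinker_try_get(struct shrinker *shrinker)
{
        /* Fails once the initial reference has been dropped on unregister. */
        return refcount_inc_not_zero(&shrinker->refcount);
}

static void shrinker_put(struct shrinker *shrinker)
{
        if (refcount_dec_and_test(&shrinker->refcount))
                complete(&shrinker->done);
}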

Yeah, I think I've done it like above. A more detailed timing diagram is
below.


Why is this important?  Because otherwise that object might be freed
before you get to the call to rcu_read_lock() at the end of this loop.
And if that happens, list_for_each_entry_rcu() will be walking the
freelist, which is quite bad for the health and well-being of your kernel.

There are a few other ways to make this sort of thing work:

1.    Defer the shrinker_put() to the beginning of the loop.
    You would need a flag initially set to zero, and then set to
    one just before (or just after) the rcu_read_lock() above.
    You would also need another shrinker_old pointer to track the
    old pointer.  Then at the top of the loop, if the flag is set,
    invoke shrinker_put() on shrinker_old.    This ensures that the
    previous shrinker structure stays around long enough to allow
    the loop to find the next shrinker structure in the list.

    This approach is attractive when the removal code path
    can invoke shrinker_put() after the grace period ends.

2.    Make shrinker_put() invoke call_rcu() when ->refcount reaches
    zero, and have the callback function free the object.  This of
    course requires adding an rcu_head structure to the shrinker
    structure, which might or might not be a reasonable course of
    action.  If adding that rcu_head is reasonable, this simplifies
    the logic quite a bit.

3.    For the shrinker-structure-removal code path, remove the shrinker
    structure, then remove the initial count from ->refcount,
    and then keep doing grace periods until ->refcount is zero,
    then do one more.  Of course, if the result of removing the
    initial count was zero, then only a single additional grace
    period is required.

    This would need to be carefully commented, as it is a bit
    unconventional.

Thanks for such a detailed addition!


There are probably many other ways, but just to give an idea of a few
other ways to do this.
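To make option 2 concrete, a rough and purely illustrative sketch (it assumes
the shrinker structure gains a struct rcu_head and is freed with kfree();
this is one way that alternative could look, not what the patch does):

/* struct shrinker would additionally need: struct rcu_head rcu; */

static void shrinker_free_rcu_cb(struct rcu_head *head)
{
        struct shrinker *shrinker = container_of(head, struct shrinker, rcu);

        kfree(shrinker);
}

static void shrinker_put(struct shrinker *shrinker)
{
        /* Last reference gone: free the object after a grace period. */
        if (refcount_dec_and_test(&shrinker->refcount))
                call_rcu(&shrinker->rcu, shrinker_free_rcu_cb);
}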

+
          ret = do_shrink_slab(&sc, shrinker, priority);
          if (ret == SHRINK_EMPTY)
              ret = 0;
          freed += ret;
-        /*
-         * Bail out if someone want to register a new shrinker to
-         * prevent the registration from being stalled for long periods
-         * by parallel ongoing shrinking.
-         */
-        if (rwsem_is_contended(&shrinker_rwsem)) {
-            freed = freed ? : 1;
-            break;
-        }
-    }
-    up_read(&shrinker_rwsem);
-out:
+        rcu_read_lock();

That new rcu_read_lock() won't help AFAIK, the whole
list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
safe.

Yeah, that's the pattern we've been taught and the one we can look
at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I
think it *might* be safe.

*However.*

Right now I don't have time to go through a novel RCU list iteration
pattern one step at a time to determine the correctness of the
algorithm. I'm mostly worried about list manipulations that can
occur outside the rcu_read_lock() section bleeding into the RCU
critical section, because rcu_read_lock() by itself is not a memory
barrier.

Maybe Paul has seen this pattern often enough he could simply tell
us what conditions it is safe in. But for me to work that out from
first principles? I just don't have the time to do that right now.

If the code does just the right sequence of things on the removal path
(remove, decrement reference, wait for reference to go to zero, wait for
grace period, free), then it would work.  If this is what is happening,
I would argue for more comments.  ;-)

The order of the removal path is slightly different from this:

     shrink_slab                          unregister_shrinker
     ===========                          ===================

     shrinker_try_get()
     rcu_read_unlock()
                                          1. decrement initial reference
                                             shrinker_put()
                                          2. wait for reference to go to zero
                                             wait_for_completion()
     rcu_read_lock()

     shrinker_put()
                                          3. remove the shrinker from list
                                             list_del_rcu()
                                          4. wait for grace period
                                             kfree_rcu()/synchronize_rcu()

     list_for_each_entry()

     shrinker_try_get()
     rcu_read_unlock()
                                          5. free the shrinker

So the order is: decrement reference, wait for reference to go to zero,
remove, wait for grace period, free.

I think this can work. And we can only do *step 3* after we hold the
RCU read lock again, right? Please let me know if I missed something.

Oh, you are right, it would be better to move step 3 ahead of step 1. We
should first remove the shrinker from the shrinker_list to prevent
other traversers from finding it again; otherwise the following
situation may occur in theory:

CPU 0                 CPU 1

shrinker_try_get()

                      shrinker_try_get()

shrinker_put()
shrinker_try_get()
                      shrinker_put()
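
For illustration, a rough sketch of the removal order described above, with
the list removal done first. The helpers, the rcu field used by kfree_rcu(),
and the assumption that the shrinker is heap-allocated are all part of the
sketch, not necessarily the patch:

void unregister_shrinker(struct shrinker *shrinker)
{
        /* 1. Unpublish it so lockless walkers can no longer find it. */
        list_del_rcu(&shrinker->list);

        /* 2. Drop the initial reference taken at registration time. */
        shrinker_put(shrinker);

        /* 3. Wait for all in-flight users to drop their references. */
        wait_for_completion(&shrinker->done);

        /*
         * 4. Wait a grace period so concurrent list walkers are done
         *    with this entry, then free it.
         */
        kfree_rcu(shrinker, rcu);
}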

Thanks,
Qi




                            Thanx, Paul

IIUC this is why Dave in [4] suggests unifying shrink_slab() with
shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.

Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...
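
For context, the pattern being referred to looks roughly like this (a hedged
sketch; shrinker_idr_get() is an illustrative name, not existing kernel API,
and it reuses the shrinker_try_get()/shrinker_put() refcount helpers sketched
earlier in the thread):

/*
 * RCU-protected IDR lookup plus a reference count taken on the object
 * before leaving the RCU read-side critical section.
 */
struct shrinker *shrinker_idr_get(struct idr *idr, int id)
{
        struct shrinker *shrinker;

        rcu_read_lock();
        shrinker = idr_find(idr, id);
        if (shrinker && !shrinker_try_get(shrinker))
                shrinker = NULL;        /* already being torn down */
        rcu_read_unlock();

        return shrinker;                /* caller does shrinker_put() when done */
}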

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx


