Performance implications of shrinking deferred objects proportional to priority

Hi all,

While testing a 5.15.x kernel with one of our workloads, we noticed a significant drop in performance. We tracked the cause down to Linux commit 18bb473e5031213ebfa9a622c0b0f8cdcb8a5371 ("mm: vmscan: shrink deferred objects proportional to priority"). When the application enters direct reclaim it is not able to free up enough pages, so it enters the reclaim path more and more often yet reclaims very little memory each time, resulting in performance degradation. This happens because total_scan is scaled down so much that it no longer satisfies the condition for calling the shrinker:

static unsigned long do_shrink_slab(...)
{
        ...
        total_scan = nr >> priority;
        ...
        while (total_scan >= batch_size ||
               total_scan >= freeable) {
                unsigned long ret;
                unsigned long nr_to_scan = min(batch_size, total_scan);

                shrinkctl->nr_to_scan = nr_to_scan;
                shrinkctl->nr_scanned = nr_to_scan;
                ret = shrinker->scan_objects(shrinker, shrinkctl);
                ...
        }
}
https://github.com/torvalds/linux/blob/8bb7eca972ad531c9b149c0a51ab43a417385813/mm/vmscan.c#L692


Perf profile with the commit:
Here we can see that the very high number of calls into the page reclaim path leads to severe lock contention.

Samples: 19K of event 'cycles', Event count (approx.): 574013142825
  Overhead       Samples  Symbol
-   15.10%          3008  [k] native_queued_spin_lock_slowpath                                                         
   - native_queued_spin_lock_slowpath                                                                                   
      - 12.20% __lock_text_start                                                                                        
         - 7.54% shrink_inactive_list                                                                                   
              shrink_lruvec                                                                                             
              shrink_node                                                                                               
              do_try_to_free_pages                                                                                      
              try_to_free_pages                                                                                         
              __alloc_pages_slowpath.constprop.0                                                                       
            + __alloc_pages                                                                                             
         - 3.87% lru_note_cost                                                                                          
              shrink_inactive_list                                                                                      
              shrink_lruvec                                                                                             
              shrink_node                                                                                               
              do_try_to_free_pages                                                                                      
              try_to_free_pages                                                                                         
              __alloc_pages_slowpath.constprop.0                                                                        
            - __alloc_pages                                                                                             
               + 1.40% page_cache_ra_unbounded                                                                          
               + 1.19% skb_page_frag_refill                                                                         
               + 1.17% pagecache_get_page                                                                             
         + 0.63% __remove_mapping                                                                                       
      - 1.42% _raw_spin_lock                                                                                            
         + 0.73% rmqueue_bulk                                                                                          
      + 1.41% _raw_spin_lock_irqsave                                                                                    
+    8.97%          1722  [k] copy_user_enhanced_fast_string                                                            
+    4.21%           823  [.] tc_deletearray_aligned_nothrow                                                            
+    3.35%           673  [k] intel_idle


Reverting the commit resolved our issue. Here is the profile from the same test on a kernel without the commit, which looks healthy:
+   13.51%          2721  [k] intel_idle
+    7.93%          1579  [k] copy_user_enhanced_fast_string
+    6.20%          1165  [.] tc_deletearray_aligned_nothrow
+    3.21%           626  [.] tc_calloc
+    2.94%           617  [k] poll_idle
+    2.66%           501  [k] native_queued_spin_lock_slowpath


We added perf probes to do_shrink_slab() and recorded a trace to get a picture of the kind of numbers involved.
An example from one of the traces:

       kswapd0   530 [000] 55811.133435:           vmscan:mm_shrink_slab_start: super_cache_scan 0xffff96be8cac6c10: nid: 0 objects to shrink 592560 gfp_flags GFP_KERNEL cache items 332984 delta 650 total_scan 1228 priority 10
         kswapd0   530 [000] 55811.133436:             probe:do_shrink_slab_59_0: (ffffffff86208fc0) shrinkctl=0xffff9e60c172fcb8 shrinker=0xffff96be8cac6c10 priority=10 delta=0x28a total_scan=1228 freeable=332984 nr=592560 batch_size=1024
         kswapd0   530 [000] 55811.133437:             probe:do_shrink_slab_59_1: (ffffffff86208fc7) shrinkctl=0xffff9e60c172fcb8 shrinker=0xffff96be8cac6c10 priority=10 delta=0x28a total_scan=1228 nr=592560 batch_size=1024
         kswapd0   530 [000] 55811.133438:             probe:do_shrink_slab_59_2: (ffffffff86208fca) shrinkctl=0xffff9e60c172fcb8 shrinker=0xffff96be8cac6c10 delta=0x28a total_scan=1228 nr=592560 batch_size=1024


nr = 592560
freeable = 332984
total_scan = 1228
priority = 10


The delta value of 650 comes from the following calculation (this shrinker uses DEFAULT_SEEKS, so shrinker->seeks = 2):

332984 >> 10 = 325
325 * 4 = 1300
1300 / 2 = 650 = 0x28a

        if (shrinker->seeks) {
                delta = freeable >> priority;
                delta *= 4;
                do_div(delta, shrinker->seeks);
        }


total_scan is then nr >> priority plus that delta:

592560 >> 10 = 578
578 + 650 = 1228


So in this case total_scan = 1228 is only just above the batch_size of 1024. I had also seen kswapd run at priority 12, but it was not able to reclaim memory, so its priority was bumped in balance_pgdat():

        if (raise_priority || !nr_reclaimed)
                sc.priority--;

Our application's direct reclaim priority stayed constant at 12, so it rarely met the condition needed to call the shrinker.
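For illustration, here is a small userspace sketch (ours, not kernel code) that replays the do_shrink_slab() arithmetic with the values captured in the trace above, assuming shrinker->seeks = 2 (DEFAULT_SEEKS) and batch_size = 1024 as reported in the trace:

        #include <stdio.h>

        /*
         * Replays the delta/total_scan arithmetic from do_shrink_slab()
         * for the super_cache_scan values captured in the trace above.
         * Userspace sketch for illustration only.
         */
        int main(void)
        {
                unsigned long nr = 592560;        /* deferred objects from the trace */
                unsigned long freeable = 332984;  /* cache items from the trace */
                unsigned long seeks = 2;          /* DEFAULT_SEEKS for this shrinker */
                unsigned long batch_size = 1024;  /* batch_size from the trace */

                for (int priority = 12; priority >= 10; priority--) {
                        unsigned long delta = (freeable >> priority) * 4 / seeks;
                        unsigned long total_scan = (nr >> priority) + delta;

                        printf("priority %2d: delta=%lu total_scan=%lu -> %s\n",
                               priority, delta, total_scan,
                               total_scan >= batch_size ? "shrinker called"
                                                        : "shrinker skipped");
                }
                return 0;
        }

With these values the batch_size condition is only met at priority 10; at priorities 11 and 12 the shrinker is skipped entirely, which is consistent with what we saw from our application running at priority 12.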


I understand that the commit in question was made when Yang Shi was dealing with trillions of deferred objects, but I wonder whether we should have a range of scaling, say a scale factor proportional to nr_deferred. We could perhaps leave the smaller counts unscaled, so that nr_deferred is not brought down so low that the process cannot reclaim anything at all. I am not an expert in this area, so I thought I would share the details of what we found in our testing in case it is relevant to anyone; a rough sketch of the idea follows.
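For illustration only, and not a tested patch, this is the kind of conditional scaling we have in mind, written against the total_scan calculation in do_shrink_slab(). The NR_DEFERRED_SCALE_THRESHOLD name and its value are placeholders we made up:

        /*
         * Illustrative sketch only: scale the deferred count by priority
         * only when it is large enough that the scaled value still leaves
         * a useful amount of work, so that modest deferred counts are not
         * shrunk below the batch size and reclaim can still make progress.
         *
         * NR_DEFERRED_SCALE_THRESHOLD is a made-up placeholder.
         */
        #define NR_DEFERRED_SCALE_THRESHOLD     (1UL << 20)

                if (nr >= NR_DEFERRED_SCALE_THRESHOLD)
                        total_scan = nr >> priority;    /* current behaviour for huge backlogs */
                else
                        total_scan = nr;                /* keep small deferred counts unscaled */
                total_scan += delta;

With something like this, the huge nr_deferred case that motivated the commit would still be scaled down, while workloads like ours would keep enough total_scan to reach the batch_size threshold.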


We will be happy to test any changes you suggest and collect data. Please let us know if there is any other information you'd like us to collect that will help better understand the problem.

Thanks
Sri
    




