Re: Hard and soft lockups with FIO and LTP runs on a large system


 



On 26-Jul-24 8:56 AM, Zhaoyang Huang wrote:
On Thu, Jul 25, 2024 at 6:00 PM zhaoyang.huang
<zhaoyang.huang@xxxxxxxxxx> wrote:
<snip>
 From the callstack of the lock holder, this looks like a scalability issue rather than a deadlock. Unlike legacy LRU management, there is so far no throttling mechanism for global reclaim under MGLRU. Could we apply a similar method to throttle reclaim when it is too aggressive? I am wondering if this patch, which is a rough version, could help here.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..827036e21f24 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4520,6 +4520,50 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
         return scanned;
  }

+static void lru_gen_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+       struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+       if (current_is_kswapd()) {
+               if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
+                       set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+
+               /* Allow kswapd to start writing pages during reclaim. */
+               if (sc->nr.unqueued_dirty == sc->nr.file_taken)
+                       set_bit(PGDAT_DIRTY, &pgdat->flags);
+
+               if (sc->nr.immediate)
+                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+       }
+
+       /*
+        * Tag a node/memcg as congested if all the dirty pages were marked
+        * for writeback and immediate reclaim (counted in nr.congested).
+        *
+        * Legacy memcg will stall in page writeback so avoid forcibly
+        * stalling in reclaim_throttle().
+        */
+       if (sc->nr.dirty && (sc->nr.dirty / 2 < sc->nr.congested)) {
+               if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
+                       set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
+
+               if (current_is_kswapd())
+                       set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
+       }
+
+       /*
+        * Stall direct reclaim for IO completions if the lruvec or
+        * node is congested. Allow kswapd to continue until it
+        * starts encountering unqueued dirty pages or cycling through
+        * the LRU too quickly.
+        */
+       if (!current_is_kswapd() && current_may_throttle() &&
+           !sc->hibernation_mode &&
+           (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
+            test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
+               reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
+}
+
  static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
  {
         int type;
@@ -4552,6 +4596,16 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
  retry:
         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
         sc->nr_reclaimed += reclaimed;
+       sc->nr.dirty += stat.nr_dirty;
+       sc->nr.congested += stat.nr_congested;
+       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+       sc->nr.writeback += stat.nr_writeback;
+       sc->nr.immediate += stat.nr_immediate;
+       sc->nr.taken += scanned;
+
+       if (type)
+               sc->nr.file_taken += scanned;
+
         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
                         scanned, reclaimed, &stat, sc->priority,
                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
@@ -5908,6 +5962,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)

         if (lru_gen_enabled() && root_reclaim(sc)) {
                 lru_gen_shrink_node(pgdat, sc);
+               lru_gen_throttle(pgdat, sc);
                 return;
         }

Hi Bharata,
This patch arose from an Android regression test failure in which each
of more than 8 threads allocated 1GB of virtual memory on a 5.5GB RAM
system. The test passed with legacy LRU management but failed under
MGLRU: a watchdog monitor detected an abnormal system-wide scheduling
state (the watchdog could not be scheduled within 60 seconds). With the
slight change below, this patch passed the test, although why it helps
has not been investigated in depth. In theory, the patch introduces a
reclaim-throttling mechanism similar to the legacy one, which could
reduce contention on lruvec->lru_lock. I think this patch is quite
naive for now, but I hope it helps, since your case looks like a
scalability issue under memory pressure rather than a deadlock.
Thank you!
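For readers following along without a kernel tree at hand, the core decision made by lru_gen_throttle() above can be sketched as plain userspace C. This is only an illustrative model of the patch's logic, not kernel code; the struct and function names below are made up for the sketch:

```c
/*
 * Userspace sketch of the throttling decision in the proposed
 * lru_gen_throttle(); names here are illustrative, not kernel APIs.
 */
#include <assert.h>
#include <stdbool.h>

struct reclaim_stats {
	unsigned long dirty;      /* dirty folios seen this reclaim cycle */
	unsigned long congested;  /* dirty folios already under writeback
	                           * and marked for immediate reclaim */
};

/*
 * Mirrors the "sc->nr.dirty / 2 < sc->nr.congested" test: tag the
 * node/memcg as congested when more than half of the dirty folios
 * encountered are write-congested.
 */
static bool node_is_congested(const struct reclaim_stats *s)
{
	return s->dirty && (s->dirty / 2 < s->congested);
}

/*
 * A direct reclaimer stalls for I/O completions only when it is
 * allowed to throttle and the node has been tagged congested;
 * kswapd never stalls on this path.
 */
static bool should_stall(bool is_kswapd, bool may_throttle, bool congested)
{
	return !is_kswapd && may_throttle && congested;
}
```

The point of the sketch is that throttling is gated on observed writeback congestion, so direct reclaimers back off (and stop hammering lruvec->lru_lock) only when reclaim is clearly outpacing I/O.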

The change in the applied version (throttle the reclaim before
shrinking instead of after):
          if (lru_gen_enabled() && root_reclaim(sc)) {
  +               lru_gen_throttle(pgdat, sc);
                  lru_gen_shrink_node(pgdat, sc);
  -               lru_gen_throttle(pgdat, sc);
                  return;
          }

Thanks Zhaoyang Huang for the patch, will give this a test and report back.

Regards,
Bharata.



