Re: Hard and soft lockups with FIO and LTP runs on a large system


On 17-Jul-24 3:07 PM, Vlastimil Babka wrote:
On 7/9/24 6:30 AM, Bharata B Rao wrote:
On 08-Jul-24 9:47 PM, Yu Zhao wrote:
On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@xxxxxxx> wrote:

Hi Yu Zhao,

Thanks for your patches. See below...

On 07-Jul-24 4:12 AM, Yu Zhao wrote:
Hi Bharata,

On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@xxxxxxx> wrote:

<snip>

Some experiments tried
======================
1) When MGLRU was enabled, many soft lockups were observed; no hard
lockups were seen during a 48-hour run. Below is one such soft lockup.

This is not really an MGLRU issue -- can you please try one of the
attached patches? It (truncate.patch) should help with or without
MGLRU.

With truncate.patch and the default LRU scheme, a few hard lockups are seen.

Thanks.

In your original report, you said:

    Most of the time, the two contended locks are the lruvec and
    inode->i_lock spinlocks.
    ...
    Often, the perf output at the time of the problem shows
    heavy contention on the lruvec spinlock. Similar contention is
    also observed with the inode i_lock (in the clear_shadow_entry path)

Based on this new report, does it mean the i_lock is not as contended
for the same path (truncation) you tested? If so, I'll post
truncate.patch and add Reported-by and Tested-by tags for you, unless
you have objections.

truncate.patch has been tested on two systems with the default LRU scheme,
and the lockup due to inode->i_lock has not been seen after a 24-hour run.


The two paths below were contended on the LRU lock, but they already
batch their operations. So I don't know what else we can do surgically
to improve them.

What has been seen with this workload is that the lruvec spinlock is
held for a long time from the shrink_[active/inactive]_list path. In this
path, there is a case in isolate_lru_folios() where scanning of the LRU
lists can become unbounded. To isolate a page from ZONE_DMA, scanning and
skipping of more than 150 million folios was sometimes seen. There is

It seems weird to me to see anything that would require ZONE_DMA allocation
on a modern system. Do you know where it comes from?

We measured the lruvec spinlock start time, end time and hold
time (htime) using sched_clock(), along with a BUG() if the hold time
exceeded 10s. The case below shows the lruvec spinlock being held for ~25s.

vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime 27963324369895, htime 25889317166 (time in ns)

kernel BUG at include/linux/memcontrol.h:1677!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W 6.10.0-rc3-qspindbg #10
RIP: 0010:shrink_active_list+0x40a/0x520
Call Trace:
  <TASK>
  shrink_lruvec+0x981/0x13b0
  shrink_node+0x358/0xd30
  balance_pgdat+0x3a3/0xa60
  kswapd+0x207/0x3a0
  kthread+0xe1/0x120
  ret_from_fork+0x39/0x60
  ret_from_fork_asm+0x1a/0x30
  </TASK>

As you can see, the call stack is from kswapd, but I am not sure what the exact trigger is.

Regards,
Bharata.



