Re: Hard and soft lockups with FIO and LTP runs on a large system

On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
On 09-Jul-24 11:28 AM, Yu Zhao wrote:
On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@xxxxxxx> wrote:

On 08-Jul-24 9:47 PM, Yu Zhao wrote:
On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@xxxxxxx> wrote:

Hi Yu Zhao,

Thanks for your patches. See below...

On 07-Jul-24 4:12 AM, Yu Zhao wrote:
Hi Bharata,

On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@xxxxxxx> wrote:

<snip>

Some experiments tried
======================
1) When MGLRU was enabled, many soft lockups were observed but no hard
lockups were seen over a 48-hour run. Below is one such soft lockup.

This is not really an MGLRU issue -- can you please try one of the
attached patches? It (truncate.patch) should help with or without
MGLRU.

With truncate.patch and the default LRU scheme, a few hard lockups are seen.

Thanks.

In your original report, you said:

    Most of the time the two contended locks are the lruvec and
    inode->i_lock spinlocks.
    ...
    Often, the perf output at the time of the problem shows
    heavy contention on the lruvec spin lock. Similar contention is
    also observed on the inode i_lock (in the clear_shadow_entry path).

Based on this new report, does it mean the i_lock is not as contended
for the same path (truncation) you tested? If so, I'll post
truncate.patch and add your Reported-by and Tested-by, unless you have
objections.

truncate.patch has been tested on two systems with the default LRU scheme,
and the lockup due to inode->i_lock hasn't been seen after a 24-hour run.

Thanks.


The two paths below were contended on the LRU lock, but they already
batch their operations. So I don't know what else we can do surgically
to improve them.

What has been seen with this workload is that the lruvec spinlock is
held for a long time in the shrink_[active/inactive]_list path. In this
path, there is a case in isolate_lru_folios() where scanning of the LRU
lists can become unbounded: to isolate a page from ZONE_DMA, scanning/
skipping of more than 150 million folios has sometimes been observed.
There is already a comment there explaining why nr_skipped shouldn't be
counted, but is there any possibility of revisiting this condition?

For this specific case, probably this can help:

@@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
                 if (folio_zonenum(folio) > sc->reclaim_idx ||
                                 skip_cma(folio, sc)) {
                         nr_skipped[folio_zonenum(folio)] += nr_pages;
-                       move_to = &folios_skipped;
-                       goto move;
+                       list_move(&folio->lru, &folios_skipped);
+                       if (spin_is_contended(&lruvec->lru_lock)) {
+                               if (!list_empty(dst))
+                                       break;
+                               spin_unlock_irq(&lruvec->lru_lock);
+                               cond_resched();
+                               spin_lock_irq(&lruvec->lru_lock);
+                       }
+                       continue;
                 }
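
The idea is to stop holding lru_lock across an unbounded skip loop: once the
lock is contended, either return whatever has been isolated so far, or drop
and retake the lock so waiters can make progress. For what it's worth, here is
a minimal userspace sketch of that "bounded lock hold time" pattern
(illustrative only; not kernel code; the pthread mutex, the BATCH size and all
names below are made up for the example):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NR_ITEMS 1000000        /* stand-in for the folios being skipped */
#define BATCH    1024           /* drop the lock after this many items   */

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static long processed;

static void scan_items(void)
{
	long i;

	pthread_mutex_lock(&list_lock);
	for (i = 0; i < NR_ITEMS; i++) {
		processed++;	/* stand-in for skipping/moving one folio */

		/*
		 * Roughly analogous to the spin_is_contended() check in
		 * the hunk above: don't hold the lock across the whole
		 * scan, periodically drop it and yield so waiters can run.
		 */
		if ((i + 1) % BATCH == 0) {
			pthread_mutex_unlock(&list_lock);
			sched_yield();		/* ~ cond_resched() */
			pthread_mutex_lock(&list_lock);
		}
	}
	pthread_mutex_unlock(&list_lock);
}

int main(void)
{
	scan_items();
	printf("processed %ld items\n", processed);
	return 0;
}

(Build with "gcc -pthread". The point is only that the long scan no longer
monopolizes the lock, which is what the cond_resched()/relock in the hunk
above achieves for lruvec->lru_lock.)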

Thanks, this helped. With this fix, the test ran for 24 hours without any lockups attributable to the lruvec spinlock. As noted earlier in this thread, isolate_lru_folios() used to scan millions of folios and spend a lot of time with the spinlock held, but with this fix such a scenario is no longer seen.

However, during the weekend MGLRU-enabled run (with the above fix to isolate_lru_folios(), the previous two patches truncate.patch and mglru.patch, and the inode fix provided by Mateusz), another hard lockup related to the lruvec spinlock was observed.

Here is the hard lockup:

watchdog: Watchdog detected hard LOCKUP on cpu 466
CPU: 466 PID: 3103929 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
  <NMI>
  ? show_regs+0x69/0x80
  ? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
  ? native_queued_spin_lock_slowpath+0x2b4/0x300
  </NMI>
  <IRQ>
  _raw_spin_lock_irqsave+0x5b/0x70
  folio_lruvec_lock_irqsave+0x62/0x90
  folio_batch_move_lru+0x9d/0x160
  folio_rotate_reclaimable+0xab/0xf0
  folio_end_writeback+0x60/0x90
  end_buffer_async_write+0xaa/0xe0
  end_bio_bh_io_sync+0x2c/0x50
  bio_endio+0x108/0x180
  blk_mq_end_request_batch+0x11f/0x5e0
  nvme_pci_complete_batch+0xb5/0xd0 [nvme]
  nvme_irq+0x92/0xe0 [nvme]
  __handle_irq_event_percpu+0x6e/0x1e0
  handle_irq_event+0x39/0x80
  handle_edge_irq+0x8c/0x240
  __common_interrupt+0x4e/0xf0
  common_interrupt+0x49/0xc0
  asm_common_interrupt+0x27/0x40

Here are the lock holder details captured by all-CPU backtrace:

NMI backtrace for cpu 75
CPU: 75 PID: 3095650 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:folio_inc_gen+0x142/0x430
Call Trace:
  <NMI>
  ? show_regs+0x69/0x80
  ? nmi_cpu_backtrace+0xc5/0x130
  ? nmi_cpu_backtrace_handler+0x11/0x20
  ? nmi_handle+0x64/0x180
  ? default_do_nmi+0x45/0x130
  ? exc_nmi+0x128/0x1a0
  ? end_repeat_nmi+0xf/0x53
  ? folio_inc_gen+0x142/0x430
  ? folio_inc_gen+0x142/0x430
  ? folio_inc_gen+0x142/0x430
  </NMI>
  <TASK>
  isolate_folios+0x954/0x1630
  evict_folios+0xa5/0x8c0
  try_to_shrink_lruvec+0x1be/0x320
  shrink_one+0x10f/0x1d0
  shrink_node+0xa4c/0xc90
  do_try_to_free_pages+0xc0/0x590
  try_to_free_pages+0xde/0x210
  __alloc_pages_noprof+0x6ae/0x12c0
  alloc_pages_mpol_noprof+0xd9/0x220
  folio_alloc_noprof+0x63/0xe0
  filemap_alloc_folio_noprof+0xf4/0x100
  page_cache_ra_unbounded+0xb9/0x1a0
  page_cache_ra_order+0x26e/0x310
  ondemand_readahead+0x1a3/0x360
  page_cache_sync_ra+0x83/0x90
  filemap_get_pages+0xf0/0x6a0
  filemap_read+0xe7/0x3d0
  blkdev_read_iter+0x6f/0x140
  vfs_read+0x25b/0x340
  ksys_read+0x67/0xf0
  __x64_sys_read+0x19/0x20
  x64_sys_call+0x1771/0x20d0
  do_syscall_64+0x7e/0x130

Regards,
Bharata.



