On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@xxxxxxx> wrote:
>> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
>>> As for clear_shadow_entry mentioned in the opening mail, the content is:
>>>
>>>         spin_lock(&mapping->host->i_lock);
>>>         xa_lock_irq(&mapping->i_pages);
>>>         __clear_shadow_entry(mapping, index, entry);
>>>         xa_unlock_irq(&mapping->i_pages);
>>>         if (mapping_shrinkable(mapping))
>>>                 inode_add_lru(mapping->host);
>>>         spin_unlock(&mapping->host->i_lock);
>>>
>>> so for all I know it's all about the xarray thing, not the i_lock per se.
>>
>> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq,
>> hence I concluded it to be i_lock.
>
> I'm not disputing that it was i_lock. I am claiming that i_pages is taken
> immediately afterwards, and it may be that in your workload that is the
> lock with the actual contention problem, making i_lock a red herring.
>
> I tried to match up the offsets to my own kernel binary, but things went
> haywire. Can you please resolve a bunch of symbols, like this:
>
> ./scripts/faddr2line vmlinux clear_shadow_entry+92
>
> and then paste the source code from the reported lines? (I presume you
> are running with some local patches, so opening the relevant files in my
> repo may still give bogus results.)
>
> Addresses are:
> clear_shadow_entry+92
> __remove_mapping+98
> __filemap_add_folio+332
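As an aside, scripts/faddr2line also accepts several func+offset arguments
in one go, so the three addresses can be resolved in a single invocation;
a minimal sketch, assuming the vmlinux matching the running kernel (built
with debug info) is used:

  # resolve all three reported offsets against the same vmlinux
  $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 \
        __remove_mapping+98 __filemap_add_folio+332

Resolved individually, the output is: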
clear_shadow_entry+92

$ ./scripts/faddr2line vmlinux clear_shadow_entry+92
clear_shadow_entry+92/0x180:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) clear_shadow_entry at mm/truncate.c:51

 42 static void clear_shadow_entry(struct address_space *mapping,
 43                                struct folio_batch *fbatch, pgoff_t *indices)
 44 {
 45         int i;
 46
 47         if (shmem_mapping(mapping) || dax_mapping(mapping))
 48                 return;
 49
 50         spin_lock(&mapping->host->i_lock);
 51         xa_lock_irq(&mapping->i_pages);

__remove_mapping+98

$ ./scripts/faddr2line vmlinux __remove_mapping+98
__remove_mapping+98/0x230:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) __remove_mapping at mm/vmscan.c:695

684 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
685                             bool reclaimed, struct mem_cgroup *target_memcg)
686 {
687         int refcount;
688         void *shadow = NULL;
689
690         BUG_ON(!folio_test_locked(folio));
691         BUG_ON(mapping != folio_mapping(folio));
692
693         if (!folio_test_swapcache(folio))
694                 spin_lock(&mapping->host->i_lock);
695         xa_lock_irq(&mapping->i_pages);

__filemap_add_folio+332

$ ./scripts/faddr2line vmlinux __filemap_add_folio+332
__filemap_add_folio+332/0x480:
spin_lock_irq at include/linux/spinlock.h:377
(inlined by) __filemap_add_folio at mm/filemap.c:878

851 noinline int __filemap_add_folio(struct address_space *mapping,
852                 struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
853 {
854         XA_STATE(xas, &mapping->i_pages, index);
...
874         for (;;) {
875                 int order = -1, split_order = 0;
876                 void *entry, *old = NULL;
877
878                 xas_lock_irq(&xas);
879                 xas_for_each_conflict(&xas, entry) {
> Most notably, in __remove_mapping the i_lock is conditional:
>
>         if (!folio_test_swapcache(folio))
>                 spin_lock(&mapping->host->i_lock);
>         xa_lock_irq(&mapping->i_pages);
>
> and the disasm at the reported offset in my case does not match either
> acquire. For all I know the i_lock in this routine is *not* taken, and
> all the queued-up __remove_mapping callers increase the i_lock ->
> i_pages wait times in clear_shadow_entry.
So the first two are on the i_pages lock (xa_lock_irq), and the last one is
the xa_lock taken via xas_lock_irq(&xas) on the same i_pages xarray.
> To my cursory reading, the i_lock in clear_shadow_entry can be hacked
> away with some effort, but should that happen the contention is going to
> shift to i_pages, presumably with more soft lockups (just on that lock
> instead). I am not convinced messing with it is justified. From a quick
> look, the i_lock does not appear to be a problem in other spots, fwiw.
>
> All that said, even if it is i_lock in both cases *and* someone whacks
> it, the mm folk should look into what happens while the (maybe i_lock ->)
> i_pages lock is held. To that end perhaps you could provide a flamegraph
> or the output of perf record -a -g, I don't know what's preferred.
I have attached the flamegraph, but this is for the kernel that has been
running with all the accumulated fixes so far. The original one (w/o the
fixes) did show considerable time spent in native_queued_spin_lock_slowpath,
but unfortunately I am unable to locate it now.
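In case it helps with capturing the same data again, a minimal sketch of one
common way to produce such a flamegraph, assuming the stackcollapse-perf.pl
and flamegraph.pl scripts from the FlameGraph repository are available in
the current directory (the 30-second window is arbitrary):

  # system-wide sampling with call graphs
  $ perf record -a -g -- sleep 30

  # fold the recorded stacks and render the SVG
  $ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg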
Regards,
Bharata.
Attachment: perf 1.svg