On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@xxxxxxx> wrote:
>> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
>>> As for clear_shadow_entry mentioned in the opening mail, the content is:
>>>
>>>         spin_lock(&mapping->host->i_lock);
>>>         xa_lock_irq(&mapping->i_pages);
>>>         __clear_shadow_entry(mapping, index, entry);
>>>         xa_unlock_irq(&mapping->i_pages);
>>>         if (mapping_shrinkable(mapping))
>>>                 inode_add_lru(mapping->host);
>>>         spin_unlock(&mapping->host->i_lock);
>>>
>>> so for all I know it's all about the xarray thing, not the i_lock per se.
>>
>> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq,
>> hence I concluded it to be i_lock.
>
> I'm not disputing that it was i_lock. I am claiming that i_pages is taken
> immediately afterwards, and it may be that in your workload that is the
> lock with the actual contention problem, making i_lock a red herring.
>
> I tried to match up the offsets to my own kernel binary, but things went
> haywire. Can you please resolve a bunch of symbols, like this:
>
> ./scripts/faddr2line vmlinux clear_shadow_entry+92
>
> and then paste the source code from the reported lines? (I presume you
> are running with some local patches, so opening the relevant files in my
> repo may still give bogus results.)
>
> Addresses are:
> clear_shadow_entry+92
> __remove_mapping+98
> __filemap_add_folio+332
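As an aside, scripts/faddr2line also accepts several func+offset arguments
in one go, so the three addresses can be resolved in a single invocation;
a minimal sketch, assuming the vmlinux matching the running kernel (built
with debug info) is used:

  # resolve all three reported offsets against the same vmlinux
  $ ./scripts/faddr2line vmlinux clear_shadow_entry+92 \
        __remove_mapping+98 __filemap_add_folio+332

Resolved individually, the output is: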
clear_shadow_entry+92

$ ./scripts/faddr2line vmlinux clear_shadow_entry+92
clear_shadow_entry+92/0x180:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) clear_shadow_entry at mm/truncate.c:51

 42 static void clear_shadow_entry(struct address_space *mapping,
 43                                struct folio_batch *fbatch, pgoff_t *indices)
 44 {
 45         int i;
 46
 47         if (shmem_mapping(mapping) || dax_mapping(mapping))
 48                 return;
 49
 50         spin_lock(&mapping->host->i_lock);
 51         xa_lock_irq(&mapping->i_pages);

__remove_mapping+98

$ ./scripts/faddr2line vmlinux __remove_mapping+98
__remove_mapping+98/0x230:
spin_lock_irq at include/linux/spinlock.h:376
(inlined by) __remove_mapping at mm/vmscan.c:695

684 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
685                             bool reclaimed, struct mem_cgroup *target_memcg)
686 {
687         int refcount;
688         void *shadow = NULL;
689
690         BUG_ON(!folio_test_locked(folio));
691         BUG_ON(mapping != folio_mapping(folio));
692
693         if (!folio_test_swapcache(folio))
694                 spin_lock(&mapping->host->i_lock);
695         xa_lock_irq(&mapping->i_pages);

__filemap_add_folio+332

$ ./scripts/faddr2line vmlinux __filemap_add_folio+332
__filemap_add_folio+332/0x480:
spin_lock_irq at include/linux/spinlock.h:377
(inlined by) __filemap_add_folio at mm/filemap.c:878

851 noinline int __filemap_add_folio(struct address_space *mapping,
852                 struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
853 {
854         XA_STATE(xas, &mapping->i_pages, index);
...
874         for (;;) {
875                 int order = -1, split_order = 0;
876                 void *entry, *old = NULL;
877
878                 xas_lock_irq(&xas);
879                 xas_for_each_conflict(&xas, entry) {
> Most notably, in __remove_mapping the i_lock is conditional:
>
>         if (!folio_test_swapcache(folio))
>                 spin_lock(&mapping->host->i_lock);
>         xa_lock_irq(&mapping->i_pages);
>
> and the disasm at the reported offset in my case does not match either
> acquire. For all I know the i_lock in this routine is *not* taken, and
> all the queued-up __remove_mapping callers increase the i_lock ->
> i_pages wait times in clear_shadow_entry.
So the first two are on the i_pages lock (xa_lock_irq), and the last one is
the xa_lock taken via xas_lock_irq(&xas) on the same i_pages xarray.
> To my cursory reading, the i_lock in clear_shadow_entry can be hacked
> away with some effort, but should that happen the contention is going to
> shift to i_pages, presumably with more soft lockups (just on that lock
> instead). I am not convinced messing with it is justified. From a quick
> look, the i_lock does not appear to be a problem in other spots, fwiw.
>
> All that said, even if it is i_lock in both cases *and* someone whacks
> it, the mm folk should look into what happens while the (maybe i_lock ->)
> i_pages lock is held. To that end perhaps you could provide a flamegraph
> or the output of perf record -a -g, I don't know what's preferred.
I have attached the flamegraph, but this is for the kernel that has been
running with all the accumulated fixes so far. The original one (w/o the
fixes) did show considerable time spent in native_queued_spin_lock_slowpath,
but unfortunately I am unable to locate it now.
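In case it helps with capturing the same data again, a minimal sketch of one
common way to produce such a flamegraph, assuming the stackcollapse-perf.pl
and flamegraph.pl scripts from the FlameGraph repository are available in
the current directory (the 30-second window is arbitrary):

  # system-wide sampling with call graphs
  $ perf record -a -g -- sleep 30

  # fold the recorded stacks and render the SVG
  $ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg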
Regards,
Bharata.
Attachment: perf 1.svg