Hi,

The commit 53a59fc67f97 ("mm: limit mmu_gather batching to fix soft
lockups on !CONFIG_PREEMPT") fixed a soft lockup seen when large
processes exited. Today, on a large system, we are seeing it again:

NMI watchdog: BUG: soft lockup - CPU#1015 stuck for 21s! [forkoff:182534]
Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet ip_set nfnetlink bridge stp llc libcrc32c x_tables dm_mod ghash_generic gf128mul vmx_crypto rtc_generic tg3 ses enclosure scsi_transport_sas ptp pps_core libphy btrfs xor raid6_pq sd_mod crc32c_vpmsum ipr(X) libata sg scsi_mod autofs4 [last unloaded: ip_tables]
Supported: Yes, External
CPU: 1015 PID: 182534 Comm: forkoff Tainted: G 4.12.14-23-default #1 SLE15
task: c00001f262efcb00 task.stack: c00001f264688000
NIP: c0000000000164c4 LR: c0000000000164c4 CTR: 000000000000aa18
REGS: c00001f26468b570 TRAP: 0901  Tainted: G (4.12.14-23-default)
MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 42042824 XER: 00000000
CFAR: c00000000099829c SOFTE: 1
GPR00: c0000000002d43b8 c00001f26468b7f0 c00000000116a900 0000000000000900
GPR04: c00014fb0fff6410 f0000005075aa860 0000000000000008 0000000000000000
GPR08: c000000007d39d00 00000000800003d8 00000000800003f7 000014fa8e880000
GPR12: 0000000000002200 c000000007d39d00
NIP [c0000000000164c4] arch_local_irq_restore+0x74/0x90
LR [c0000000000164c4] arch_local_irq_restore+0x74/0x90
Call Trace:
[c00001f26468b7f0] [f0000005075a9500] 0xf0000005075a9500 (unreliable)
[c00001f26468b810] [c0000000002d43b8] free_unref_page_list+0x198/0x280
[c00001f26468b870] [c0000000002e1064] release_pages+0x3d4/0x510
[c00001f26468b950] [c000000000343acc] free_pages_and_swap_cache+0x12c/0x160
[c00001f26468b9a0] [c000000000318a88] tlb_flush_mmu_free+0x68/0xa0
[c00001f26468b9e0] [c00000000031c7ac] zap_pte_range+0x30c/0xa40
[c00001f26468bae0] [c00000000031d344] unmap_page_range+0x334/0x6d0
[c00001f26468bbc0] [c00000000031dc84] unmap_vmas+0x94/0x140
[c00001f26468bc10] [c00000000032b478] exit_mmap+0xe8/0x1f0
[c00001f26468bcd0] [c0000000000ff460] mmput+0x80/0x1c0
[c00001f26468bd00] [c000000000109430] do_exit+0x370/0xc70
[c00001f26468bdd0] [c000000000109e00] do_group_exit+0x60/0x100
[c00001f26468be10] [c000000000109ec4] SyS_exit_group+0x24/0x30
[c00001f26468be30] [c00000000000b088] system_call+0x3c/0x12c
Instruction dump:
994d02ba 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
7c0803a6 4e800020 60000000 4bff4165 <60000000> 4bffffe4 60000000 e92d0020

This was triggered on a 32TB node where ~1500 processes, each allocating
10GB, are spawning and exiting in a stress loop.

Since Power uses a 64K page size, MAX_GATHER_BATCH = 8189, so
MAX_GATHER_BATCH_COUNT cannot exceed 1. There is therefore no way to loop
in zap_pte_range() because of the batch count limit, and I guess we never
hit the workaround introduced by commit 53a59fc67f97.
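For what it's worth, here is a quick userspace back-of-the-envelope check
of that arithmetic. It only mirrors the generic sizing from
include/asm-generic/tlb.h as I read it, with an approximated struct
mmu_gather_batch layout and 64-bit pointers, so the exact batch value may
be off by one or two from the 8189 above:

/* Userspace sketch only: approximates the MAX_GATHER_BATCH /
 * MAX_GATHER_BATCH_COUNT sizing for 4K and 64K base pages.
 */
#include <stdio.h>

struct mmu_gather_batch_approx {	/* rough stand-in for the kernel struct */
	void *next;
	unsigned int nr;
	unsigned int max;
	/* struct page *pages[] follows in the real struct */
};

int main(void)
{
	unsigned long page_sizes[] = { 4096, 65536 };	/* 4K vs ppc64 64K */

	for (int i = 0; i < 2; i++) {
		unsigned long ps = page_sizes[i];
		/* MAX_GATHER_BATCH: page pointers fitting in one batch page */
		unsigned long batch = (ps - sizeof(struct mmu_gather_batch_approx))
				      / sizeof(void *);
		/* MAX_GATHER_BATCH_COUNT, as added by commit 53a59fc67f97 */
		unsigned long batch_count = 10000UL / batch;

		printf("PAGE_SIZE=%lu: MAX_GATHER_BATCH=%lu MAX_GATHER_BATCH_COUNT=%lu\n",
		       ps, batch, batch_count);
	}
	return 0;
}

That prints roughly 510/19 for 4K pages and ~8190/1 for 64K pages, i.e.
with 64K pages the batch_count workaround can never trigger.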
By the way, should cond_resched() be called in zap_pte_range() when the
flush is due to the batch's limit? Something like this:

@@ -1338,7 +1345,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
-				force_flush = 1;
+				force_flush = 2;
 				addr += PAGE_SIZE;
 				break;
 			}
@@ -1398,12 +1405,19 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	 * batch buffers or because we needed to flush dirty TLB
 	 * entries before releasing the ptl), free the batched
 	 * memory too. Restart if we didn't do everything.
+	 * In the case the flush was due to the batch buffer's limit,
+	 * give a chance to the other tasks to run, to avoid a soft
+	 * lockup when dealing with a large amount of memory.
 	 */
 	if (force_flush) {
+		bool force_sched = (force_flush == 2);
 		force_flush = 0;
 		tlb_flush_mmu_free(tlb);
-		if (addr != end)
+		if (addr != end) {
+			if (force_sched)
+				cond_resched();
 			goto again;
+		}
 	}

Anyway, this would not fix the soft lockup I'm facing, since
MAX_GATHER_BATCH_COUNT = 1 on ppc64.

Indeed, I'm wondering whether the 10K-page limit is too large in some
cases, especially when the node is loaded and contention on the PTE lock
is likely. Here, soft lockups are surfacing with fewer than 8K pages
processed.

Should the MAX_GATHER_BATCH limit be forced to a lower value on ppc64, or
should more code be introduced to work around that?

Cheers,
Laurent.
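PS: to make that last question a bit more concrete, here is one
hypothetical, completely untested way a cap could look. The
MAX_GATHER_BATCH_RAW name and the 2048 value are made up for
illustration only; this is not an existing kernel knob:

/* Hypothetical sketch only: keep the page-sized batch but cap how many
 * page pointers it may hold, so a single tlb_flush_mmu_free() never has
 * to free a huge 64K-page batch in one go.
 */
#define MAX_GATHER_BATCH_RAW						\
	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))

#define MAX_GATHER_BATCH						\
	(MAX_GATHER_BATCH_RAW > 2048UL ? 2048UL : MAX_GATHER_BATCH_RAW)

With something like that, MAX_GATHER_BATCH_COUNT would be back above 1 on
64K pages, so the batch_count workaround from commit 53a59fc67f97 would
start kicking in again.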