Sure Eric,

Sending the details again.

I run a very heavy traffic load (iSCSI storage-related I/O) on GCP.
After one day of running, my kernel got stuck in one of two typical ways, both involving a page fault:
1) Soft lockup, as described in the first typical case.
2) Panic, as described in the second typical case.

First typical case (the soft lockup happens on several CPUs):

Feb 21 07:38:52 c-node02 kernel: [242408.563170] ? flush_tlb_func_common.constprop.10+0x250/0x250
Feb 21 07:38:52 c-node02 kernel: [242408.563171] on_each_cpu_mask+0x23/0x60
Feb 21 07:38:52 c-node02 kernel: [242408.563173] ? x86_configure_nx+0x40/0x40
Feb 21 07:38:52 c-node02 kernel: [242408.563174] on_each_cpu_cond_mask+0xa0/0xd0
Feb 21 07:38:52 c-node02 kernel: [242408.563175] ? flush_tlb_func_common.constprop.10+0x250/0x250
Feb 21 07:38:52 c-node02 kernel: [242408.563177] flush_tlb_mm_range+0xbc/0xf0
Feb 21 07:38:52 c-node02 kernel: [242408.563179] ptep_clear_flush+0x40/0x50
Feb 21 07:38:52 c-node02 kernel: [242408.563180] try_to_unmap_one+0x2ae/0xae0
Feb 21 07:38:52 c-node02 kernel: [242408.563184] ? mutex_lock+0xe/0x30
Feb 21 07:38:52 c-node02 kernel: [242408.563186] rmap_walk_anon+0x13a/0x2c0
Feb 21 07:38:52 c-node02 kernel: [242408.563188] try_to_unmap+0x9c/0xf0
Feb 21 07:38:52 c-node02 kernel: [242408.563190] ? page_remove_rmap+0x330/0x330
Feb 21 07:38:52 c-node02 kernel: [242408.563192] ? page_not_mapped+0x20/0x20
Feb 21 07:38:52 c-node02 kernel: [242408.563193] ? page_get_anon_vma+0x80/0x80
Feb 21 07:38:52 c-node02 kernel: [242408.563195] ? invalid_mkclean_vma+0x20/0x20
Feb 21 07:38:52 c-node02 kernel: [242408.563196] migrate_pages+0x3cd/0xc80
Feb 21 07:38:52 c-node02 kernel: [242408.563197] ? do_pages_stat+0x180/0x180
Feb 21 07:38:52 c-node02 kernel: [242408.563198] migrate_misplaced_page+0x15e/0x270
Feb 21 07:38:52 c-node02 kernel: [242408.563200] __handle_mm_fault+0xd80/0x12f0
Feb 21 07:38:52 c-node02 kernel: [242408.563202] handle_mm_fault+0xc2/0x1f0
Feb 21 07:38:52 c-node02 kernel: [242408.563204] __do_page_fault+0x23e/0x4f0
Feb 21 07:38:52 c-node02 kernel: [242408.563206] do_page_fault+0x30/0x110
Feb 21 07:38:52 c-node02 kernel: [242408.563207] page_fault+0x3e/0x50
Feb 21 07:38:52 c-node02 kernel: [242408.563209] RIP: 0033:0x7f27fffb9e73
Feb 21 07:38:52 c-node02 kernel: [242408.563211] Code: 89 6d e8 48 89 fb 4c 89 75 f0 4c 89 7d f8 49 89 f6 4c 89 65 e0 48 81 ec c0 06 00 00 4c 8b 3d 3c a1 34 00 49 89 d5 64 41 8b 07 <89> 85 dc fa ff ff 8b 87 c0 00 00 00 85 c0 0f 85 b9 01 00 00 c7 87
Feb 21 07:38:52 c-node02 kernel: [242408.563211] RSP: 002b:00007f12a37fda10 EFLAGS: 00010202
Feb 21 07:38:52 c-node02 kernel: [242408.563213] RAX: 0000000000000000 RBX: 00007f12a37fe0e0 RCX: 0000000000000000
Feb 21 07:38:52 c-node02 kernel: [242408.563214] RDX: 00007f12a37fe200 RSI: 00000000017a9453 RDI: 00007f12a37fe0e0
Feb 21 07:38:52 c-node02 kernel: [242408.563214] RBP: 00007f12a37fe0d0 R08: 0000000000000000 R09: 00000000017c7550
Feb 21 07:38:52 c-node02 kernel: [242408.563215] R10: 0000000000000000 R11: 00000000000003f8 R12: 00000000017a9453
Feb 21 07:38:52 c-node02 kernel: [242408.563216] R13: 00007f12a37fe200 R14: 00000000017a9453 R15: fffffffffffffe90
Feb 21 07:38:52 c-node02 kernel: [242408.604094] watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [km_target_creat:49068]
Feb 21 07:38:52 c-node02 kernel: [242408.604095] Modules linked in: iscsi_scst(OE) crc32c_intel(O) scst_local(OE) netconsole(O) scst_user(OE) scst(OE) drbd(O) lru_cache(O) loop(O) 8021q(O) mrp(O) garp(O) nfsd(O) nfs_acl(O) auth_rpcgss(O) lockd(O) sunrpc(O) grace(O) xt_MASQUERADE(O) xt_nat(O) xt_state(O) iptable_nat(O) xt_addrtype(O) xt_conntrack(O) nf_nat(O) nf_conntrack(O) nf_defrag_ipv4(O) nf_defrag_ipv6(O) libcrc32c(O) br_netfilter(O) bridge(O) stp(O) llc(O) overlay(O) be2iscsi(O) iscsi_boot_sysfs(O) bnx2i(O) cnic(O) uio(O) cxgb4i(O) cxgb4(O) cxgb3i(O) libcxgbi(O) cxgb3(O) mdio(O) libcxgb(O) ib_iser(OE) iscsi_tcp(O) libiscsi_tcp(O) libiscsi(O) scsi_transport_iscsi(O) dm_multipath(O) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mdev(OE) mlxfw(OE) ptp(O) pps_core(O) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) mlx_compat(OE) fuse(O) binfmt_misc(O) pvpanic(O) pcspkr(O) virtio_rng(O) virtio_net(O) net_failover(O) failover(O) i2 :

Second typical case, PANIC:

From the console:

[123080.813877] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 0.000000] Linux version 5.4.80-KM8 (david.mozes@kbuilder64-tc8-test1) (gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)) #14 SMP Mon Jan 11 16:21:21 IST 2021 Mon Jan 11 16:21:21 IST 2021
[ 0.000000] Command line: ro root=LABEL=/ rd_NO_LUKS KEYBOARDTYPE=pc KEYTABLE=us LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 nompath append="nmi_watchdog=2"

From the vmcore-dmesg:

[121271.606463] ll header: 00000000: 42 01 0a ad 0c 02 42 01 0a ad 0c 01 08 00
[122656.730235] sh (27931): drop_caches: 3
[123080.813877] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[123080.813887] sched: RT throttling activated
[123080.821706] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: serial8250_console_write+0x26e/0x270

After I comment out the cond_resched(), everything looks more stable.
I will try another run, as Eric suggests, with the cond_resched() placed before the invalidate_mapping_pages() call, and will report on the behavior.
I think we have a very stressful environment on GCP for testing that.

Thx
David

-----Original Message-----
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Sent: Wednesday, March 17, 2021 3:28 AM
To: David Mozes <david.mozes@xxxxxxx>; linux-fsdevel@xxxxxxxxxxxxxxx
Cc: sandeen@xxxxxxxxxx
Subject: Re: fs: avoid softlockups in s_inodes iterators commit

On 3/16/21 3:56 PM, David Mozes wrote:
> Hi,
> Per Eric's request, I forward this discussion to the list first.
> My first answers are inside

ok, but you stripped out all of the other useful information like backtraces, stack corruption, etc. You need to provide the evidence of the actual failure for the list to see.

Also ..

> -----Original Message-----
> From: Eric Sandeen <sandeen@xxxxxxxxxx>
> Sent: Tuesday, March 16, 2021 10:18 PM
> To: David Mozes <david.mozes@xxxxxxx>
> Subject: Re: Mail from David.Mozes regarding fs: avoid softlockups in
> s_inodes iterators commit
>
> On 3/16/21 3:02 PM, David Mozes wrote:
>> Hi Eric,
>>

...

> David > Not sure yet, Will check.

>> 5.4.8 vanilla kernel it custom
>
> Is it vanilla, or is it custom? 5.4.8 or 5.4.80?
>
> David> 5.4.80 small custom as I mantion.

what is a "small custom?"

Can you reproduce it on an unmodified upstream kernel?

-Eric