On Mon, Sep 18, 2023 at 09:52:28AM +0800, Baokun Li wrote:
> On 2023/9/17 17:26, Peter Zijlstra wrote:
> > On Sun, Sep 17, 2023 at 11:10:32AM +0200, Peter Zijlstra wrote:
> > > On Sat, Sep 16, 2023 at 02:55:47PM +0800, Baokun Li wrote:
> > > > On 2023/9/13 16:59, Yi Zhang wrote:
> > > > > The issue still can be reproduced on the latest linux tree[2].
> > > > > To reproduce I need to run about 1000 times blktests block/001, and
> > > > > bisect shows it was introduced with commit[1], as it was not 100%
> > > > > reproduced, not sure if it's the culprit?
> > > > >
> > > > >
> > > > > [1] 9257959a6e5b locking/atomic: scripts: restructure fallback ifdeffery
> > > > Hello, everyone!
> > > >
> > > > We have confirmed that the merge-in of this patch caused hlist_bl_lock
> > > > (aka, bit_spin_lock) to fail, which in turn triggered the issue above.
> > > > [root@localhost ~]# insmod mymod.ko
> > > > [ 37.994787][ T621] >>> a = 725, b = 724
> > > > [ 37.995313][ T621] ------------[ cut here ]------------
> > > > [ 37.995951][ T621] kernel BUG at fs/mymod/mymod.c:42!
> > > > [root@localhost ~]#
> > > > [ 37.996461][ T621] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > > > [ 37.997420][ T621] Modules linked in: mymod(E)
> > > > [ 37.997891][ T621] CPU: 9 PID: 621 Comm: bl_lock_thread2 Tainted: G E 6.4.0-rc2-00034-g9257959a6e5b-dirty #117
> > > > [ 37.999038][ T621] Hardware name: linux,dummy-virt (DT)
> > > > [ 37.999571][ T621] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [ 38.000344][ T621] pc : increase_ab+0xcc/0xe70 [mymod]
> > > > [ 38.000882][ T621] lr : increase_ab+0xcc/0xe70 [mymod]
> > > > [ 38.001416][ T621] sp : ffff800008b4be40
> > > > [ 38.001822][ T621] x29: ffff800008b4be40 x28: 0000000000000000 x27: 0000000000000000
> > > > [ 38.002605][ T621] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
> > > > [ 38.003385][ T621] x23: ffffd9930c698190 x22: ffff800008a0ba38 x21: 0000000000000001
> > > > [ 38.004174][ T621] x20: ffffffffffffefff x19: ffffd9930c69a580 x18: 0000000000000000
> > > > [ 38.004955][ T621] x17: 0000000000000000 x16: ffffd9933011bd38 x15: ffffffffffffffff
> > > > [ 38.005754][ T621] x14: 0000000000000000 x13: 205d313236542020 x12: ffffd99332175b80
> > > > [ 38.006538][ T621] x11: 0000000000000003 x10: 0000000000000001 x9 : ffffd9933022a9d8
> > > > [ 38.007325][ T621] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : ffffd993320b5b40
> > > > [ 38.008124][ T621] x5 : ffff0001f7d1c708 x4 : 0000000000000000 x3 : 0000000000000000
> > > > [ 38.008912][ T621] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000015
> > > > [ 38.009709][ T621] Call trace:
> > > > [ 38.010035][ T621] increase_ab+0xcc/0xe70 [mymod]
> > > > [ 38.010539][ T621] kthread+0xdc/0xf0
> > > > [ 38.010927][ T621] ret_from_fork+0x10/0x20
> > > > [ 38.011370][ T621] Code: 17ffffe0 90000020 91044000 9400000d (d4210000)
> > > > [ 38.012067][ T621] ---[ end trace 0000000000000000 ]---
> > > Is this arm64 or something? You seem to have forgotten to mention what
> > > platform you're using.
> > Is that an LSE or LLSC arm64 ?
>
> I'm not sure how to distinguish if it's LSE or LLSC, here's some info on the
> cpu:
>
> $ cat /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
> 0x00000000481fd010
>
> $ lscpu
> Architecture:        aarch64
> Byte Order:          Little Endian
> CPU(s):              96
> On-line CPU(s) list: 0-95
> Thread(s) per core:  1
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        4
> Vendor ID:           HiSilicon
> BIOS Vendor ID:      HiSilicon
> Model:               0
> Model name:          Kunpeng-920
> BIOS Model name:     Kunpeng 920-4826
> Stepping:            0x1
> BogoMIPS:            200.00
> L1d cache:           64K
> L1i cache:           64K
> L2 cache:            512K
> L3 cache:            49152K
> NUMA node0 CPU(s):   0-23
> NUMA node1 CPU(s):   24-47
> NUMA node2 CPU(s):   48-71
> NUMA node3 CPU(s):   72-95
> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
>                      asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
>
> > Anyway, it seems that ARM64 shouldn't be using the fallback as it does
> > everything itself.
> >
> > Mark, can you have a look please? At first glance the
> > atomic64_fetch_or_acquire() that's being used by generic bitops/lock.h
> > seems in order..
> >
> We also suspect some implicit mechanism change in
> raw_atomic64_fetch_or_acquire. You can reproduce the problem with the
> above mod that can reproduce the problem to make it easier to locate.
> I can help reproduce it and grab some information if you can't reproduce
> it on your end.

FWIW this looks a lot like the crash I reported last week:
https://lore.kernel.org/linux-fsdevel/ZQep0OR0uMmR%2Fwg3@xxxxxxxxxxxxxxxxxxx/T/#t

Also arm64, but virtualized. I /think/ the host is some Ampere box,
though I have no idea what kind since it's just some Oracle Cloud A1
instance. The internet claims "Ampere Altra" processors[1].

# lscpu
Architecture:            aarch64
CPU op-mode(s):          32-bit, 64-bit
Byte Order:              Little Endian
CPU(s):                  2
On-line CPU(s) list:     0,1
Vendor ID:               ARM
Model name:              Neoverse-N1
Model:                   1
Thread(s) per core:      1
Core(s) per socket:      2
Socket(s):               1
Stepping:                r3p1
BogoMIPS:                50.00
Flags:                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0,1
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; __user pointer sanitization
  Spectre v2:            Mitigation; CSV2, but not BHB
  Srbds:                 Not affected
  Tsx async abort:       Not affected

[1] https://www.oracle.com/cloud/compute/arm/

--D

> --
> With Best Regards,
> Baokun Li
> .
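
On the "LSE or LLSC" question quoted above: on arm64 Linux the "atomics"
entry in the lscpu Flags line (the Features line in /proc/cpuinfo) is the
hwcap for the ARMv8.1 LSE atomic instructions, and both lscpu dumps in
this thread list it. Below is a minimal sketch of that check. The helper
names (has_lse_atomics, cpuinfo_features) are made up for illustration and
are not part of any existing tool. The flag only says the CPU implements
LSE; whether the kernel actually uses the LSE variants also depends on how
it was built (CONFIG_ARM64_LSE_ATOMICS), which this sketch does not check.

```python
def has_lse_atomics(flags: str) -> bool:
    """Return True if the arm64 'atomics' hwcap is present in a flags string.

    `flags` is the space-separated value of the lscpu "Flags" line or the
    /proc/cpuinfo "Features" line. Matching is on whole tokens only.
    """
    return "atomics" in flags.split()


def cpuinfo_features(path: str = "/proc/cpuinfo") -> str:
    """Return the value of the first 'Features' line, or '' if unavailable.

    Hypothetical helper: arm64 kernels print a 'Features' line per CPU;
    other architectures use different keys, in which case '' is returned.
    """
    try:
        with open(path) as f:
            for line in f:
                key, sep, value = line.partition(":")
                if sep and key.strip().lower() == "features":
                    return value.strip()
    except OSError:
        pass
    return ""


if __name__ == "__main__":
    if has_lse_atomics(cpuinfo_features()):
        print("CPU advertises LSE atomics")
    else:
        print("no 'atomics' flag found (LL/SC only, or not arm64)")
```

Feeding it either Flags line from this thread returns True, so both the
Kunpeng-920 and the Neoverse-N1 guest advertise LSE.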