Hi,
hoping this is a convenient solution for you, I have attached a compressed archive with a pair of patches. The first patch just performs a lot of checks while BFQ runs (BUG_ONs must be enabled for this to work), while the second patch is a tentative fix.

Looking forward to your feedback,
Paolo
Attachment:
debug_patches_for_5.6.tgz
Description: Binary data
> On 9 Mar 2020, at 18:09, Rachel Sibley <rasibley@xxxxxxxxxx> wrote:
>
> On 3/9/20 12:42 PM, Paolo Valente wrote:
>> Hi Rachel,
>> IIUC, you can reproduce this bug reliably. If so, I'd need you to test a debugging patch (on top of one of the offending kernels).
>
> Hi Paolo,
>
> Yes seems we have seen it pretty consistently in the last three reports, but I'm cloning the job to be sure we can
> reproduce reliably. In the mean time, feel free to send me a pointer to your debugging patch so I can retry with
> the patch applied.
>
> Thank you,
> Rachel
>
>> Looking forward to your feedback,
>> Paolo
>>> On 9 Mar 2020, at 15:27, Rachel Sibley <rasibley@xxxxxxxxxx> wrote:
>>>
>>> (cc'ing linux-block@xxxxxxxxxxxxxxx)
>>>
>>> Hello,
>>>
>>> We are seeing a kernel panic triggered with LTP and xfstests against a recent commit for mainline,
>>> wanted to share in case it's not already known.
>>>
>>> Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>> Commit: 61a09258f2e5 - Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
>>>
>>> We have also seen it with 2c523b344dfa and 378fee2e6b12 commits as well.
>>>
>>> LTP: https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/03/08/477469/x86_64_1_console.log
>>> xfstests: https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/03/08/477469/x86_64_4_console.log
>>>
>>> [-- MARK -- Sun Mar 8 02:45:00 2020]
>>> [ 762.315610] BUG: kernel NULL pointer dereference, address: 0000000000000158
>>> [ 762.323385] #PF: supervisor read access in kernel mode
>>> [ 762.329119] #PF: error_code(0x0000) - not-present page
>>> [ 762.334853] PGD 0 P4D 0
>>> [ 762.337680] Oops: 0000 [#1] SMP PTI
>>> [ 762.341575] CPU: 9 PID: 87 Comm: kworker/9:1 Not tainted 5.6.0-rc4-61a0925.cki #1
>>> [ 762.349927] Hardware name: Cisco Systems, Inc. UCS-E160DP-M1/K9/UCS-E160DP-M1/K9, BIOS UCSED.1.5.0.2.051520131757 05/15/2013
>>> [ 762.362453] Workqueue: cgroup_destroy css_killed_work_fn
>>> [ 762.368387] RIP: 0010:bfq_bfqq_expire+0x1c/0x940
>>> [ 762.373540] Code: 01 00 00 c7 80 f8 00 00 00 01 00 00 00 c3 66 66 66 66 90 41 57 41 56 41 55 41 54 41 89 cc 55 48 89 fd 53 48 89 f3 48 83 ec 28 <8b> be 58 01 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 20 31 c0
>>> [ 762.394500] RSP: 0018:ffff9927c03bbd50 EFLAGS: 00010086
>>> [ 762.400331] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
>>> [ 762.408301] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8965a3913800
>>> [ 762.416270] RBP: ffff8965a3913800 R08: ffff896592d41098 R09: ffff89657aa8df00
>>> [ 762.424233] R10: 0000000000000000 R11: ffff89657aa8df00 R12: 0000000000000004
>>> [ 762.432200] R13: ffff89659f0cd9b0 R14: ffff8965a3913bf0 R15: ffff89659f0cd898
>>> [ 762.440175] FS: 0000000000000000(0000) GS:ffff8965a7c40000(0000) knlGS:0000000000000000
>>> [ 762.449211] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 762.455622] CR2: 0000000000000158 CR3: 000000065afc6003 CR4: 00000000000606e0
>>> [ 762.463599] Call Trace:
>>> [ 762.466341] ? bfq_idle_extract+0x40/0xb0
>>> [ 762.470821] bfq_bfqq_move+0x14f/0x160
>>> [ 762.475011] bfq_pd_offline+0xd3/0xf0
>>> [ 762.479112] blkg_destroy+0x52/0xf0
>>> [ 762.483005] blkcg_destroy_blkgs+0x4f/0xa0
>>> [ 762.487582] css_killed_work_fn+0x4d/0xd0
>>> [ 762.492066] process_one_work+0x1b5/0x360
>>> [ 762.496547] worker_thread+0x50/0x3c0
>>> [ 762.500641] kthread+0xf9/0x130
>>> [ 762.504153] ? process_one_work+0x360/0x360
>>> [ 762.508813] ? kthread_park+0x90/0x90
>>> [ 762.512909] ret_from_fork+0x35/0x40
>>>
>>> Thanks,
>>> Rachel
>>>
>>> On 3/7/20 9:59 PM, CKI Project wrote:
>>>> Hello,
>>>> We ran automated tests on a recent commit from this kernel tree:
>>>> Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>>> Commit: 61a09258f2e5 - Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
>>>> The results of these automated tests are provided below.
>>>> Overall result: FAILED (see details below)
>>>> Merge: OK
>>>> Compile: OK
>>>> Tests: FAILED
>>>> All kernel binaries, config files, and logs are available for download here:
>>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/03/08/477469
>>>> One or more kernel tests failed:
>>>> x86_64:
>>>> ❌ LTP
>>>> ❌ xfstests - ext4
>>>> We hope that these logs can help you find the problem quickly. For the full
>>>> detail on our testing procedures, please scroll to the bottom of this message.
>>>> Please reply to this email if you have any questions about the tests that we
>>>> ran or if you have any suggestions on how to make future tests more effective.
>>>>  ,-.   ,-.
>>>> ( C ) ( K )  Continuous
>>>>  `-',-.`-'   Kernel
>>>>    ( I )     Integration
>>>>     `-'
>>>> ______________________________________________________________________________
>>>> Compile testing
>>>> ---------------
>>>> We compiled the kernel for 1 architecture:
>>>> x86_64:
>>>> make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
>>>> Hardware testing
>>>> ----------------
>>>> We booted each kernel and ran the following tests:
>>>> x86_64:
>>>> Host 1:
>>>> ✅ Boot test
>>>> ✅ Podman system integration test - as root
>>>> ✅ Podman system integration test - as user
>>>> ❌ LTP
>>>> ⚡⚡⚡ Loopdev Sanity
>>>> ⚡⚡⚡ Memory function: memfd_create
>>>> ⚡⚡⚡ AMTU (Abstract Machine Test Utility)
>>>> ⚡⚡⚡ Networking bridge: sanity
>>>> ⚡⚡⚡ Ethernet drivers sanity
>>>> ⚡⚡⚡ Networking MACsec: sanity
>>>> ⚡⚡⚡ Networking socket: fuzz
>>>> ⚡⚡⚡ Networking sctp-auth: sockopts test
>>>> ⚡⚡⚡ Networking: igmp conformance test
>>>> ⚡⚡⚡ Networking route: pmtu
>>>> ⚡⚡⚡ Networking route_func - local
>>>> ⚡⚡⚡ Networking route_func - forward
>>>> ⚡⚡⚡ Networking TCP: keepalive test
>>>> ⚡⚡⚡ Networking UDP: socket
>>>> ⚡⚡⚡ Networking tunnel: geneve basic test
>>>> ⚡⚡⚡ Networking tunnel: gre basic
>>>> ⚡⚡⚡ L2TP basic test
>>>> ⚡⚡⚡ Networking tunnel: vxlan basic
>>>> ⚡⚡⚡ Networking ipsec: basic netns - transport
>>>> ⚡⚡⚡ Networking ipsec: basic netns - tunnel
>>>> ⚡⚡⚡ audit: audit testsuite test
>>>> ⚡⚡⚡ httpd: mod_ssl smoke sanity
>>>> ⚡⚡⚡ tuned: tune-processes-through-perf
>>>> ⚡⚡⚡ pciutils: sanity smoke test
>>>> ⚡⚡⚡ ALSA PCM loopback test
>>>> ⚡⚡⚡ ALSA Control (mixer) Userspace Element test
>>>> ⚡⚡⚡ storage: SCSI VPD
>>>> ⚡⚡⚡ trace: ftrace/tracer
>>>> 🚧 ⚡⚡⚡ CIFS Connectathon
>>>> 🚧 ⚡⚡⚡ POSIX pjd-fstest suites
>>>> 🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
>>>> 🚧 ⚡⚡⚡ jvm - jcstress tests
>>>> 🚧 ⚡⚡⚡ Memory function: kaslr
>>>> 🚧 ⚡⚡⚡ LTP: openposix test suite
>>>> 🚧 ⚡⚡⚡ Networking vnic: ipvlan/basic
>>>> 🚧 ⚡⚡⚡ iotop: sanity
>>>> 🚧 ⚡⚡⚡ Usex - version 1.9-29
>>>> 🚧 ⚡⚡⚡ storage: dm/common
>>>> Host 2:
>>>> ✅ Boot test
>>>> ✅ Storage SAN device stress - mpt3sas driver
>>>> Host 3:
>>>> ✅ Boot test
>>>> ✅ Storage SAN device stress - megaraid_sas
>>>> Host 4:
>>>> ✅ Boot test
>>>> ❌ xfstests - ext4
>>>> ⚡⚡⚡ xfstests - xfs
>>>> ⚡⚡⚡ selinux-policy: serge-testsuite
>>>> ⚡⚡⚡ lvm thinp sanity
>>>> ⚡⚡⚡ storage: software RAID testing
>>>> ⚡⚡⚡ stress: stress-ng
>>>> 🚧 ⚡⚡⚡ IOMMU boot test
>>>> 🚧 ⚡⚡⚡ IPMI driver test
>>>> 🚧 ⚡⚡⚡ IPMItool loop stress test
>>>> 🚧 ⚡⚡⚡ power-management: cpupower/sanity test
>>>> 🚧 ⚡⚡⚡ Storage blktests
>>>> Test sources: https://github.com/CKI-project/tests-beaker
>>>> 💚 Pull requests are welcome for new tests or improvements to existing tests!
>>>> Waived tests
>>>> ------------
>>>> If the test run included waived tests, they are marked with 🚧. Such tests are
>>>> executed but their results are not taken into account. Tests are waived when
>>>> their results are not reliable enough, e.g. when they're just introduced or are
>>>> being fixed.
>>>> Testing timeout
>>>> ---------------
>>>> We aim to provide a report within reasonable timeframe. Tests that haven't
>>>> finished running yet are marked with ⏱.
>>>
>