On Sat, Mar 06, 2021 at 07:15:24PM -0700, Hillf Danton wrote: > On Fri, 5 Mar 2021 18:01:04 +0800 Ming Lei wrote: > > On Fri, Mar 05, 2021 at 10:32:04AM +0100, Paolo Valente wrote: > > > I'm thinking of a way to debug this too. The symptom may hint at a > > > use-after-free. Could you enable KASAN in your tests? (On the flip > > > side, I know this might change timings, thereby making the fault > > > disappear). > > > > I have asked our QE to reproduce the issue with debug kernel, which may take a > > while. And I can't trigger it in my box. > > > > BTW, for the 2nd 'kernel NULL pointer dereference', the RIP points to: > > > > (gdb) l *(__bfq_deactivate_entity+0x5b) > > 0xffffffff814c31cb is in __bfq_deactivate_entity (block/bfq-wf2q.c:1181). > > 1176 * bfq_group_set_parent has already been invoked for the group > > 1177 * represented by entity. Therefore, the field > > 1178 * entity->sched_data has been set, and we can safely use it. > > 1179 */ > > 1180 st = bfq_entity_service_tree(entity); > > 1181 is_in_service = entity == sd->in_service_entity; > > 1182 > > 1183 bfq_calc_finish(entity, entity->service); > > 1184 > > 1185 if (is_in_service) > > > > Seems entity->sched_data points to NULL. > > Hi Ming, > > Thanks for your report. > > Given the invalid pointer cannot explain line 1180, you are reporting > a different issue from what Mike reported, and we can do nothing now > for both without a reproducer. BTW, we get this report 2 times on 5.12 kernel, following the kernel log, and this time there is hard LOCKUP. [ 337.526984] systemd-shutdown[1]: Not all DM devices detached, 1 left. [ 337.526988] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing. [ 337.531043] systemd-shutdown[1]: Successfully changed into root pivot. [ 337.531046] systemd-shutdown[1]: Returning to initrd... [ 337.533136] watchdog: watchdog0: watchdog did not stop! [ 337.569177] dracut Warning: Killing all remaining processes [ 337.706605] XFS (dm-0): Unmounting Filesystem [ 351.593888] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2 [ 351.593890] Modules linked in: dm_multipath rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc fscache rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 dcdbas iTCO_wdt irqbypass i2c_algo_bit iTCO_vendor_support rapl drm_kms_helper intel_cstate syscopyarea sysfillrect sysimgblt fb_sys_fops cec intel_uncore pcspkr ipmi_ssif mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse ip_tables xfs libcrc32c sd_mod qla2xxx ahci libahci nvme_fc crct10dif_pclmul crc32_pclmul crc32c_intel nvme_fabrics libata ghash_clmulni_intel tg3 nvme_core megaraid_sas t10_pi scsi_transport_fc wmi dm_mirror dm_region_hash dm_log dm_mod [ 351.593929] CPU: 2 PID: 95 Comm: kworker/2:1 Kdump: loaded Tainted: G X --------- --- 5.12.0-1.el9.x86_64 #1 [ 351.593930] Hardware name: Dell Inc. PowerEdge R430/0HFG24, BIOS 1.6.2 01/08/2016 [ 351.593931] Workqueue: cgwb_release cgwb_release_workfn [ 351.593932] RIP: 0010:rb_prev+0x18/0x50 [ 351.593933] Code: 31 c0 eb db 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 17 48 39 d7 74 35 48 8b 47 10 48 85 c0 74 1c 49 89 c0 48 8b 40 08 <48> 85 c0 75 f4 4c 89 c0 c3 48 3b 78 10 75 f6 48 8b 10 48 89 c7 48 [ 351.593934] RSP: 0018:ffffb7280048fd70 EFLAGS: 00000086 [ 351.593935] RAX: ffff98bc30f448a0 RBX: ffff98bc10d1e150 RCX: 0000000000000014 [ 351.593936] RDX: 0000000000000001 RSI: ffff98bc00b39098 RDI: ffff98bc00b39098 [ 351.593937] RBP: ffff98bc00b39098 R08: ffff98bc30f448a0 R09: 0000000000000000 [ 351.593938] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 351.593939] R13: 0000000000000001 R14: ffff98bc10d1e110 R15: 0000000000000000 [ 351.593940] FS: 0000000000000000(0000) GS:ffff98c37fa80000(0000) knlGS:0000000000000000 [ 351.593941] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 351.593941] CR2: 00007fffeeceaea0 CR3: 0000000105b40003 CR4: 00000000001706e0 [ 351.593942] Call Trace: [ 351.593943] bfq_idle_extract+0x98/0xb0 [ 351.593943] __bfq_deactivate_entity+0x224/0x240 [ 351.593944] bfq_pd_offline+0xaa/0xd0 [ 351.593945] blkg_destroy+0x52/0xf0 [ 351.593945] blkcg_destroy_blkgs+0x46/0xc0 [ 351.593946] cgwb_release_workfn+0xbe/0x150 [ 351.593947] process_one_work+0x1e6/0x380 [ 351.593947] worker_thread+0x53/0x3d0 [ 351.593948] ? process_one_work+0x380/0x380 [ 351.593949] kthread+0x11b/0x140 [ 351.593949] ? kthread_associate_blkcg+0xa0/0xa0 [ 351.593950] ret_from_fork+0x22/0x30 [ 351.593950] Kernel panic - not syncing: Hard LOCKUP Thanks, Ming