On 2022-09-21 16:37, Logan Gunthorpe wrote:
>
>
> On 2022-09-21 15:33, Song Liu wrote:
>> Hi Jens,
>>
>> Please consider pulling the following changes for md-next on top of your
>> for-6.1/block branch (the for-6.1/drivers branch doesn't exist yet).
>>
>> The major changes are:
>>
>> 1. Various raid5 fixes and cleanups, by Logan Gunthorpe and David Sloan.
>> 2. Raid10 performance optimization, by Yu Kuai.
>> 3. Generate CHANGE uevents for md device, by Mateusz Grzonka.
>
> I may have hit a bug with my tests on the latest md-next branch. I'm
> still trying to reproduce it. The last tests I ran for several days were
> with some patches on top of the previous md-next branch, but without
> Mateusz's changes, and it also looks like the branch was rebased today,
> so it could be caused by either of those things. I'll let you know when
> I know more.

Yes, ok, I've found two separate issues, and both are fixed by reverting
21023a82bff7 ("md: generate CHANGE uevents for md device").

I suggest we drop that patch for this cycle so we can sort them out.

The issues are:

1) The concrete issue comes when running mdadm test 01r1fail. I get the
kernel bugs at the end of this email. It seems we cannot call
kobject_uevent() in at least one of the contexts that md_new_event() is
called in, because it sleeps while in a critical section (one possible
workaround is sketched below).

2) With our custom test suite, which creates and destroys arrays, adds
and removes disks, and runs data through them repeatedly, I randomly
start seeing this warning:

  mdadm: Fail to create md0 when using /sys/module/md_mod/parameters/new_array, fallback to creation via node

And then, very occasionally, that warning paired with this error:

  mdadm: unexpected failure opening /dev/md0

which stops the test because it fails to create an array. I also see a
lot of the same bugs as below, so it may be related.
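For issue 1, if the uevent is worth keeping, one option might be to punt
the emission to a workqueue so md_new_event() stays safe to call from
atomic/RCU context. An untested sketch of that idea (uevent_work and
md_uevent_fn are made-up names here; the work_struct would have to be
added to struct mddev and initialized with INIT_WORK() when the mddev is
set up):

static void md_uevent_fn(struct work_struct *ws)
{
	struct mddev *mddev = container_of(ws, struct mddev, uevent_work);

	/*
	 * kobject_uevent() allocates with GFP_KERNEL and takes
	 * uevent_sock_mutex, so it can only run in process context.
	 */
	kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE);
}

void md_new_event(struct mddev *mddev)
{
	/* ... existing wakeups ... */

	/*
	 * Callers such as md_error() may hold rcu_read_lock(), so
	 * defer the uevent instead of emitting it here.
	 */
	schedule_work(&mddev->uevent_work);
}

schedule_work() doesn't sleep, so this should be fine from the
md_ioctl() path shown in the traces below. But reverting still seems
like the right call for this cycle.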
Logan

--

BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 853, name: mdadm
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
1 lock held by mdadm/853:
 #0: ffffffff98c623c0 (rcu_read_lock){....}-{1:2}, at: md_ioctl+0x8f0/0x2670
CPU: 2 PID: 853 Comm: mdadm Not tainted 6.0.0-rc2-eid-vmlocalyes-dbg-00096-g9859e343daaf #2680
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x5a/0x74
 dump_stack+0x10/0x12
 __might_resched.cold+0x146/0x17e
 __might_sleep+0x66/0xc0
 kmem_cache_alloc_trace+0x2f8/0x400
 kobject_uevent_env+0x121/0xa30
 kobject_uevent+0xb/0x10
 md_new_event+0x6b/0x80
 md_error+0x168/0x1b0
 md_ioctl+0x989/0x2670
 blkdev_ioctl+0x24d/0x450
 __x64_sys_ioctl+0xc0/0x100
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

=============================
[ BUG: Invalid wait context ]
6.0.0-rc2-eid-vmlocalyes-dbg-00096-g9859e343daaf #2680 Tainted: G W
-----------------------------
mdadm/853 is trying to lock:
ffffffff990e4950 (uevent_sock_mutex){+.+.}-{3:3}, at: kobject_uevent_env+0x460/0xa30
other info that might help us debug this:
context-{4:4}
1 lock held by mdadm/853:
 #0: ffffffff98c623c0 (rcu_read_lock){....}-{1:2}, at: md_ioctl+0x8f0/0x2670
stack backtrace:
CPU: 2 PID: 853 Comm: mdadm Tainted: G W 6.0.0-rc2-eid-vmlocalyes-dbg-00096-g9859e343daaf #2680
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x5a/0x74
 dump_stack+0x10/0x12
 __lock_acquire.cold+0x2f2/0x31a
 lock_acquire+0x183/0x440
 __mutex_lock+0x125/0xe20
 mutex_lock_nested+0x1b/0x20
 kobject_uevent_env+0x460/0xa30
 kobject_uevent+0xb/0x10
 md_new_event+0x6b/0x80
 md_error+0x168/0x1b0
 md_ioctl+0x989/0x2670
 blkdev_ioctl+0x24d/0x450
 __x64_sys_ioctl+0xc0/0x100
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0