Re: A crash caused by the commit 0dd84b319352bb8ba64752d4e45396d8b13e6018

Guoqing Jiang <guoqing.jiang@xxxxxxxxx> · Thu, 3 Nov 2022 15:28:55 +0800

On 11/3/22 11:47 AM, Guoqing Jiang wrote:
[   78.491429] <TASK>
[   78.491640]  clone_endio+0xf4/0x1c0 [dm_mod]
[   78.492072]  clone_endio+0xf4/0x1c0 [dm_mod]

The clone_endio belongs to "clone" target_type.

Hmm, it could be the "clone_endio" from dm.c rather than the one in
dm-clone-target.c.

[   78.492505] __submit_bio+0x76/0x120
[   78.492859]  submit_bio_noacct_nocheck+0xb6/0x2a0
[   78.493325]  flush_expired_bios+0x28/0x2f [dm_delay]

This is the "delay" target_type. Could you shed some light on how the two
targets connect with dm-raid? I only have shallow knowledge of dm ...

[   78.493808] process_one_work+0x1b4/0x300
[   78.494211]  worker_thread+0x45/0x3e0
[   78.494570]  ? rescuer_thread+0x380/0x380
[   78.494957]  kthread+0xc2/0x100
[   78.495279]  ? kthread_complete_and_exit+0x20/0x20
[   78.495743]  ret_from_fork+0x1f/0x30
[   78.496096]  </TASK>
[   78.496326] Modules linked in: brd dm_delay dm_raid dm_mod 
af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev 
tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon 
rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button 
mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy 
async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod 
t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg 
scsi_common [last unloaded: brd]
[   78.500425] CR2: 0000000000000000
[   78.500752] ---[ end trace 0000000000000000 ]---
[   78.501214] RIP: 0010:mempool_free+0x47/0x80

BTW, is the mempool_free from endio -> dec_count -> complete_io?

I guess it is "mempool_free(io, &io->client->pool)". The pool is freed by
dm_io_client_destroy, and dm-raid does not seem to be responsible for
either creating or destroying the pool.

And the io that caused the crash comes from dm_io -> async_io / sync_io
-> dispatch_io. It seems dm-raid1, rather than dm-raid, can call it, so I
suppose the io is for a mirror image.

The io should be from another path (dm_submit_bio ->
dm_split_and_process_bio -> __split_and_process_bio -> __map_bio, which
sets "bi_end_io = clone_endio").
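
To make that path concrete, here is a toy user-space sketch of the hook
mechanism only. It is not kernel code: every toy_* name is hypothetical,
and the real struct bio and clone_endio() in dm.c do far more. It just
shows that the mapping step installs the completion callback, and that
whoever later completes the clone ends up running it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct bio: only the completion hook and a private pointer. */
struct toy_bio {
    void (*bi_end_io)(struct toy_bio *bio);
    void *bi_private;
};

/* Toy per-I/O bookkeeping: counts how many clones have completed. */
struct toy_io {
    int completed;
};

/* Hypothetical analogue of clone_endio(): runs when a clone finishes. */
static void toy_clone_endio(struct toy_bio *clone)
{
    struct toy_io *io = clone->bi_private;

    io->completed++;
}

/* Hypothetical analogue of the __map_bio() step that installs the hook. */
static void toy_map_bio(struct toy_bio *clone, struct toy_io *io)
{
    clone->bi_private = io;
    clone->bi_end_io = toy_clone_endio;
}

/* Hypothetical analogue of bio completion: the completer calls the hook. */
static void toy_bio_endio(struct toy_bio *clone)
{
    assert(clone->bi_end_io != NULL);
    clone->bi_end_io(clone);
}
```

In this model the completer (dm_delay's flush_expired_bios in the trace)
never names clone_endio itself. It only follows the pointer that the
mapping step stored earlier.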

My guess is that there is a race condition between "lvchange --rebuild"
and raid_dtr, since it was reproduced by running the command in a loop.
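
The kind of ordering hazard I have in mind can be sketched in user-space
C. This only illustrates the guess, it is not a diagnosis: all toy_*
names are hypothetical, and the model assumes a single owner whose pool
goes away at teardown while an io may still be in flight.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy pool: just counts how many objects were returned to it. */
struct toy_pool {
    int free_count;
};

/* Toy owner of the pool, tracking ios that were submitted but not completed. */
struct toy_owner {
    struct toy_pool *pool;
    int in_flight;
};

/* An io is handed to the lower layer and is now in flight. */
static void toy_submit(struct toy_owner *o)
{
    o->in_flight++;
}

/* Teardown drops the pool without looking at in_flight (the suspected bug). */
static void toy_destroy(struct toy_owner *o)
{
    o->pool = NULL;
}

/*
 * Completion returns the io to the pool. Returns false when the pool is
 * already gone, which is the point where real code would dereference a
 * stale or NULL pointer instead of returning.
 */
static bool toy_complete(struct toy_owner *o)
{
    o->in_flight--;
    if (o->pool == NULL)
        return false;
    o->pool->free_count++;
    return true;
}

/* A teardown path that is safe in this model waits for this to be true. */
static bool toy_safe_to_destroy(const struct toy_owner *o)
{
    return o->in_flight == 0;
}
```

With submit -> complete -> destroy, the completion succeeds. With
submit -> destroy -> complete, it finds the pool gone, which is the
ordering that repeated runs in a loop would eventually hit if such a
race exists.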

Anyway, we can revert the mentioned commit and go back to Neil's
solution [1], but I'd like to reproduce it and learn DM a bit.

[1] https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@xxxxxxxxx/T/#m95ac225cab7409f66c295772483d091084a6d470

Thanks,
Guoqing