@ccing Martin hoping he has an opinion on the write-zeroes interface.

On 2022-10-11 19:09, Xiao Ni wrote:
> Hi Logan
>
> I did a test with the patchset. There is a problem like this:
>
> mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero
> mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme1n1
> mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme2n1
> mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme0n1
>
> I ran ctrl+c while waiting, and after that the raid can't be created anymore,
> because the processes that write zeros to the nvme devices are stuck:
>
> ps auxf | grep mdadm
> root  68764  0.0  0.0   9216  1104 pts/0  S+  21:09  0:00  \_ grep --color=auto mdadm
> root  68633  0.1  0.0  27808   336 pts/0  D   21:04  0:00  mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero
> root  68634  0.2  0.0  27808   336 pts/0  D   21:04  0:00  mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero
> root  68635  0.0  0.0  27808   336 pts/0  D   21:04  0:00  mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero

Yes, this is because the fallocate() call that the child processes use to
write zeros submits a large number of bios in the kernel and then waits with
submit_bio_wait(), which is non-interruptible. So when the child processes get
the SIGINT, they will not stop until the fallocate() call completes, which is
essentially once the entire disk has been zeroed. If you are zeroing a very
large disk, those processes will stick around for several minutes after the
parent process terminates, though they do go away eventually.

There aren't many great solutions for this:

1) We could install a signal handler in the parent so it sticks around until
   the zeroing is complete. This would mean mdadm could not be terminated
   while the zeroing is in progress and the user would have to wait.

2) We could split the fallocate() call into multiple calls that together zero
   the entire disk (see the sketch below). This would let ctrl-c take effect
   sooner, but it's not clear what the best chunk size would be. Even zeroing
   1GB can take a few seconds, and the smaller we go, the less efficient it
   becomes if the block layer and devices ever get write-zeroes optimized the
   way discard has been (with NVMe, discard needs only a single command for
   the entire disk, whereas write-zeroes requires at least one command per
   2MB of data to zero). I was hoping write-zeroes could be made faster in the
   future, at least for NVMe.

Thoughts?

Logan
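
P.S. For concreteness, a minimal sketch of what option 2 could look like in
the child process; this is not from the posted patchset, and the 1GiB chunk
size and the sigint_seen flag are just assumptions for illustration:

#define _GNU_SOURCE             /* for fallocate() */
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <linux/falloc.h>       /* FALLOC_FL_ZERO_RANGE */

static volatile sig_atomic_t sigint_seen;

static void on_sigint(int sig)
{
        (void)sig;
        sigint_seen = 1;
}

/*
 * Zero [start, end) on the block device fd in 1GiB chunks so that a
 * pending SIGINT is acted on between fallocate() calls rather than
 * only after the whole device has been zeroed.
 */
static int zero_range_chunked(int fd, uint64_t start, uint64_t end)
{
        const uint64_t chunk = 1ULL << 30;      /* 1GiB per call (assumed) */
        struct sigaction sa = { .sa_handler = on_sigint };

        sigaction(SIGINT, &sa, NULL);

        while (start < end && !sigint_seen) {
                uint64_t len = end - start;

                if (len > chunk)
                        len = chunk;

                /* each call goes through the kernel's write-zeroes path */
                if (fallocate(fd, FALLOC_FL_ZERO_RANGE, start, len))
                        return -1;

                start += len;
        }

        return sigint_seen ? -1 : 0;
}

Each individual fallocate() still blocks uninterruptibly, but the window
shrinks to one chunk, so the trade-off is exactly the chunk-size question
above.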