@ccing Martin hoping he has an opinion on the write-zeroes interface.

On 2022-10-11 19:09, Xiao Ni wrote:
> Hi Logan
>
> I did a test with the patchset. There is a problem like this:
>
> mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero
> mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme1n1
> mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme2n1
> mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme0n1
>
> I ran ctrl+c while waiting, and after that the raid can't be created anymore,
> because the processes that write zeros to the nvme devices are stuck:
>
> ps auxf | grep mdadm
> root  68764  0.0  0.0   9216  1104 pts/0  S+  21:09  0:00  \_ grep --color=auto mdadm
> root  68633  0.1  0.0  27808   336 pts/0  D   21:04  0:00  mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero
> root  68634  0.2  0.0  27808   336 pts/0  D   21:04  0:00  mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero
> root  68635  0.0  0.0  27808   336 pts/0  D   21:04  0:00  mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero

Yes, this is because the fallocate() call that the child processes use to
write zeros submits a large number of bios in the kernel and then waits with
submit_bio_wait(), which is non-interruptible. So when the child processes get
the SIGINT, they will not stop until the fallocate() call completes, which is
essentially once the entire disk has been zeroed. If you are zeroing a very
large disk, those processes will stick around for several minutes after the
parent process terminates, though they do go away eventually.

There aren't many great solutions for this:

1) We could install a signal handler in the parent so it sticks around until
   the zeroing is complete. This would mean mdadm could not be terminated
   while the zeroing is in progress and the user would have to wait.

2) We could split the fallocate() call into multiple calls that together zero
   the entire disk (see the sketch below). This would let ctrl-c take effect
   sooner, but it's not clear what the best chunk size would be. Even zeroing
   1GB can take a few seconds, and the smaller we go, the less efficient it
   becomes if the block layer and devices ever get write-zeroes optimized the
   way discard has been (with NVMe, discard needs only a single command for
   the entire disk, whereas write-zeroes requires at least one command per
   2MB of data to zero). I was hoping write-zeroes could be made faster in the
   future, at least for NVMe.

Thoughts?

Logan
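
P.S. For concreteness, a minimal sketch of what option 2 could look like in
the child process; this is not from the posted patchset, and the 1GiB chunk
size and the sigint_seen flag are just assumptions for illustration:

#define _GNU_SOURCE             /* for fallocate() */
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <linux/falloc.h>       /* FALLOC_FL_ZERO_RANGE */

static volatile sig_atomic_t sigint_seen;

static void on_sigint(int sig)
{
        (void)sig;
        sigint_seen = 1;
}

/*
 * Zero [start, end) on the block device fd in 1GiB chunks so that a
 * pending SIGINT is acted on between fallocate() calls rather than
 * only after the whole device has been zeroed.
 */
static int zero_range_chunked(int fd, uint64_t start, uint64_t end)
{
        const uint64_t chunk = 1ULL << 30;      /* 1GiB per call (assumed) */
        struct sigaction sa = { .sa_handler = on_sigint };

        sigaction(SIGINT, &sa, NULL);

        while (start < end && !sigint_seen) {
                uint64_t len = end - start;

                if (len > chunk)
                        len = chunk;

                /* each call goes through the kernel's write-zeroes path */
                if (fallocate(fd, FALLOC_FL_ZERO_RANGE, start, len))
                        return -1;

                start += len;
        }

        return sigint_seen ? -1 : 0;
}

Each individual fallocate() still blocks uninterruptibly, but the window
shrinks to one chunk, so the trade-off is exactly the chunk-size question
above.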