On 3/1/23 15:41, Logan Gunthorpe wrote:
> Hi,
>
> This is the next iteration of the patchset to add a zeroing option
> which bypasses the initial sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.
>
> This patch set adds the --write-zeroes option which will imply
> --assume-clean and write zeros to the data region in each disk before
> starting the array. This can take some time so each disk is done in
> parallel in its own fork. To make the forking code easier to
> understand this patch set also starts with some cleanup of the
> existing Create code.
>
> We tested write-zeroes requests on a number of modern nvme drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
>
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command whereas
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits available
> in the write-zeroes command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
>
> Logan
>
> --
>
> Changes since v6:
>  * Collected review and ack tags from Xiao, Chaitanya and Coly
>  * Adjust the error reporting to use strerror() instead of the
>    glibc %m extension. (per Coly)
>  * Fix a typo in the man page ("despit" should have been "despite")
>    (as noticed by Coly)
>
> Changes since v5:
>  * Ensure 'interrupted' is initialized in wait_for_zero_forks().
>    (as noticed by Xiao)
>  * Print a message indicating that the zeroing was interrupted.
>
> Changes since v4:
>  * Handle SIGINT better. Previous versions would leave the zeroing
>    processes behind after the main thread exited which would
>    continue zeroing in the background (possibly for some time).
>    This version splits the zero fallocate commands up so they can be
>    interrupted quicker, and intercepts SIGINT in the main thread
>    to print an appropriate message and wait for the threads
>    to finish up. (as noticed by Xiao)
>
> Changes since v3:
>  * Store the pid in a local variable instead of the mdinfo struct
>    (per Mariusz and Xiao)
>
> Changes since v2:
>
>  * Use write-zeroes instead of discard to zero the disks (per
>    Martin)
>  * Due to the time required to zero the disks, each disk is
>    now done in parallel with separate forks of the process.
>  * In order to add the forking some refactoring was done on the
>    Create() function to make it easier to understand
>  * Added a pr_info() call so that some prints can be done
>    to stdout instead of stderr (per Mariusz)
>  * Added KIB_TO_BYTES and SEC_TO_BYTES helpers (per Mariusz)
>  * Added a test to the mdadm test suite to test the option
>    works.
>  * Fixed up how the size and offset are calculated with some
>    great information from Xiao.
>
> Changes since v1:
>
>  * Discard the data in the devices later in the create process
>    while they are already open. This requires treating the
>    s.discard option the same as the s.assume_clean option.
>    Per Mariusz.
>  * A couple other minor cleanup changes from Mariusz.
>
> --
>
> Logan Gunthorpe (7):
>   Create: goto abort_locked instead of return 1 in error path
>   Create: remove safe_mode_delay local variable
>   Create: Factor out add_disks() helpers
>   mdadm: Introduce pr_info()
>   mdadm: Add --write-zeros option for Create
>   tests/00raid5-zero: Introduce test to exercise --write-zeros.
>   manpage: Add --write-zeroes option to manpage
>
>  Create.c           | 565 +++++++++++++++++++++++++++++++--------------
>  ReadMe.c           |   2 +
>  mdadm.8.in         |  18 +-
>  mdadm.c            |   9 +
>  mdadm.h            |   7 +
>  tests/00raid5-zero |  12 +
>  6 files changed, 437 insertions(+), 176 deletions(-)
>  create mode 100644 tests/00raid5-zero
>
>
> base-commit: f1f3ef7d2de5e3a726c27b9f9bb20e270a100dab

All applied!

Thanks,
Jes
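
A side note on the v4 changelog entry quoted above: splitting the zeroing into smaller requests is what lets an interrupt take effect quickly, since the stop condition can be checked between requests. The following is only a rough Python sketch of that idea, not the mdadm code (which is C and, per the cover letter, issues zero fallocate commands). The names chunk_ranges, zero_region and should_stop are hypothetical, and plain zero writes stand in for the fallocate requests.

```python
import os


def chunk_ranges(offset, length, chunk):
    """Yield (start, size) pairs that cover [offset, offset + length)."""
    end = offset + length
    pos = offset
    while pos < end:
        size = min(chunk, end - pos)
        yield pos, size
        pos += size


def zero_region(fd, offset, length, chunk, should_stop=lambda: False):
    """Zero a byte range chunk by chunk; return the number of bytes zeroed.

    should_stop is polled before each chunk, so a stop request takes
    effect after at most one chunk's worth of work. (Assumption: in a
    real tool this flag would be set from a SIGINT handler.)
    """
    done = 0
    for start, size in chunk_ranges(offset, length, chunk):
        if should_stop():
            break
        # Plain zero writes stand in for zero-range fallocate requests.
        buf = bytes(size)
        while buf:
            written = os.pwrite(fd, buf, start)
            buf = buf[written:]
            start += written
            done += written
    return done
```

A smaller chunk gives a quicker reaction to an interrupt at the cost of more requests; the chunk size in this sketch is a free parameter and says nothing about the value mdadm uses.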