Hi Logan,

I like this idea, but I have a question. If we discard the member
disks and then create the RAID device with --assume-clean, we should
get the same result. Is the reason for adding --write-zeroes to do
this automatically?

Regards
Xiao

On Thu, Sep 22, 2022 at 4:44 AM Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> This is the next iteration of the patchset that added the discard
> option to mdadm. Per feedback from Martin, it's more desirable
> to use the write-zeroes functionality than rely on devices to zero
> the data on a discard request. This is because standards typically
> only require the device to make a best effort to discard data, and
> it may not actually discard (and thus zero) it all in some
> circumstances.
>
> This version of the patch set adds the --write-zeroes option which
> will imply --assume-clean and write zeros to the data region in
> each disk before starting the array. This can take some time so
> each disk is done in parallel in its own fork. To make the forking
> code easier to understand this patch set also starts with some
> cleanup of the existing Create code.
>
> We tested write-zeroes requests on a number of modern NVMe drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple of drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
>
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command whereas
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits available
> in the write-zeroes command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
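
A rough back-of-the-envelope check of the size limit mentioned above,
written in Python for brevity. It assumes the 16-bit block count is
zero-based (so one command covers at most 65536 logical blocks) and
assumes LBA sizes of 512 and 4096 bytes; the 2 MiB figure stands in for
the "about 2MB" per-command limit quoted above. The helper names are
made up for illustration.

```python
# Upper bound implied by a 16-bit, zero-based block count in a single
# write-zeroes command, for two assumed LBA sizes.
NLB_BITS = 16
MAX_BLOCKS = 1 << NLB_BITS  # a zero-based 16-bit field encodes 1..65536 blocks
MIB = 1024 * 1024


def max_bytes_per_command(lba_size: int) -> int:
    """Largest range one command could cover for a given LBA size."""
    return MAX_BLOCKS * lba_size


def commands_needed(total_bytes: int, bytes_per_command: int) -> int:
    """Number of commands needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // bytes_per_command)


if __name__ == "__main__":
    for lba_size in (512, 4096):  # assumed LBA formats
        limit = max_bytes_per_command(lba_size)
        print(f"LBA {lba_size} B: up to {limit // MIB} MiB per command")

    observed_limit = 2 * MIB  # stand-in for the ~2MB limit reported above
    print("commands per 100GB at ~2MB each:",
          commands_needed(100 * 10**9, observed_limit))
```

Under these assumptions the field itself would allow tens to hundreds
of MiB per command, which is consistent with the remark that drives
do not max it out.
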
>
> Logan
>
> --
>
> Changes since v2:
>
>    * Use write-zeroes instead of discard to zero the disks (per
>      Martin)
>    * Due to the time required to zero the disks, each disk is
>      now done in parallel with separate forks of the process.
>    * In order to add the forking some refactoring was done on the
>      Create() function to make it easier to understand
>    * Added a pr_info() call so that some prints can be done
>      to stdout instead of stderr (per Mariusz)
>    * Added KIB_TO_BYTES and SEC_TO_BYTES helpers (per Mariusz)
>    * Added a test to the mdadm test suite to test the option
>      works.
>    * Fixed up how the size and offset are calculated with some
>      great information from Xiao.
>
> Changes since v1:
>
>    * Discard the data in the devices later in the create process
>      while they are already open. This requires treating the
>      s.discard option the same as the s.assume_clean option.
>      Per Mariusz.
>    * A couple other minor cleanup changes from Mariusz.
>
> --
>
> Logan Gunthorpe (7):
>   Create: goto abort_locked instead of return 1 in error path
>   Create: remove safe_mode_delay local variable
>   Create: Factor out add_disks() helpers
>   mdadm: Introduce pr_info()
>   mdadm: Add --write-zeros option for Create
>   tests/00raid5-zero: Introduce test to exercise --write-zeros.
>   manpage: Add --write-zeroes option to manpage
>
>  Create.c           | 476 ++++++++++++++++++++++++++++-----------------
>  ReadMe.c           |   2 +
>  mdadm.8.in         |  16 ++
>  mdadm.c            |   9 +
>  mdadm.h            |   9 +
>  tests/00raid5-zero |  12 ++
>  6 files changed, 349 insertions(+), 175 deletions(-)
>  create mode 100644 tests/00raid5-zero
>
>
> base-commit: 171e9743881edf2dfb163ddff483566fbf913ccd
> --
> 2.30.2
>
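
For anyone following the thread, the "one fork per disk" approach
described in the cover letter can be sketched roughly as below. This is
not the code from the patches (mdadm is written in C); it is a minimal
Python illustration that assumes plain zero-filled writes, similar in
spirit to the kernel fallback of writing zero pages, instead of a real
write-zeroes request. The names zero_region() and
zero_disks_in_parallel() and the (path, offset, size) tuples are
hypothetical.

```python
import os

CHUNK = 1024 * 1024  # write zeros in pieces of at most 1 MiB


def zero_region(path, offset, size):
    """Overwrite bytes [offset, offset + size) of path with zeros."""
    zeros = bytes(CHUNK)
    fd = os.open(path, os.O_WRONLY)
    try:
        pos, end = offset, offset + size
        while pos < end:
            n = min(CHUNK, end - pos)
            # pwrite may write fewer bytes than asked; advance by what it did
            pos += os.pwrite(fd, zeros[:n], pos)
        os.fsync(fd)
    finally:
        os.close(fd)


def zero_disks_in_parallel(regions):
    """Zero each (path, offset, size) in its own child process.

    Returns True only if every child exited with status 0.
    """
    pids = []
    for path, offset, size in regions:
        pid = os.fork()
        if pid == 0:
            # Child: zero one device and report the result via exit status.
            status = 1
            try:
                zero_region(path, offset, size)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        pids.append(pid)

    ok = True
    for pid in pids:
        # Parent: wait for every child so no zeroing is still in flight.
        _, status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            ok = False
    return ok
```

The parent only proceeds once all children have finished, which
mirrors the requirement that the data regions are zeroed before the
array is started. The sketch is safe to try on scratch files; pointing
it at real block devices would destroy their contents.
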