Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create

Logan Gunthorpe <logang@xxxxxxxxxxxx> · Fri, 9 Sep 2022 09:47:21 -0600

On 2022-09-09 03:57, Mariusz Tkaczyk wrote:
>> If all the discard requests are successful and there are no missing
>> disks thin it is safe to set assume_clean as we know the array is clean.
> 
> Please update message. We agreed in v1 that missing disks and discard features
> are not related, right?

Oops, yes, I'll update the commit message for v3.

>> +static int discard_device(struct context *c, int fd, const char *devname,
>> +			  unsigned long long offset, unsigned long long size)
> 
> Will be great if you can description.

Ok, will do for v3.

>> +{
>> +	uint64_t range[2] = {offset, size};
> Probably you don't need to specify [2] but it is not an issue I think.
> 
>> +	unsigned long buf[4096 / sizeof(unsigned long)];
> 
> Can you use any define for 4096? 

I don't see any appropriate defines in the code base. It really just
needs to be bigger than any O_DIRECT restrictions. 4096 bytes is usually
the worst case.

>> +	unsigned long i;
>> +
>> +	if (c->verbose)
>> +		printf("discarding data from %lld to %lld on: %s\n",
>> +		       offset, size, devname);
>> +
>> +	if (ioctl(fd, BLKDISCARD, &range)) {
>> +		pr_err("discard failed on '%s': %m\n", devname);
>> +		return 1;
>> +	}
>> +
>> +	if (pread(fd, buf, sizeof(buf), offset) != sizeof(buf)) {
>> +		pr_err("failed to readback '%s' after discard: %m\n",
>> devname);
>> +		return 1;
>> +	}
>> +
>> +	for (i = 0; i < ARRAY_SIZE(buf); i++) {
>> +		if (buf[i]) {
>> +			pr_err("device did not read back zeros after discard
>> on '%s': %lx\n",
>> +			       devname, buf[i]);
> In previous version I wanted to leave the message on stderr, but just move a
> data (buf[i]) to debug, or if (verbose > 0).
> I think that printing binary data in error message is not necessary.

I added the hex because it might be informative to know what a discard
did to the device (all FFs or random data).

> BTW. I'm not sure if discard ensures that data will be all zero. It causes that
> drive drops all references but I doesn't mean that data is zeroed. Could you
> please check it in documentation? Should we expect zeroes?

That's correct. I discussed this in the cover letter. That's why this
check is here. Per some of the discussion from others I still think the
best course of action is to just check what the discard did and fail if
it is non-zero. Even though many NVMe and ATA devices have the ability
to control or query the behaviour, the kernel doesn't support this and
I don't think it can be relied upon.

>> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
>>  				}
>>  				if (fd >= 0)
>>  					remove_partitions(fd);
>> +
>> +				if (s->discard &&
>> +				    discard_device(c, fd, dv->devname,
>> +						   dv->data_offset << 9,
>> +						   s->size << 10)) {
>> +					ioctl(mdfd, STOP_ARRAY, NULL);
>> +					goto abort_locked;
>> +				}
>> +
> Feel free to use up to 100 char in one line it is allowed now.
> Why we need dv->data_offset << 9 and  s->size << 10 here?
> How this applies to zoned raid0?

As I understand it the offset and size will give the bounds of the
data region on the disk. Do you not think it works for zoned raid0?

>> diff --git a/mdadm.c b/mdadm.c
>> index 972adb524dfb..049cdce1cdd2 100644
>> --- a/mdadm.c
>> +++ b/mdadm.c
>> @@ -602,6 +602,10 @@ int main(int argc, char *argv[])
>>  			s.assume_clean = 1;
>>  			continue;
>>  
>> +		case O(CREATE, Discard):
>> +			s.discard = true;
>> +			continue;
>> +
> 
> I would like to set s.assume_clean=true along with discard. Then will be no need
> to modify other conditions. If we are assuming that after discard all is zeros
> then we can skip resync, right? According to message, it should be.
> Please add message for user and set assume_clean too.

Well it was my opinion that it was clearer in the code to just
explicitly include discard in the conditionals instead of making discard
also set assume-clean, but if you think otherwise I can change it for v3.

What kind of user message are you thinking is necessary here?

Logan