Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 9 Sep 2022 09:47:21 -0600
Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
> >> +{
> >> +	uint64_t range[2] = {offset, size};  
> > Probably you don't need to specify [2] but it is not an issue I think.
> >   
> >> +	unsigned long buf[4096 / sizeof(unsigned long)];  
> > 
> > Can you use any define for 4096?   
> 
> I don't see any appropriate defines in the code base. It really just
> needs to be bigger than any O_DIRECT restrictions. 4096 bytes is usually
> the worst case.

See comment bellow.
> 
> >> +	unsigned long i;
> >> +
> >> +	if (c->verbose)
> >> +		printf("discarding data from %lld to %lld on: %s\n",
> >> +		       offset, size, devname);
> >> +
> >> +	if (ioctl(fd, BLKDISCARD, &range)) {
> >> +		pr_err("discard failed on '%s': %m\n", devname);
> >> +		return 1;
> >> +	}
> >> +
> >> +	if (pread(fd, buf, sizeof(buf), offset) != sizeof(buf)) {
> >> +		pr_err("failed to readback '%s' after discard: %m\n",
> >> devname);
> >> +		return 1;
> >> +	}
> >> +
> >> +	for (i = 0; i < ARRAY_SIZE(buf); i++) {
> >> +		if (buf[i]) {
> >> +			pr_err("device did not read back zeros after
> >> discard on '%s': %lx\n",
> >> +			       devname, buf[i]);  
> > In previous version I wanted to leave the message on stderr, but just move a
> > data (buf[i]) to debug, or if (verbose > 0).
> > I think that printing binary data in error message is not necessary.  
> 
> I added the hex because it might be informative to know what a discard
> did to the device (all FFs or random data).
> 
> > BTW. I'm not sure if discard ensures that data will be all zero. It causes
> > that drive drops all references but I doesn't mean that data is zeroed.
> > Could you please check it in documentation? Should we expect zeroes?  
> 
> That's correct. I discussed this in the cover letter. That's why this
> check is here. Per some of the discussion from others I still think the
> best course of action is to just check what the discard did and fail if
> it is non-zero. Even though many NVMe and ATA devices have the ability
> to control or query the behaviour, the kernel doesn't support this and
> I don't think it can be relied upon.

It is also controversial approach[1].

I think that the best we can do here is to warn user, for example:
"Please ensure that drive you used return zeros after discard."
We should ask for confirmation (it should be possible to skip with --run).
I would like to leave discard feature validation on user side, it is not mdadm
task. Simply, if you want to use it, then it is done on your own risk and
hopefully you know what you want to achieve.
That will simplify implementation on mdadm side. What do you think?

> >> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
> >>  				}
> >>  				if (fd >= 0)
> >>  					remove_partitions(fd);
> >> +
> >> +				if (s->discard &&
> >> +				    discard_device(c, fd, dv->devname,
> >> +						   dv->data_offset << 9,
> >> +						   s->size << 10)) {
> >> +					ioctl(mdfd, STOP_ARRAY, NULL);
> >> +					goto abort_locked;
> >> +				}
> >> +  
> > Feel free to use up to 100 char in one line it is allowed now.
> > Why we need dv->data_offset << 9 and  s->size << 10 here?
> > How this applies to zoned raid0?  
> 
> As I understand it the offset and size will give the bounds of the
> data region on the disk. Do you not think it works for zoned raid0?

mdadm operates on 512B, so using 4K data regions could be destructive.
Also left shift causes that size value is increasing. We can't clear more that
user requested. We need to use 512b sectors as mdadm does.

I don't know how native raid0 size is passed and how zones are created but I
suspect that s->size may not be correct for all drives. It it a global property
but for zoned raid member size could be different. You need to check how it
applies.

Also, I'm not sure if this feature is needed for raid0, because there is no
resync. Maybe we can exclude raid0 from discard?
> 
> >> diff --git a/mdadm.c b/mdadm.c
> >> index 972adb524dfb..049cdce1cdd2 100644
> >> --- a/mdadm.c
> >> +++ b/mdadm.c
> >> @@ -602,6 +602,10 @@ int main(int argc, char *argv[])
> >>  			s.assume_clean = 1;
> >>  			continue;
> >>  
> >> +		case O(CREATE, Discard):
> >> +			s.discard = true;
> >> +			continue;
> >> +  
> > 
> > I would like to set s.assume_clean=true along with discard. Then will be no
> > need to modify other conditions. If we are assuming that after discard all
> > is zeros then we can skip resync, right? According to message, it should be.
> > Please add message for user and set assume_clean too.  
> 
> 
> Well it was my opinion that it was clearer in the code to just
> explicitly include discard in the conditionals instead of making discard
> also set assume-clean, but if you think otherwise I can change it for v3.
> 
> What kind of user message are you thinking is necessary here?

In my convention, all shape attributes should be set in mdadm.c, later they
should be considered as const, we should not overwrite them. This structure
represents user settings.
This is why I consider updating assume_clean in Create.c as wrong. When discard
is set then we are assuming that user want to skip resync too, otherwise it
doesn't have sense. Also, if discard of any drive fails we are returning error
and create operation will fail anyway.

I would expected something like: "Discard requested, setting --assume-clean to
skip resync".
Also, will be great if you can add some tests, at least for command-line.

[1] https://lore.kernel.org/linux-raid/yq1sfkw7yqi.fsf@xxxxxxxxxxxxxxxxxxxx/T/#t

Thanks,
Mariusz



[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux