Re: Linux Plumbers MD BOF discussion notes

On Sat, Sep 16, 2017 at 10:08:06AM +1000, Neil Brown wrote:
> 
> Sounds like an interesting, wide ranging discussion...
> 
> 
> 
> On Fri, Sep 15 2017, Shaohua Li wrote:
> 
> > This is a short note based on Song's record. Please reply to the list if
> > anything is missing.
> >
> > *IMSM - PPL
> > Fix the write hole without an extra device; updated status and upcoming mdadm
> > changes to support it. Intel engineers are improving it, e.g. fixing the
> > current 'disable disk cache' problem.
> >
> > *Hiding member drives
> > Hiding RAID array member drives from the user, so an MD RAID array looks more
> > like a hardware RAID array. This turns out to be a real customer requirement.
> > We do need to access the member drives for various reasons (create/assembly,
> > mdmon, iostat). Working around this might be possible, e.g. deleting the
> > /dev/xxx nodes after array assembly. But we must justify the value and also
> > discuss it with the block layer people since this is a general issue.
> 
> 
> "Hiding" is a very vague term.  Should we get Harry Potter's
> invisibility cloak and wrap it around the hardware?
> Do we need to:
>   - remove from /proc/partitions - possible and possibly sane
>   - remove from /dev - easy, given clear justification
>   - remove from /sys/block - I don't think this is justifiable
>   - make open() impossible - it already is if you use O_EXCL
> ??
> 
> Possibly something sensible could be done, but we do need a clear
> statement of, and justification for, the customer requirement.

Agree. The requirement isn't very clear right now.
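Regarding the O_EXCL point: a member drive that an assembled array has claimed
already refuses exclusive opens, so "keep hands off" exists today and the open
question is really about visibility (/dev, /proc/partitions, /sys/block). A
minimal userspace sketch of that check (the device path is just an example):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* Example path; pass the suspected member drive as argv[1]. */
        const char *dev = argc > 1 ? argv[1] : "/dev/sdb";
        int fd = open(dev, O_RDONLY | O_EXCL);

        if (fd < 0) {
                if (errno == EBUSY)
                        printf("%s is claimed (likely an active array member)\n", dev);
                else
                        fprintf(stderr, "open(%s): %s\n", dev, strerror(errno));
                return 1;
        }
        printf("%s is not exclusively claimed\n", dev);
        close(fd);
        return 0;
}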
 
> >
> > *Block-mq
> > Converting MD to use blk-mq? md is a bio-based (not request-based) driver, so
> > there is no value in converting it to blk-mq. md dispatches bios directly to
> > the low-level disks. blk-mq still helps if a low-level disk supports it, but
> > that is transparent to md.
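As an aside, a rough sketch of why blk-mq does not apply at this level: a
bio-based driver never builds struct requests, it only remaps bios and resubmits
them to the member devices' queues, which may themselves be blk-mq. This is
illustrative pseudocode, not the actual md code; member_bdev and data_offset are
placeholders and the API names are roughly the 4.13-era ones.

static blk_qc_t example_make_request(struct request_queue *q, struct bio *bio)
{
        /* Pick a member device and remap the sector (placeholder logic). */
        bio->bi_bdev = member_bdev;
        bio->bi_iter.bi_sector += data_offset;

        /*
         * Resubmit to the member device's own queue; if that device uses
         * blk-mq, the array benefits without md knowing anything about it.
         */
        generic_make_request(bio);
        return BLK_QC_T_NONE;
}

Such a driver registers this with blk_queue_make_request() at setup time; its
own queue never carries struct requests, so there is nothing for blk-mq to
manage at the md level.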
> >  
> > *NVDIMM caching
> > NVDIMM supports a block interface, so using it as a raid5 cache disk should be
> > straightforward.
> > Directly storing the raid5 stripe cache in NVDIMM without the current
> > raid5-cache log device? There are problems, for example how to detect/fix data
> > mismatches after a power failure. It would need major changes to the raid5 code.
> >  
> > *stream ID
> > Support stream IDs in MD. It should be fairly easy to support stream IDs in
> > raid0/1/10. Intel engineers described a scenario in raid5 which breaks stream
> > IDs, e.g. writing stripe data multiple times because of read-modify-write
> > (clarify?). Detecting the IO pattern, as DM does, can probably help.
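For context on where the IDs come from: userspace sets a write lifetime hint
with fcntl(F_SET_RW_HINT) (merged in 4.13) and it travels with the IO as
bio->bi_write_hint, so raid0/1/10 would mostly need to copy the hint onto the
bios sent to the members. A small sketch of the userspace side; the file name
is an example and the fallback defines are only for older libc headers:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Values from linux/fcntl.h, repeated in case the libc headers predate 4.13. */
#ifndef F_SET_RW_HINT
#define F_LINUX_SPECIFIC_BASE   1024
#define F_SET_RW_HINT           (F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT    2
#endif

int main(void)
{
        uint64_t hint = RWH_WRITE_LIFE_SHORT;   /* "this data is short-lived" */
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
                perror("set write hint");
                return 1;
        }
        /* Subsequent writes on this fd carry the hint down to the block layer. */
        write(fd, "hello\n", 6);
        close(fd);
        return 0;
}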
> >
> > *split/merge problem
> > The md layer splits bios and the block layer then merges them again for the
> > low-level disks. The merge/split overhead is noticeable for raid0 with fast
> > SSDs and a small chunk size. Fixing the issue for raid0 is doable; whether it
> > can be fixed for raid5 is unclear. We discussed increasing the raid5 stripe
> > size to reduce the split/merge overhead. There is a tradeoff here, for example
> > more unnecessary IO for read-modify-write with a bigger stripe size.
> 
> For raid5 I can understand this being an issue as raid5 only submits
> PAGE_SIZE bios to lower level devices.
> The batching that Shaohua added might be a good starting point.  If you
> have a batch of stripes, you should be able to submit one bio per device
> for the whole batch.
> 
> For RAID0, I don't understand the problem.  RAID0 never splits smaller
> than the chunk size, and that is a firm requirement.
> Maybe RAID0 could merge the bios itself rather than passing them down.
> Maybe that would help.  So if a request is properly aligned and covers
> several stripes, that could be mapped to one-bio-per-device.  Is that
> the goal?

Yes, I think one-bio-per-device is the goal. Splitting bios according to the
chunk size and then merging them again at the low-level disk does take CPU time.
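To make one-bio-per-device concrete, here is the basic raid0 mapping (single
zone, ignoring data offsets; a standalone sketch, not the kernel code).
Consecutive chunks rotate across the members, so a large chunk-aligned request
covering a whole stripe maps to one contiguous range per member and could in
principle become one bio per device instead of one bio per chunk:

#include <stdio.h>

/*
 * Map an array sector to (member device, sector on that device) for a
 * single-zone raid0 layout.  chunk_sects is the chunk size in 512-byte
 * sectors, nr_devs the number of member devices.
 */
static void raid0_map(unsigned long long sector, unsigned int chunk_sects,
                      unsigned int nr_devs,
                      unsigned int *dev, unsigned long long *dev_sector)
{
        unsigned long long chunk = sector / chunk_sects;   /* global chunk index */
        unsigned long long stripe = chunk / nr_devs;       /* stripe (row) index */

        *dev = chunk % nr_devs;
        *dev_sector = stripe * chunk_sects + sector % chunk_sects;
}

int main(void)
{
        /*
         * Example: 4 members, 128-sector (64KiB) chunks.  A 256KiB aligned
         * request covers one full stripe: one chunk per member, and each
         * member's piece is contiguous.
         */
        unsigned int dev;
        unsigned long long dev_sector;

        for (unsigned long long s = 0; s < 4 * 128ULL; s += 128) {
                raid0_map(s, 128, 4, &dev, &dev_sector);
                printf("array sector %llu -> dev %u, sector %llu\n",
                       s, dev, dev_sector);
        }
        return 0;
}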

> >
> > *Testing
> > md needs to recover data after disk failures. mdadm has a test suite, but it
> > does not cover all cases and is fragile; it may even kill the machine.
> > We need to build more complete tests.
> >
> > The recent null_blk block device driver can emulate several types of disk
> > failures. The plan is to make null_blk support all disk failures which md can
> > handle and create a test suite using null_blk. Help is welcome!
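Independent of null_blk, quite a few failure cases can already be driven
through the existing md sysfs interface by marking a member faulty (same effect
as "mdadm --fail") and watching recovery. A small sketch of such a test step;
the array and member names are examples:

#include <stdio.h>

/* Write a keyword to an md sysfs attribute. */
static int md_sysfs_write(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        /* Simulate a failure of member sdb in /dev/md0 ... */
        md_sysfs_write("/sys/block/md0/md/dev-sdb/state", "faulty");
        /*
         * ... then a test harness would poll /sys/block/md0/md/sync_action
         * until recovery finishes and verify the data afterwards.
         */
        return 0;
}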
> >  
> > *RAID-1 RAID-10 barrier inconsistency
> > Coly improved the barrier scalability for raid1; hopefully he can do the same
> > for raid10.
> >  
> > *DAX
> > Supporting DAX in raid0/linear should not be hard. Does it make sense to
> > support it for other raid types?
> >
> > *sysfs / ioctl
> > Jes started working on this. The goal is to replace the ioctls with sysfs-based
> > interfaces. There are gaps currently, e.g. some operations can only be done via
> > ioctl. The SUSE developers promised to close the gaps on the kernel side.
> >
> > Using configfs instead of sysfs?
> 
> It seems that no one actually wants this, but I'll just throw in my
> opinion that this is a nonsensical suggestion.  configfs is only for
> people who don't understand sysfs.  It has no real value. 
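For reference, much of the interface already exists under /sys/block/mdX/md/.
A minimal sketch of driving an array through sysfs rather than ioctls; the
attribute names (array_state, level, sync_action) are the standard md ones,
/dev/md0 is just an example, and error handling is trimmed:

#include <stdio.h>

/* Read and print one md sysfs attribute. */
static void md_show(const char *path)
{
        char buf[64] = "?\n";
        FILE *f = fopen(path, "r");

        if (f) {
                fgets(buf, sizeof(buf), f);
                fclose(f);
        }
        printf("%s: %s", path, buf);
}

int main(void)
{
        md_show("/sys/block/md0/md/array_state");   /* e.g. "clean" or "active" */
        md_show("/sys/block/md0/md/level");         /* e.g. "raid5" */
        md_show("/sys/block/md0/md/sync_action");   /* e.g. "idle" or "resync" */

        /* Start a scrub the sysfs way, same as "echo check > sync_action". */
        FILE *f = fopen("/sys/block/md0/md/sync_action", "w");
        if (f) {
                fprintf(f, "check\n");
                fclose(f);
        }
        return 0;
}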
> >  
> > *Stop nested RAID device
> > For example, a raid0 on top of raid5 arrays. Userspace must understand the
> > topology to stop the nested raid arrays in the right order.
> > mdadm stop is async; we need a synchronous option for stopping an array (clarify?)
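As an illustration of the ordering problem: with, say, a raid0 /dev/md2 built
from raid5 arrays /dev/md0 and /dev/md1, the stop has to go top-down because
the lower arrays stay claimed while the upper one is active. A rough sketch
using the STOP_ARRAY ioctl; the device names are examples and this ignores the
udev/delayed-delete races discussed below:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/major.h>        /* MD_MAJOR */
#include <linux/raid/md_u.h>    /* STOP_ARRAY */

/* Stop one md array; the caller must know the stacking order. */
static int stop_md(const char *dev)
{
        int fd = open(dev, O_RDONLY | O_EXCL); /* EBUSY while still in use */
        int ret;

        if (fd < 0) {
                perror(dev);
                return -1;
        }
        ret = ioctl(fd, STOP_ARRAY, NULL);
        if (ret < 0)
                perror(dev);
        close(fd);
        return ret;
}

int main(void)
{
        /* Top-down: the raid0 on top first, then the raid5s underneath. */
        if (stop_md("/dev/md2") == 0) {
                stop_md("/dev/md0");
                stop_md("/dev/md1");
        }
        return 0;
}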
> 
> I've been thinking that it might be useful for md (and dm and loop
> and..) to have a setting whereby it automatically shuts down on last
> close.  This would make it easier to orchestrate shutdown.
> Certainly it would make sense to use such a mode for stacked arrays.

loop does have the setting to automatically shut down on last close.
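For reference, that loop behaviour is the LO_FLAGS_AUTOCLEAR flag; a minimal
sketch of switching it on for an already-bound loop device (/dev/loop0 is just
an example). Something equivalent in md would indeed make tearing down stacked
arrays easier:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>         /* LOOP_*_STATUS64, LO_FLAGS_AUTOCLEAR */

int main(void)
{
        struct loop_info64 info;
        int fd = open("/dev/loop0", O_RDWR);    /* example: already bound to a file */

        if (fd < 0 || ioctl(fd, LOOP_GET_STATUS64, &info) < 0) {
                perror("loop0");
                return 1;
        }
        /* Ask the kernel to tear the device down on last close. */
        info.lo_flags |= LO_FLAGS_AUTOCLEAR;
        if (ioctl(fd, LOOP_SET_STATUS64, &info) < 0) {
                perror("LOOP_SET_STATUS64");
                return 1;
        }
        close(fd);
        return 0;
}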

Thanks,
Shaohua
> The "mdadm stop is async" comment refers to the fact that
> mddev_delayed_delete is run in a work queue, possibly after "mdadm -S
> /dev/mdX" completes.
> It might also refer to the fact that udev subsequently deletes things
> from /dev, possibly after a further delay.
> 
> It might be possible to remove the need for mddev_delayed_delete if we
> enhance disk_release (in genhd.c) in some way so that it can drop the
> reference on the mddev (instead of mddev having to be careful when it
> drops the reference on the gendisk).
> 
> Getting mdadm to wait for udev might be easy if udev provides some sort
> of API for that.