Re: [patch]raid5: fix directio regression

NeilBrown <neilb@xxxxxxx> · Thu, 9 Aug 2012 11:32:30 +1000

On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@xxxxxxxxx> wrote:

> On 2012-08-08 20:53 Shaohua Li <shli@xxxxxxxxxx> Wrote:
> >2012/8/8 Jianpeng Ma <majianpeng@xxxxxxxxx>:
> >> On 2012-08-08 10:58 Shaohua Li <shli@xxxxxxxxxx> Wrote:
> >>>2012/8/7 Jianpeng Ma <majianpeng@xxxxxxxxx>:
> >>>> On 2012-08-07 13:32 Shaohua Li <shli@xxxxxxxxxx> Wrote:
> >>>>>2012/8/7 Jianpeng Ma <majianpeng@xxxxxxxxx>:
> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@xxxxxxxxxx> Wrote:
> >>>>>>>My 4k random-write directIO workload shows a 10~20% regression caused by commit
> >>>>>>>895e3c5c58a80bb. directIO is usually random IO, and if the request size isn't big
> >>>>>>>(which is the common case), delaying handling of the stripe has no advantage.
> >>>>>>>For big requests, the delay can still reduce IO.
> >>>>>>>
> >>>>>>>Signed-off-by: Shaohua Li <shli@xxxxxxxxxxxx>
> >>>> [snip]
> >>>>>>>--
> >>>>>> Maybe using the size to judge is not a good method.
> >>>>>> When I first sent this patch, I only wanted to control direct writes to a block device, not to a regular file.
> >>>>>> Because I think if someone uses direct writes to a block device on raid5, he should know the features of raid5 and he can arrange
> >>>>>> his writes to be full-writes.
> >>>>>> But at that time, I didn't know how to differentiate between a regular file and a block device.
> >>>>>> I think we should do something about this.
> >>>>>
> >>>>>I don't think it's possible for the user to control his writes to be
> >>>>>full-writes, even for raw disk IO. Why does regular file versus block
> >>>>>device IO matter here?
> >>>>>
> >>>>>Thanks,
> >>>>>Shaohua
> >>>> Another problem is the size. How do we judge whether the size is large or not?
> >>>> A write syscall is one dio, and a dio may be split into several bios.
> >>>> For my workload, I usually write chunk-size.
> >>>> But your patch judges by bio size.
> >>>
> >>>I'd ignore workloads which do sequential directIO. Yours
> >>>is one, but I bet no real workloads are. So I'd like
> >> Sorry, maybe my explanation was not correct. I write data in one go, with a size of almost chunk-size * devices, in order to get a full-write
> >> and to avoid pre-read operations as far as possible.
> >>>only to consider big size random directio. I agree the size
> >>>check is arbitrary. I can optimize it to only consider stripes
> >>>which hit two or more disks in one bio, but I'm not sure if it's
> >>>worth doing. I'm not aware that big size directio is common, and even
> >>>if it is, big size request IOPS is low, so a bit of delay may not be a big
> >>>deal.
> >> What about adding an acc_time to 'stripe_head' to control this?
> >> When get_active_stripe() succeeds, update acc_time.
> >> If a stripe_head has not been accessed for some time, it should pre-read.
> >
> >Do you want to add a timer for each stripe? This is even uglier.
> >How do you choose the expiry time? A time that works for a hard disk
> >definitely will not work for a fast SSD.
> A time is arbitrary, just like the size.
> How about adding an interface in sysfs so the user can control it?
> Only the user can judge the workload, whether it is sequential write or random write.

This is getting worse by the minute.  A sysfs interface for this is
definitely not a good idea.

The REQ_NOIDLE flag is a pretty clear statement that no more requests that
merge with this one are expected.  If some use case sends random requests,
maybe it should be setting REQ_NOIDLE.

Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
include REQ_NOIDLE.  Understanding that would help understand the current
problem.
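
To make the flag relationship concrete, here is a minimal userspace sketch.
It is not kernel code.  The bit positions are made up for illustration, and
expedite_stripe() and expedite_stripe_old() are hypothetical helper names,
not functions in raid5.c.  The two composite definitions reflect my reading
of include/linux/fs.h at the time, and the "old" helper reflects my
understanding that the check was keyed on REQ_SYNC before 895e3c5c58a80bb.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions only; the kernel assigns its own values. */
#define REQ_WRITE   (1UL << 0)
#define REQ_SYNC    (1UL << 1)
#define REQ_NOIDLE  (1UL << 2)

static_assert((REQ_SYNC & REQ_NOIDLE) == 0, "flags must be distinct bits");

/* Composite write types, as I read include/linux/fs.h (assumption). */
#define WRITE_SYNC     (REQ_WRITE | REQ_SYNC | REQ_NOIDLE)
#define WRITE_ODIRECT  (REQ_WRITE | REQ_SYNC)

/*
 * Hypothetical helper: handle the stripe immediately only when the
 * submitter has said that no further mergeable requests are expected.
 */
static bool expedite_stripe(unsigned long bi_rw)
{
	return (bi_rw & REQ_NOIDLE) != 0;
}

/*
 * Hypothetical helper: the earlier behaviour as I understand it, where
 * any synchronous request was expedited.
 */
static bool expedite_stripe_old(unsigned long bi_rw)
{
	return (bi_rw & REQ_SYNC) != 0;
}
```

With these definitions a WRITE_SYNC request is expedited under both checks,
while a WRITE_ODIRECT request is expedited only under the old one.  A
submitter that knows its O_DIRECT writes are random could OR in REQ_NOIDLE
and get the immediate handling back without any size heuristic.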

NeilBrown