Re: [PATCH] bcache: consider the fragmentation when update the writeback rate

Dongdong Tao <dongdong.tao@xxxxxxxxxxxxx> · Thu, 14 Jan 2021 23:35:16 +0800

Hi Coly,

Apologies for any confusion I might have caused, and thanks a lot
for your patience and your help!

On Thu, Jan 14, 2021 at 9:31 PM Coly Li <colyli@xxxxxxx> wrote:
>
> On 1/14/21 8:22 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > Why do you limit the iodepth to 8 and IOPS to 150 on the cache device?
> > For a cache device this limit is small. 150 IOPS with a 4KB block size
> > means writing (150*4*60*60=2160000KB=) about 2GB of data every hour.
> > For 35 hours it is only about 70GB.
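
A quick check of the arithmetic quoted above, as a minimal sketch. The
variable names are only illustrative, and decimal units (1 GB = 1000000 KB)
are an assumption, since the thread does not say which units are meant:

```shell
#!/bin/sh
# Data written by the rate-limited fio job, using the figures quoted above.
iops=150      # --rate_iops=150
bs_kb=4       # --bs=4k
hours=35      # approximate length of the Tab2 run

# KB written per hour: IOPS * block size * 60 s * 60 min
kb_per_hour=$((iops * bs_kb * 60 * 60))

# Total over the whole run, in whole decimal GB (assumed unit)
total_gb=$((kb_per_hour * hours / 1000 / 1000))

echo "per hour: ${kb_per_hour} KB"
echo "over ${hours} hours: about ${total_gb} GB"
```

This gives 2160000 KB per hour and about 75 GB over 35 hours, in line with
the rounded 2GB and 70GB figures above.
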
> >
> >
> > What if the iodepth is 128 or 64, and there is no IOPS rate limit?
> > -> There are two reasons why I limit the iodepth and IOPS rate.
> > 1. If I don't limit them, the dirty cache fills up very
> > quickly, within 20 minutes.
> >      It runs at almost NVMe speed before it reaches the
> > cutoff_writeback_sync of 70, and no kind of writeback can
> > stop it from
> >      filling up, because of the huge throughput gap between NVMe and
> > HDD.
> >      I don't think there is anything we can do about that, and it should
> > only happen in a benchmark, not in production.
> >      The improvement I'm trying to make here is for normal
> > production workloads, not for this benchmark scenario.
> >      I currently can't see any need to test this scenario; please
> > kindly let me know if I'm wrong.
> >
> > 2. I set the iodepth to 8 and IOPS to 150 based on what I have
> > observed in production environments, mostly ceph.
> >     ceph-osd has fewer than 10 threads (default setting) that send
> > IO to bcache in parallel. But I'm not sure about other applications.
> >     I agree that we can increase the iodepth to 64 or 128, and it's
> > doable. But we have to limit the IOPS; 150 IOPS is a reasonable
> > workload.
> >     The busiest ceph-osd that I've seen does about 1000 IOPS, but the
> > average is still only about 600.
> >     I can set the IOPS to a higher value like 600 and the iodepth to
> > 128 for the later test, if that makes sense to you?
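
The follow-up run proposed above could look roughly like the sketch below.
It reuses the Tab2 fio options from the original test steps and changes only
--iodepth and --rate_iops. The job name "random-writers-fast" and the log
prefix "rw-fast" are hypothetical, and the function only prints the command
line instead of running it:

```shell
#!/bin/sh
# Print (not run) a fio command line for the higher-load latency test.
# Option names are the ones already used in this thread; the job name
# and log prefix are placeholders chosen for this example.
# $1: target device, $2: iodepth, $3: IOPS limit
fio_cmd() {
    echo "fio --name=random-writers-fast --filename=$1" \
         "--ioengine=libaio --iodepth=$2 --rw=randwrite --bs=4k" \
         "--rate_iops=$3 --direct=1 --write_lat_log=rw-fast --log_avg_msec=20"
}

fio_cmd /dev/bcache0 128 600
```

Printing the command first makes it easy to review the parameters before
starting a run that lasts many hours.
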
> >
>
> OK, with the extra information I now understand the reason. Since the
> cache device is filled up within 20 minutes, it is unnecessary to do the
> faster testing on your side. Let me do it later on my hardware.
>
>
> > Lastly, please allow me to clarify more about the production issue
> > that this patch is trying to address:
> >
> > In the production env that hit this issue, it usually takes a very
> > long time (many take days) for the cache_available_percent to drop to
> > 30, and the dirty data mostly stays at a very low level (around
> > 10 percent), which means that bcache isn't being stressed very
> > hard most of the time.
> >  There is no intention to rescue the cutoff_writeback_sync case when
> > bcache is being stressed without limitation; I hope the above makes
> > sense :)
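
The two values mentioned above can be sampled from sysfs over time. This is
a minimal sketch, assuming the usual bcache sysfs layout
(cache_available_percent under the cache set directory, dirty_data under the
backing device directory); the function name and the sampling interval are
only illustrative:

```shell
#!/bin/sh
# Print one timestamped sample of the two values discussed above.
# $1: cache set directory, e.g. /sys/fs/bcache/<cset-uuid>
# $2: backing device directory, e.g. /sys/block/bcache0/bcache
bcache_sample() {
    avail=$(cat "$1/cache_available_percent")
    dirty=$(cat "$2/dirty_data")
    echo "$(date +%s) cache_available_percent=$avail dirty_data=$dirty"
}

# Example loop (commented out; it needs a real bcache device):
# while sleep 60; do
#     bcache_sample /sys/fs/bcache/<cset-uuid> /sys/block/bcache0/bcache
# done
```
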
> >
>
> Yes, you explained that clearly before. What I worried about was whether
> a faster writeback might interfere with the throughput and latency of
> regular I/Os.
>
> From your current testing data it looks fine to me.
>
>
> > By the way, my colleague and I are trying to gather some production
> > bcache stats; I hope we can give you the performance numbers before and
> > after applying the patch.
>
> Yes that will be great.
>
> And could you please gather all the current data charts into a single
> email, and reference it in your patch via lore? Then people who don't
> subscribe to the linux-bcache mailing list can find all the posted
> performance data from your patch.
>

Sounds good, I'll update the patch comment with the reference data.
But it seems the linux mailing list doesn't accept charts?
(they have always been detected as SPAM)
I can't be sure, so I'll try to send them again; if that fails, I'll put
all the data into a google doc.

> In general your testing data is convincing IMHO, and I will add your
> updated patch for the 5.12 merge window.
>
Thank you Coly, that's great!!!

>
> Thanks.
>
> Coly Li
>
>
> >
> >
> > On Thu, Jan 14, 2021 at 6:05 PM Coly Li <colyli@xxxxxxx> wrote:
> >>
> >> On 1/14/21 12:45 PM, Dongdong Tao wrote:
> >>> Hi Coly,
> >>>
> >>> I've got the testing data for multiple threads with larger IO depth.
> >>>
> >>
> >> Hi Dongdong,
> >>
> > >> Thanks for the testing numbers.
> >>
> > >>> *Here are the testing steps:*
> > >>> 1. make-bcache -B <> -C <> --writeback
> >>>
> > >>> 2. Open two tabs and start a different fio task in each at the same time.
> > >>> In Tab1, run the fio command below:
> >>> sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> >>> --iodepth=32 --rw=randrw --blocksize=64k,8k  --direct=1 --runtime=24000
> >>>
> > >>> In Tab2, run the fio command below:
> >>> sudo fio --name=random-writers2 --filename=/dev/bcache0
> >>> --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
> >>> --direct=1 --write_lat_log=rw --log_avg_msec=20
> >>>
> >>
> >>
> > >> Why do you limit the iodepth to 8 and IOPS to 150 on the cache device?
> > >> For a cache device this limit is small. 150 IOPS with a 4KB block size
> > >> means writing (150*4*60*60=2160000KB=) about 2GB of data every hour.
> > >> For 35 hours it is only about 70GB.
> > >>
> > >>
> > >> What if the iodepth is 128 or 64, and there is no IOPS rate limit?
> >>
> >>
> > >>> Note
> > >>> - The Tab1 fio runs for 24000 seconds; it is the one that causes the
> > >>> fragmentation and makes cache_available_percent drop below 40.
> > >>> - The Tab2 fio is the one whose latency I'm capturing. I let it
> > >>> run for about 35 hours, which is long enough for
> > >>> cache_available_percent to drop below 30.
> > >>> - This testing method uses the fio benchmark with a larger read block
> > >>> size and a smaller write block size to cause high fragmentation.
> > >>> However, in a real production env, there could be
> > >>>    various reasons, or a combination of them, that cause high
> > >>> fragmentation, but I believe it should be ok to use any method to cause
> > >>> the fragmentation in order to verify whether
> > >>>    bcache with this patch responds better than master in this
> > >>> situation.
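
The latency log written by the Tab2 job (--write_lat_log=rw) can be
summarized before charting. This is a minimal sketch, assuming fio's
comma-separated log format with the latency value in the second column; the
file name rw_lat.1.log and the unit of the value (nanoseconds in recent fio
versions) are assumptions to check against the fio version in use:

```shell
#!/bin/sh
# Summarize a fio latency log: sample count, mean and max of column 2.
# Assumes lines of the form "time_ms, value, direction, bs, offset".
# $1: path to the log file (e.g. rw_lat.1.log, an assumed name)
lat_summary() {
    awk -F',' '
        { v = $2 + 0; sum += v; if (v > max) max = v; n++ }
        END { if (n) printf "samples=%d mean=%.1f max=%d\n", n, sum / n, max }
    ' "$1"
}

# Example (commented out; needs a real log file):
# lat_summary rw_lat.1.log
```

Running this on the logs from master and from master plus the patch gives
two comparable summaries in the same units.
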
> >>>
> >>> *Below is the testing result:*
> >>>
> > >>> The total run time is about 35 hours, and the charts for each run
> > >>> contain 1.5 million latency points.
> >>>
> >>> Master:
> >>> fio-lat-mater.png
> >>>
> >>> Master + patch:
> >>> fio-lat-patch.png
> > >>> Both combined:
> >>> fio-lat-mix.png
> >>>
> > >>> Now we can see that master is even worse when we increase the iodepth,
> > >>> which makes sense since the backing HDD is being stressed harder.
> >>>
> >>> *Below are the cache stats changing during the run:*
> >>> Master:
> >>> bcache-stats-master.png
> >>>
> >>> Master + the patch:
> >>> bcache-stats-patch.png
> >>>
> > >>> That's all the testing done with the 400GB NVMe with a 512B block size.
> > >>>
> > >>> Coly, do you want me to continue the same testing on a 1TB NVMe with
> > >>> different block sizes?
> > >>> Or is it ok to skip the 1TB testing and continue the test with the 400GB
> > >>> NVMe but with different block sizes?
> > >>> Feel free to let me know of any other test scenarios that we should cover
> > >>> here.
> >>
> > >> Yes please, more testing is desirable for a performance improvement.
> > >> So far I haven't seen performance numbers for a really high workload yet.
> >>
> >> Thanks.
> >>
> >> Coly Li
> >>
>