On 2020/11/10 12:19, Dongdong Tao wrote:
> [Sorry again for the SPAM detection]
>
> Thank you for the reply, Coly!
>
> I agree that this patch is not a final solution for the fragmentation
> issue, but more of a workaround to alleviate the problem. So part of
> my intention is to find out how upstream would like to fix this issue.
>
> I've looked into the code of the moving_gc part and did some
> debugging/testing; unfortunately, I don't think it is the solution for
> this issue either. moving_gc moves not only the dirty cache but also
> the clean cache, so its purpose seems to be moving data (dirty and
> clean) out of relatively empty buckets into new buckets, so as to
> reclaim the original buckets. For that purpose, I guess moving gc was
> more useful back when we usually didn't have large NVMe devices.
>
> Let's get back to the problem I'm trying to fix: you might have a lot
> of buckets (say 70 percent) that are all fully consumed, while those
> buckets contain only very little dirty data (say 10 percent). Since gc
> can't reclaim a bucket that contains any dirty data, the worst case is
> that cache_available_percent drops under 30 percent, at which point no
> write can be performed in writeback mode any more, killing write
> performance.
>
> So, unlike moving_gc, I only want to move dirty data around (as you've
> suggested :)), but I don't think it's a good idea to change the
> behaviour of moving_gc. My current idea is to implement a compaction
> thread that triggers dirty data compaction only under certain
> circumstances (e.g. when the fragmentation and the number of dirty
> buckets are both high), and this thread can be turned on/off via an
> extra option, so that people can keep the original behaviour if they
> want.
>
> This is a rough idea for now; please let me know whether the above
> makes sense to you, and any other suggestions are appreciated! I also
> understand that the hardest part is making sure the general bcache
> performance and functionality still look sane, so it may require much
> more time, and it is more likely a feature at this point.
>
> How to reproduce and observe this issue:
> The issue is very easy to reproduce by running the fio command below
> against a writeback-mode bcache device:
>
> fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> --iodepth=4 --rw=randrw --rate_iops=95,5 --bs=4k --direct=1
> --numjobs=4
>
> Note that the key option for reproducing this issue is
> "rate_iops=95,5", i.e. 95 percent reads and only 5 percent writes,
> which makes sure each bucket ends up containing only very little dirty
> data. Also, it's faster to reproduce this with a small cache device; I
> use a 1GB cache, but it's the same for a bigger cache device, just a
> matter of time.
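> As a rough illustration (the numbers here are only an example, and
> assume the default 512 KiB bucket size): a 1 GB cache has roughly
> 2000 buckets; if the random 4k writes leave each full bucket with
> only ~13 dirty 4k blocks (about 10 percent of the bucket), then ~1400
> such buckets (70 percent of the cache) pin only around 70 MiB of
> dirty data, yet cache_available_percent still drops to about 30 while
> data_dirty stays small.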
>
> We can observe this issue by monitoring the bcache stats "data_dirty"
> and "cache_available_percent"; after cache_available_percent has
> dropped to 30 percent, we can observe that write performance is hugely
> degraded, using the bpftrace script below:
> ---
> #!/usr/bin/env bpftrace
>
> #include <linux/bio.h>
>
> kprobe:cached_dev_make_request
> {
>         @start[arg1] = nsecs;
> }
>
> kprobe:bio_endio /@start[arg0]/
> {
>         if (((struct bio *)arg0)->bi_opf & 1) {
>                 @write = hist(nsecs - @start[arg0]); delete(@start[arg0]);
>         } else {
>                 @read = hist(nsecs - @start[arg0]); delete(@start[arg0]);
>         }
> }
> ---
>
> To run this script:
> Save the above to bcache_io_lat.bt, then run it with
> chmod +x bcache_io_lat.bt && ./bcache_io_lat.bt
>
> By the way, we mainly hit this issue on Ceph; the fio reproducer is
> just an easy way to reproduce it.
>

Hi Dongdong,

I know this situation; it is not the first time it has been mentioned.
What is the performance number that your patch gains? I want to see an
"observable and reproducible performance number", especially the
latency and IOPS for regular I/O requests.

Thanks.

Coly Li

> On Fri, Nov 6, 2020 at 12:32 AM Coly Li <colyli@xxxxxxx> wrote:
>>
>> On 2020/11/3 20:42, Dongdong Tao wrote:
>>> From: dongdong tao <dongdong.tao@xxxxxxxxxxxxx>
>>>
>>> The current way of calculating the writeback rate only considers the
>>> dirty sectors. This usually works fine when the fragmentation is not
>>> high, but it gives an unreasonably small rate when we are in a
>>> situation where very few dirty sectors have consumed a lot of dirty
>>> buckets. In some cases, the dirty buckets can reach
>>> CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even
>>> reached writeback_percent; the writeback rate will still be the
>>> minimum value (4k), so all writes get stuck in non-writeback mode
>>> because of the slow writeback.
>>>
>>> This patch tries to accelerate the writeback rate when the
>>> fragmentation is high. It calculates the proportional_scaled value
>>> as below:
>>> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
>>> As we can see, higher fragmentation results in a larger
>>> proportional_scaled value and thus a larger writeback rate.
>>> The fragment value is calculated as below:
>>> (dirty_buckets * bucket_size) / dirty_sectors
>>> If you think about it, the value of fragment will always be
>>> within [1, bucket_size].
>>>
>>> This patch only considers the fragmentation once the number of
>>> dirty_buckets has reached a dirty threshold (configurable via
>>> writeback_fragment_percent, default 50), so bcache keeps the
>>> original behaviour before the dirty buckets reach the threshold.
>>>
>>> Signed-off-by: dongdong tao <dongdong.tao@xxxxxxxxxxxxx>
>>
>> Hi Dongdong,
>>
>> Changing the writeback rate does not affect the real throughput;
>> your change only increases the upper limit hint of the writeback
>> throughput, and the bottleneck is the spinning drive for random I/O.
>>
>> A better direction would be the moving gc. If the moving gc could
>> work faster, the situation you mentioned above would be relaxed a
>> lot.
>>
>> I will NACK this patch unless you have an observable and reproducible
>> performance number.
>>
>> Thanks.
>>
>> Coly Li
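For reference, a minimal standalone sketch of the calculation described
in the quoted patch above; the names, types and example numbers are
illustrative only, not the actual bcache code:

#include <stdint.h>
#include <stdio.h>

/* fragment = (dirty_buckets * bucket_size) / dirty_sectors, which is
 * always within [1, bucket_size] since each dirty bucket holds between
 * 1 and bucket_size dirty sectors. */
static uint64_t fragment(uint64_t dirty_buckets, uint64_t bucket_size,
                         uint64_t dirty_sectors)
{
        return (dirty_buckets * bucket_size) / dirty_sectors;
}

/* proportional_scaled = (dirty_sectors / p_term_inverse) * fragment */
static uint64_t proportional_scaled(uint64_t dirty_sectors,
                                    uint64_t dirty_buckets,
                                    uint64_t bucket_size,
                                    uint64_t p_term_inverse)
{
        return (dirty_sectors / p_term_inverse) *
               fragment(dirty_buckets, bucket_size, dirty_sectors);
}

int main(void)
{
        /* Example numbers only: ~1400 dirty buckets of 1024 sectors
         * (512 KiB) holding ~150000 dirty sectors, with the default
         * writeback_rate_p_term_inverse of 40. */
        uint64_t p = proportional_scaled(150000, 1434, 1024, 40);

        printf("proportional term: %llu sectors\n",
               (unsigned long long)p);
        return 0;
}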