On 2020/11/10 12:19, Dongdong Tao wrote:
> [Sorry again for the SPAM detection]
>
> Thank you for the reply, Coly!
>
> I agree that this patch is not a final solution for the fragmentation
> issue, but more of a workaround to alleviate the problem. So part of
> my intention is to find out how upstream would like to fix this issue.
>
> I've looked into the code of the moving_gc part and did some
> debugging/testing; unfortunately, I don't think it is the solution for
> this issue either. moving_gc moves not only the dirty cache but also
> the clean cache, so its purpose seems to be moving data (dirty and
> clean) out of relatively empty buckets into new buckets, so as to
> reclaim the original buckets. For that purpose, I guess moving gc was
> more useful back when we usually didn't have large NVMe devices.
>
> Let's get back to the problem I'm trying to fix: you might have a lot
> of buckets (say 70 percent) that are all fully consumed, while those
> buckets contain only very little dirty data (say 10 percent). Since gc
> can't reclaim a bucket that contains any dirty data, the worst case is
> that cache_available_percent drops under 30 percent, at which point no
> write can be performed in writeback mode any more, killing write
> performance.
>
> So, unlike moving_gc, I only want to move dirty data around (as you've
> suggested :)), but I don't think it's a good idea to change the
> behaviour of moving_gc. My current idea is to implement a compaction
> thread that triggers dirty data compaction only under certain
> circumstances (e.g. when the fragmentation and the number of dirty
> buckets are both high), and this thread can be turned on/off via an
> extra option, so that people can keep the original behaviour if they
> want.
>
> This is a rough idea for now; please let me know whether the above
> makes sense to you, and any other suggestions are appreciated! I also
> understand that the hardest part is making sure the general bcache
> performance and functionality still look sane, so it may require much
> more time, and it is more likely a feature at this point.
>
> How to reproduce and observe this issue:
> The issue is very easy to reproduce by running the fio command below
> against a writeback-mode bcache device:
>
> fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> --iodepth=4 --rw=randrw --rate_iops=95,5 --bs=4k --direct=1
> --numjobs=4
>
> Note that the key option for reproducing this issue is
> "rate_iops=95,5", i.e. 95 percent reads and only 5 percent writes,
> which makes sure each bucket ends up containing only very little dirty
> data. Also, it's faster to reproduce this with a small cache device; I
> use a 1GB cache, but it's the same for a bigger cache device, just a
> matter of time.
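> As a rough illustration (the numbers here are only an example, and
> assume the default 512 KiB bucket size): a 1 GB cache has roughly
> 2000 buckets; if the random 4k writes leave each full bucket with
> only ~13 dirty 4k blocks (about 10 percent of the bucket), then ~1400
> such buckets (70 percent of the cache) pin only around 70 MiB of
> dirty data, yet cache_available_percent still drops to about 30 while
> data_dirty stays small.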
>
> We can observe this issue by monitoring the bcache stats "data_dirty"
> and "cache_available_percent"; after cache_available_percent has
> dropped to 30 percent, we can observe that write performance is hugely
> degraded, using the bpftrace script below:
> ---
> #!/usr/bin/env bpftrace
>
> #include <linux/bio.h>
>
> kprobe:cached_dev_make_request
> {
>         @start[arg1] = nsecs;
> }
>
> kprobe:bio_endio /@start[arg0]/
> {
>         if (((struct bio *)arg0)->bi_opf & 1) {
>                 @write = hist(nsecs - @start[arg0]); delete(@start[arg0]);
>         } else {
>                 @read = hist(nsecs - @start[arg0]); delete(@start[arg0]);
>         }
> }
> ---
>
> To run this script:
> Save the above to bcache_io_lat.bt, then run it with
> chmod +x bcache_io_lat.bt && ./bcache_io_lat.bt
>
> By the way, we mainly hit this issue on Ceph; the fio reproducer is
> just an easy way to reproduce it.
>

Hi Dongdong,

I know this situation; it is not the first time it has been mentioned.
What is the performance number that your patch gains? I want to see an
"observable and reproducible performance number", especially the
latency and IOPS for regular I/O requests.

Thanks.

Coly Li

> On Fri, Nov 6, 2020 at 12:32 AM Coly Li <colyli@xxxxxxx> wrote:
>>
>> On 2020/11/3 20:42, Dongdong Tao wrote:
>>> From: dongdong tao <dongdong.tao@xxxxxxxxxxxxx>
>>>
>>> The current way of calculating the writeback rate only considers the
>>> dirty sectors. This usually works fine when the fragmentation is not
>>> high, but it gives an unreasonably small rate when we are in a
>>> situation where very few dirty sectors have consumed a lot of dirty
>>> buckets. In some cases, the dirty buckets can reach
>>> CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even
>>> reached writeback_percent; the writeback rate will still be the
>>> minimum value (4k), so all writes get stuck in non-writeback mode
>>> because of the slow writeback.
>>>
>>> This patch tries to accelerate the writeback rate when the
>>> fragmentation is high. It calculates the proportional_scaled value
>>> as below:
>>> (dirty_sectors / writeback_rate_p_term_inverse) * fragment
>>> As we can see, higher fragmentation results in a larger
>>> proportional_scaled value and thus a larger writeback rate.
>>> The fragment value is calculated as below:
>>> (dirty_buckets * bucket_size) / dirty_sectors
>>> If you think about it, the value of fragment will always be
>>> within [1, bucket_size].
>>>
>>> This patch only considers the fragmentation once the number of
>>> dirty_buckets has reached a dirty threshold (configurable via
>>> writeback_fragment_percent, default 50), so bcache keeps the
>>> original behaviour before the dirty buckets reach the threshold.
>>>
>>> Signed-off-by: dongdong tao <dongdong.tao@xxxxxxxxxxxxx>
>>
>> Hi Dongdong,
>>
>> Changing the writeback rate does not affect the real throughput;
>> your change only increases the upper limit hint of the writeback
>> throughput, and the bottleneck is the spinning drive for random I/O.
>>
>> A better direction would be the moving gc. If the moving gc could
>> work faster, the situation you mentioned above would be relaxed a
>> lot.
>>
>> I will NACK this patch unless you have an observable and reproducible
>> performance number.
>>
>> Thanks.
>>
>> Coly Li
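For reference, a minimal standalone sketch of the calculation described
in the quoted patch above; the names, types and example numbers are
illustrative only, not the actual bcache code:

#include <stdint.h>
#include <stdio.h>

/* fragment = (dirty_buckets * bucket_size) / dirty_sectors, which is
 * always within [1, bucket_size] since each dirty bucket holds between
 * 1 and bucket_size dirty sectors. */
static uint64_t fragment(uint64_t dirty_buckets, uint64_t bucket_size,
                         uint64_t dirty_sectors)
{
        return (dirty_buckets * bucket_size) / dirty_sectors;
}

/* proportional_scaled = (dirty_sectors / p_term_inverse) * fragment */
static uint64_t proportional_scaled(uint64_t dirty_sectors,
                                    uint64_t dirty_buckets,
                                    uint64_t bucket_size,
                                    uint64_t p_term_inverse)
{
        return (dirty_sectors / p_term_inverse) *
               fragment(dirty_buckets, bucket_size, dirty_sectors);
}

int main(void)
{
        /* Example numbers only: ~1400 dirty buckets of 1024 sectors
         * (512 KiB) holding ~150000 dirty sectors, with the default
         * writeback_rate_p_term_inverse of 40. */
        uint64_t p = proportional_scaled(150000, 1434, 1024, 40);

        printf("proportional term: %llu sectors\n",
               (unsigned long long)p);
        return 0;
}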