On 1/8/21 4:30 PM, Dongdong Tao wrote:
> Hi Coly,
>
> They are captured over the same time span; it is the meaning of the
> timestamp and the time unit on the x-axis that differ.
> (Sorry, I should have clarified this right after the charts.)
>
> For the latency chart:
> The timestamp is the relative time since the beginning of the
> benchmark, so the start timestamp is 0 and the unit is milliseconds.
>
> For the dirty data and cache available percent chart:
> The timestamp is the UNIX timestamp in seconds; I capture the stats
> every 5 seconds with the script below:
> ---
> #!/bin/sh
> # Append one CSV line every 5 seconds:
> # <unix_ts>, <dirty_data>, <cache_available_percent>, <writeback_rate>
> while true; do
>         echo "`date +%s`, `cat /sys/block/bcache0/bcache/dirty_data`, `cat /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat /sys/block/bcache0/bcache/writeback_rate`" >> "$1"
>         sleep 5
> done
> ---
>
> Unfortunately, I can't easily make them use the same timestamp, but I
> can try to convert the UNIX timestamps to relative time like the
> first chart.
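>
> For illustration, a tiny C helper along these lines could do that
> conversion (an untested sketch; it reads the CSV produced by the
> script above on stdin and rebases the first column to milliseconds
> since the first sample):
> ---
> #include <stdio.h>
>
> int main(void)
> {
>         char rest[256];
>         long ts, start = -1;
>
>         /* each input line: "<unix_ts>, <dirty>, <percent>, <rate>" */
>         while (scanf("%ld,%255[^\n]", &ts, rest) == 2) {
>                 if (start < 0)
>                         start = ts;
>                 /* print milliseconds since the first sample */
>                 printf("%ld,%s\n", (ts - start) * 1000L, rest);
>         }
>         return 0;
> }
> ---
> (Usage, with made-up file names: ./rebase <stats.csv >stats-rel.csv)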
>
> But even if we ignore the values on the X-axis, we can still roughly
> compare the charts by the length of the X-axis, since they cover the
> same time span, and we can see that the master's writes start hitting
> the backing device when cache_available_percent drops to around 30.

Copied, thanks for the explanation. The chart for a single thread with
io depth 1 is convincing IMHO :-)

One more question: the benchmark uses a single I/O thread with io depth
1, which is not a typical condition for a real workload. Do you plan to
test the latency and IOPS for multiple threads with a larger I/O depth?

Thanks.

Coly Li

> On Fri, Jan 8, 2021 at 12:06 PM Coly Li <colyli@xxxxxxx> wrote:
>>
>> On 1/7/21 10:55 PM, Dongdong Tao wrote:
>>> Hi Coly,
>>>
>>> Thanks for the reminder. I understand that the rate is only a hint
>>> of the throughput; it is a value used to calculate the sleep time
>>> between each round of key writeback. The higher the rate, the
>>> shorter the sleep time, and most of the time that means more dirty
>>> keys can be written back in a given amount of time before the hard
>>> disk runs out of speed.
>>>
>>> Here is the testing data from a run on a 400GB NVMe cache plus a
>>> 1TB HDD backing device:
>>
>> Hi Dongdong,
>>
>> Nice charts :-)
>>
>>> Steps:
>>>
>>> 1. make-bcache -B <HDD> -C <NVME> --writeback
>>>
>>> 2. sudo fio --name=random-writers --filename=/dev/bcache0
>>>    --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
>>>    --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
>>>
>>> The fio benchmark ran for about 20 hours.
>>
>> The time lengths of the first 3 charts are 7.000e+7, and of the rest
>> 1.60930e+9. I guess the time length of the I/O latency charts is
>> 1/100 of the rest.
>>
>> Can you also post the latency charts for 1.60930e+9 seconds? Then I
>> can compare the latency with the dirty data and available cache
>> charts.
>>
>> Thanks.
>>
>> Coly Li
>>
>>> Let's have a look at the write latency first:
>>>
>>> Master: [chart]
>>>
>>> Master + the patch: [chart]
>>>
>>> Combined: [chart]
>>>
>>> Again, the latency (y-axis) is in nanoseconds and the x-axis is the
>>> timestamp in milliseconds. As we can see, the master latency is
>>> obviously much higher than with my patch once the master bcache
>>> hits the cutoff writeback sync, and the master is not going to get
>>> out of that state: the graph shows it already stuck at the cutoff
>>> writeback sync for about 4 hours before I finished the test, and it
>>> might stay stuck for days before getting out by itself.
>>>
>>> Note that there are 1 million points for each run; red represents
>>> master, green represents master + my patch. Most of them overlap,
>>> so the graph may look like it has more red points than green after
>>> hitting the cutoff, but that is simply because the latency has
>>> scaled up to a bigger range, which reflects the HDD latency.
>>>
>>> Let's also have a look at bcache's cache available percent and
>>> dirty data percent.
>>>
>>> Master: [chart]
>>>
>>> Master + this patch: [chart]
>>>
>>> As you can see, this patch avoids hitting the cutoff writeback
>>> sync.
>>>
>>> As for the improvement of this patch over the first version, let's
>>> look at how the writeback rate changes during the run.
>>>
>>> Patch V1: [chart]
>>>
>>> Patch V2: [chart]
>>>
>>> The Y-axis is the value of the rate. V1 is very aggressive: it
>>> jumps instantly from the minimum of 8 to around 10 million. Patch
>>> V2 keeps the rate under 5000 during the run, and after the first
>>> round of writeback it stays even under 2500. This shows we don't
>>> need to be as aggressive as V1 to get out of the high-fragmentation
>>> situation that eventually causes all writes to hit the backing
>>> device. This looks very reasonable to me now.
>>>
>>> Note that the fio command I used consumes buckets quite
>>> aggressively, so the run had to reach the third stage, which has
>>> the highest aggressiveness. I believe a real production env won't
>>> consume buckets that aggressively, so I expect stage 3 to be needed
>>> only rarely.
>>>
>>> As discussed, I'll run multiple-block-size testing on at least a
>>> 1TB NVMe device later, but it might take some time.
>>>
>>> Regards,
>>> Dongdong
>>>
>>> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@xxxxxxx> wrote:
>>>
>>> On 1/5/21 11:44 AM, Dongdong Tao wrote:
>>> > Hey Coly,
>>> >
>>> > This is the second version of the patch; please allow me to
>>> > explain it a bit:
>>> >
>>> > We accelerate the rate in 3 stages with different aggressiveness.
>>> > The first stage starts when the dirty bucket percentage rises
>>> > above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
>>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
>>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default, the first
>>> > stage tries to write back the amount of dirty data in one bucket
>>> > (on average) in (1 / (dirty_buckets_percent - 50)) seconds, the
>>> > second stage in (1 / (dirty_buckets_percent - 57)) * 200
>>> > milliseconds, and the third stage in
>>> > (1 / (dirty_buckets_percent - 64)) * 20 milliseconds.
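>>> >
>>> > To put rough numbers on those targets (just an illustration with
>>> > the default terms): at dirty_buckets_percent = 55, stage 1 aims
>>> > for 1 / (55 - 50) = 0.2 seconds per bucket; at 60, stage 2 aims
>>> > for (1 / (60 - 57)) * 200 ms ≈ 67 ms per bucket; at 70, stage 3
>>> > aims for (1 / (70 - 64)) * 20 ms ≈ 3.3 ms per bucket.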
>>> >
>>> > As we can see, there are two strategies for increasing the
>>> > writeback aggressiveness. The first works across stages: the
>>> > first stage is the easy-going phase, whose initial rate tries to
>>> > write back the dirty data of one bucket in 1 second; the second
>>> > stage is a bit more aggressive, with an initial rate that tries
>>> > to write back one bucket's dirty data in 200 ms; and the last
>>> > stage is more aggressive still, with an initial rate that tries
>>> > to write back one bucket's dirty data in 20 ms. This makes sense:
>>> > if the preceding stage could not bring the fragmentation back to
>>> > a fine state, the next stage should raise the aggressiveness
>>> > accordingly, and the later stages are also closer to
>>> > bch_cutoff_writeback_sync. The second strategy works within each
>>> > stage, as the dirty bucket percentage increases: while the first
>>> > strategy sets the initial writeback rate of each stage, this one
>>> > scales the rate up from that initial value, as initial_rate *
>>> > (dirty_bucket_percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
>>> >
>>> > The initial rates can be controlled by 3 parameters:
>>> > writeback_rate_fp_term_low, writeback_rate_fp_term_mid, and
>>> > writeback_rate_fp_term_high. They default to 1, 5, and 50, and
>>> > users can adjust them based on their needs.
>>> >
>>> > The reason I chose 50, 57 and 64 as the threshold values is that
>>> > GC must be triggered at least once during each stage, because
>>> > "sectors_to_gc" is set to 1/16 (6.25%) of the total cache size.
>>> > The hope is that the first and second stages can get us back into
>>> > good shape in most situations by smoothly writing back the dirty
>>> > data without putting too much stress on the backing devices,
>>> > though we might still enter the third stage if bucket consumption
>>> > is very aggressive.
>>> >
>>> > This patch uses (dirty / dirty_buckets) * fp_term to calculate
>>> > the rate. The formula means we want to write back
>>> > (dirty / dirty_buckets) sectors in 1/fp_term seconds, where
>>> > fp_term is produced by the aggressiveness controller above,
>>> > "dirty" is the current number of dirty sectors, and
>>> > "dirty_buckets" is the current number of dirty buckets, so
>>> > (dirty / dirty_buckets) is the average number of dirty sectors
>>> > per bucket (between 0 and 1024 with the default settings). The
>>> > formula is basically a hint to reclaim one bucket every 1/fp_term
>>> > seconds. With this semantic, the writeback rate decreases as the
>>> > amount of dirty data decreases, which compensates for the fact
>>> > that the dirty bucket count only grows until GC happens; see the
>>> > sketch below.
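>>> >
>>> > To make that concrete, here is a rough userspace model of the
>>> > controller (illustrative only -- it is not the kernel code, and
>>> > the real patch's boundary handling and clamping may differ):
>>> > ---
>>> > #include <stdint.h>
>>> >
>>> > #define FRAG_THRESHOLD_LOW  50  /* dirty buckets %, stage 1 */
>>> > #define FRAG_THRESHOLD_MID  57  /* stage 2 */
>>> > #define FRAG_THRESHOLD_HIGH 64  /* stage 3 */
>>> >
>>> > /* defaults of writeback_rate_fp_term_{low,mid,high} */
>>> > static const int64_t term_low = 1, term_mid = 5, term_high = 50;
>>> >
>>> > /*
>>> >  * dirty: dirty sectors; dirty_buckets: dirty buckets;
>>> >  * pct: dirty buckets as a percentage of all buckets.
>>> >  * (dirty / dirty_buckets) is the average dirty sectors per
>>> >  * bucket, so rate = avg_per_bucket * fp_term asks for roughly
>>> >  * one bucket's dirty data every 1/fp_term seconds.
>>> >  */
>>> > static int64_t fragment_rate(int64_t dirty, int64_t dirty_buckets,
>>> >                              int64_t pct)
>>> > {
>>> >         int64_t fp_term;
>>> >
>>> >         if (pct <= FRAG_THRESHOLD_LOW || dirty_buckets == 0)
>>> >                 return 0;  /* controller not engaged */
>>> >
>>> >         if (pct > FRAG_THRESHOLD_HIGH)      /* stage 3 */
>>> >                 fp_term = term_high * (pct - FRAG_THRESHOLD_HIGH);
>>> >         else if (pct > FRAG_THRESHOLD_MID)  /* stage 2 */
>>> >                 fp_term = term_mid * (pct - FRAG_THRESHOLD_MID);
>>> >         else                                /* stage 1 */
>>> >                 fp_term = term_low * (pct - FRAG_THRESHOLD_LOW);
>>> >
>>> >         return (dirty * fp_term) / dirty_buckets;
>>> > }
>>> > ---
>>> > Plugging in the earlier example: at pct = 60 with an average of
>>> > 1024 dirty sectors per bucket, fp_term = 5 * (60 - 57) = 15, so
>>> > the rate requests one bucket (~1024 sectors) every 1/15 s, about
>>> > 67 ms, matching the stage-2 target.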
>>> >
>>> > *Compared to the first patch:*
>>> > The first patch tried to write back all the dirty data in 40
>>> > seconds, which results in a very high writeback rate when the
>>> > amount of dirty data is big; this is mostly the case for large
>>> > cache devices. The basic problem is that the semantic of that
>>> > patch is not ideal: we don't really need to write back all the
>>> > dirty data to solve this issue, and an instant large jump of the
>>> > rate is something I feel we should avoid (I like things to change
>>> > smoothly unless there is no choice :)).
>>> >
>>> > Before I arrived at this new patch (which I believe should be
>>> > optimal for me atm), there were many tuning/testing iterations.
>>> > E.g., I tried tuning the algorithm to write back ⅓ of the dirty
>>> > data in a certain number of seconds, to write back 1/fragment of
>>> > the dirty data in a certain number of seconds, and to write back
>>> > only the dirty data in the error buckets (error buckets = dirty
>>> > buckets - 50% of the total buckets) in a certain amount of time.
>>> > However, those all turned out not to be ideal; only the semantic
>>> > of this patch makes sense to me and lets me control the rate in a
>>> > precise way.
>>> >
>>> > *Testing data:*
>>> > I'll provide visualized testing data in the next couple of days
>>> > with a 1TB NVMe cache device but with an HDD as the backing
>>> > device, since that is what we mostly use in production.
>>> > I have the data for the 400GB NVMe; let me prepare it for you to
>>> > review.
>>> [snipped]
>>>
>>> Hi Dongdong,
>>>
>>> Thanks for the update and the continuous effort on this idea.
>>>
>>> Please keep in mind that the writeback rate is only an advisory
>>> rate for the writeback throughput; under a real workload, changing
>>> the writeback rate number does not change the writeback throughput
>>> in an obvious way.
>>>
>>> Currently I feel this is an interesting and promising idea for your
>>> patch, but I am not able to say whether it will take effect under a
>>> real workload, so we do need convincing performance data on a real
>>> workload and configuration.
>>>
>>> Of course I may also help with the benchmark, but my to-do list is
>>> long enough and it may involve a very long delay.
>>>
>>> Thanks.
>>>
>>> Coly Li