[Sorry for the Spam] Here is the testing result on a 400G NVMe; I'll get a
1TB NVMe later:
https://docs.google.com/document/d/1MmZBWfLRIX7_NfX4tGpWxOj31oIN5NuhNfCFzrIAP48/edit?usp=sharing

On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@xxxxxxx> wrote:
>
> On 1/5/21 11:44 AM, Dongdong Tao wrote:
> > Hey Coly,
> >
> > This is the second version of the patch; please allow me to explain it
> > a bit:
> >
> > We accelerate the rate in three stages with different aggressiveness.
> > The first stage starts when the dirty-bucket percentage rises above
> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second above
> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third above
> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default, the first
> > stage tries to write back the amount of dirty data in one bucket (on
> > average) in (1 / (dirty_buckets_percent - 50)) seconds, the second
> > stage in (1 / (dirty_buckets_percent - 57)) * 200 milliseconds, and
> > the third stage in (1 / (dirty_buckets_percent - 64)) * 20
> > milliseconds.
> >
> > As we can see, there are two strategies for increasing the writeback
> > aggressiveness. The first is tied to the stage itself: the first stage
> > is the easy-going phase, whose initial rate tries to write back one
> > bucket's worth of dirty data in 1 second; the second stage is more
> > aggressive, with an initial rate that tries to do it in 200 ms; and
> > the last stage is more aggressive still, with an initial rate that
> > tries to do it in 20 ms. This makes sense: if the preceding stage
> > could not bring the fragmentation back to a healthy level, the next
> > stage should raise the aggressiveness accordingly, and the later
> > stages are also closer to bch_cutoff_writeback_sync. The second
> > strategy increases the aggressiveness with the dirty-bucket percentage
> > within each stage: the first strategy sets the initial writeback rate
> > of each stage, while this one scales the rate on top of it, as
> > initial_rate * (dirty_bucket_percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> >
> > The initial rate can be controlled by three parameters:
> > writeback_rate_fp_term_low, writeback_rate_fp_term_mid, and
> > writeback_rate_fp_term_high. They default to 1, 5, and 50, and users
> > can adjust them to their needs.
> >
> > The reason I chose 50, 57, and 64 as the threshold values is that GC
> > must be triggered at least once during each stage, because
> > "sectors_to_gc" is set to 1/16 (6.25%) of the total cache size. The
> > hope is that the first and second stages can get us back into good
> > shape in most situations by smoothly writing back the dirty data
> > without putting too much stress on the backing devices, though the
> > third stage may still be entered if bucket consumption is very
> > aggressive.
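
To make the three-stage controller described above easier to follow, here is
a rough standalone sketch. It is illustrative only, not the patch code
itself: the function and macro names are made up, and only the thresholds
(50/57/64) and the default fp_term values (1/5/50) come from the description
above.

/*
 * Illustrative sketch only: pick the fragmentation term ("fp_term") for
 * the three stages described above.  All identifiers are invented for
 * this example and do not match the actual patch.
 */
#include <stdint.h>

#define FRAG_THRESHOLD_LOW   50	/* stage 1 starts above 50% dirty buckets */
#define FRAG_THRESHOLD_MID   57	/* stage 2 starts above 57% dirty buckets */
#define FRAG_THRESHOLD_HIGH  64	/* stage 3 starts above 64% dirty buckets */

/* Defaults for writeback_rate_fp_term_{low,mid,high} from the mail. */
static const int64_t fp_term_low  = 1;
static const int64_t fp_term_mid  = 5;
static const int64_t fp_term_high = 50;

/*
 * Returns 0 while the dirty-bucket percentage is at or below the first
 * threshold (no fragmentation boost).  Within each stage the term grows
 * linearly with the distance from that stage's threshold, which is the
 * second aggressiveness knob described above.
 */
static int64_t frag_fp_term(int64_t dirty_bucket_percent)
{
	if (dirty_bucket_percent <= FRAG_THRESHOLD_LOW)
		return 0;
	if (dirty_bucket_percent <= FRAG_THRESHOLD_MID)
		return fp_term_low * (dirty_bucket_percent - FRAG_THRESHOLD_LOW);
	if (dirty_bucket_percent <= FRAG_THRESHOLD_HIGH)
		return fp_term_mid * (dirty_bucket_percent - FRAG_THRESHOLD_MID);
	return fp_term_high * (dirty_bucket_percent - FRAG_THRESHOLD_HIGH);
}
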
> >
> > This patch uses (dirty / dirty_buckets) * fp_term to calculate the
> > rate. The formula means we want to write back (dirty / dirty_buckets)
> > in 1/fp_term seconds, where fp_term comes from the aggressiveness
> > controller above, "dirty" is the current number of dirty sectors, and
> > "dirty_buckets" is the current number of dirty buckets, so
> > (dirty / dirty_buckets) is the average number of dirty sectors per
> > bucket, a value between 0 and 1024 with the default setting. So the
> > formula basically gives a hint to reclaim one bucket's worth of dirty
> > data every 1/fp_term seconds. With this semantic we get a lower
> > writeback rate while the amount of dirty data is decreasing, and we
> > overcome the fact that the number of dirty buckets keeps increasing
> > until GC happens.
> >
> > Compared to the first patch:
> > The first patch tried to write back all the dirty data in 40 seconds,
> > which results in a very high writeback rate when the amount of dirty
> > data is large, as is mostly the case for large cache devices. The
> > basic problem is that the semantic of that patch is not ideal: we do
> > not really need to write back all the dirty data to solve this issue,
> > and an instant large jump in the rate is something I would rather
> > avoid (I like things to change smoothly unless there is no choice :)).
> >
> > Before arriving at this new patch (which I believe is optimal for me
> > at the moment), there were many tuning/testing iterations. For
> > example, I tried tuning the algorithm to write back 1/3 of the dirty
> > data in a certain number of seconds, to write back 1/fragment of the
> > dirty data in a certain number of seconds, and to write back only the
> > dirty data in the error buckets (error buckets = dirty buckets - 50%
> > of the total buckets) in a certain amount of time. None of these
> > turned out to be ideal; only the semantic of this patch makes sense to
> > me and lets me control the rate in a more precise way.
> >
> > Testing data:
> > I'll provide the visualized testing data in the next couple of days
> > with a 1TB NVMe cache device but with an HDD as the backing device,
> > since that is what we mostly use in production. I already have the
> > data for the 400GB NVMe; let me prepare it for you to review.
> [snipped]
>
> Hi Dongdong,
>
> Thanks for the update and the continuous effort on this idea.
>
> Please keep in mind that the writeback rate is only an advisory rate
> for the writeback throughput; in real workloads, changing the writeback
> rate number does not obviously change the writeback throughput.
>
> Currently I feel this is an interesting and promising idea for your
> patch, but I am not able to say whether it will take effect in real
> workloads, so we do need convincing performance data on a real workload
> and configuration.
>
> Of course I may also help with the benchmark, but my to-do list is long
> enough, and it may take a very long time.
>
> Thanks.
>
> Coly Li
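
To make the rate formula above concrete, here is the matching half of the
sketch, again illustrative rather than the actual patch code. frag_fp_term()
is the stage selector sketched earlier in this mail, and the parameter names
are made up. As Coly points out, the result is only an advisory rate for the
writeback thread, not a guaranteed throughput.

/*
 * Illustrative sketch only: compute the advised writeback rate, in
 * sectors per second, from the (dirty / dirty_buckets) * fp_term
 * formula described above.
 */
static int64_t frag_writeback_rate(int64_t dirty_sectors,
				   int64_t dirty_buckets,
				   int64_t dirty_bucket_percent)
{
	int64_t fp_term = frag_fp_term(dirty_bucket_percent);
	int64_t avg_dirty_per_bucket;

	if (dirty_buckets <= 0 || fp_term == 0)
		return 0;	/* below the first threshold: no boost */

	/*
	 * Average dirty sectors per dirty bucket (0..1024 with the default
	 * setting, per the description above).  Multiplying by fp_term
	 * means "reclaim one bucket's worth of dirty data every 1/fp_term
	 * seconds".
	 */
	avg_dirty_per_bucket = dirty_sectors / dirty_buckets;

	return avg_dirty_per_bucket * fp_term;
}

For example, with an average of 512 dirty sectors per bucket and 60% dirty
buckets (stage two), fp_term = 5 * (60 - 57) = 15, so the advised rate is
512 * 15 = 7680 sectors per second, roughly 3.75 MiB/s.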