Re: Writeback cache all used.

On Thu, 4 May 2023, Adriano Silva wrote:

> Hi Coly,
> 
> If I can help you with anything, please let me know.
> 
> Thanks!
> 
> 
> Guys, may I take the opportunity to ask one more question? If you 
> prefer, I'll open another topic, but since it relates to the subject 
> discussed here, I'll ask it here for now.
> 
> For now, I decided to write a bash script that changes the cache 
> parameters, as a temporary workaround to handle the issue manually in at 
> least one of my clusters.
> 
> So: in production I use cache_mode writeback, with writeback_percent set 
> to 2 (I think a low value is safer and makes a flush faster, and keeping 
> it at 10 has not shown better performance in my case; am I wrong?). I 
> keep discard set to false, since discarding each modified bucket is slow 
> (I believe the discard would need to happen in large batches of free 
> buckets). I set sequential_cutoff to 0 (zero) because with Ceph's 
> BlueStore it seems that any other value makes bcache treat everything as 
> sequential and bypass it to the backing disk. I also set 
> congested_read_threshold_us and congested_write_threshold_us to 0 (zero), 
> as that seems to give slightly better performance and lower latency. I 
> always set rotational to 1 and never change it; people say it works 
> better for Ceph, and I have used it that way ever since. I apply these 
> parameters at system startup.
> 
> So, at 01:00 I run a bash script that changes these parameters in order 
> to flush the cache, so that I can then back up my databases and other 
> data. The script sets writeback_percent to 0 (zero) so that all dirty 
> data is flushed from the cache, then keeps checking the status until it 
> reports "clean". It then switches cache_mode to writethrough and confirms 
> the cache is still "clean". Once it is "clean", it changes cache_mode to 
> "none" and then runs the following line:
> 
> echo $cache_cset > /sys/block/$bcache_device/bcache/cache/unregister

These are my notes for hot-removal of a cache.  You need to detach and 
then unregister:

~] bcache-super-show /dev/sdz1 # cache device
[...]
cset.uuid		3db83b23-1af7-43a6-965d-c277b402b16a

~] echo 3db83b23-1af7-43a6-965d-c277b402b16a > /sys/block/bcache0/bcache/detach
~] watch cat /sys/block/bcache0/bcache/dirty_data # wait until 0
~] echo 1 > /sys/fs/bcache/3db83b23-1af7-43a6-965d-c277b402b16a/unregister
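
Roughly as a script, the same sequence might look like this (just a 
sketch: the device names and the 5-second poll are assumptions, and it 
waits on the backing device's state file rather than dirty_data, since 
the detach only completes once writeback has finished):

	#!/bin/bash
	# Sketch only: detach the cache set from bcache0, wait for writeback
	# to finish, then unregister the cache device.
	CACHE_DEV=/dev/sdz1    # assumed cache device
	BCACHE=bcache0         # assumed bcache device

	CSET=$(bcache-super-show "$CACHE_DEV" | awk '/cset.uuid/ { print $2 }')

	echo "$CSET" > /sys/block/$BCACHE/bcache/detach

	# Detach finishes only after all dirty data is written back; wait for
	# the backing device to report that no cache is attached anymore.
	until grep -q 'no cache' /sys/block/$BCACHE/bcache/state; do
		sleep 5
	done

	echo 1 > /sys/fs/bcache/$CSET/unregister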


> Here ends the script that runs at 01:00 am.
> 
> So then I back up my data without those reads passing through the cache 
> and being written into it. (Am I thinking about this correctly?)
> 
> Users will continue to use the system normally, though it will be slower 
> because the Ceph OSD will be running on top of the bcache device without 
> a cache. But lower performance during that window is acceptable in my 
> case.
> 
> After the backup is complete, at 05:00 am I run the following sequence:
> 
>           wipefs -a /dev/nvme0n1p1
>           sleep 1
>           blkdiscard /dev/nvme0n1p1
>           sleep 1
>           makebcache=$(make-bcache --wipe-bcache -w 4k --bucket 256K -C /dev/$cache_device)
>           sleep 1
>           cache_cset=$(bcache-super-show /dev/$cache_device | grep cset | awk '{ print $2 }')
>           echo $cache_cset > /sys/block/bcache0/bcache/attach
> 
> One thing to point out here is the bucket size I use (256K), which I 
> chose based on the performance tests I did.

How do you test performance, is it automated?
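
For example, something like fio can be wrapped in a script for 
repeatable runs.  A hypothetical invocation (note that writing directly 
to /dev/bcache0 is destructive, so only do this before the OSD exists, 
or point --filename at a scratch file on the mounted filesystem 
instead):

	fio --name=randwrite-probe --filename=/dev/bcache0 \
	    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
	    --ioengine=libaio --direct=1 \
	    --runtime=60 --time_based --group_reporting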

> While I didn't notice any big performance differences during these 
> tests, 256K seemed to be the smallest bucket size that still performed 
> well with my NVMe device, which is an enterprise device (with a 
> non-volatile cache). However, I have no information about its minimum 
> erase block size. I could not find it anywhere: the manufacturer did not 
> tell me, and the nvme-cli tool did not show it either. Would 256K really 
> be a good number to use?

If you can automate performance testing, then you could use something like 
Simplex [1] to optimize the following values for your infrastructure:

	- bucket size # `make-bcache --bucket N`
	- /sys/block/<nvme>/queue/nr_requests	
	- /sys/block/<nvme>/queue/io_poll_delay
	- /sys/block/<nvme>/queue/max_sectors_kb
	- /sys/block/<bdev>/queue/nr_requests
	- other IO tunables? maybe: /sys/block/bcache0/bcache/
		writeback_percent
		writeback_rate
		writeback_delay

	Selfless plug: I've always wanted to tune Linux with Simplex, but 
	haven't gotten to it: [1] https://metacpan.org/pod/PDL::Opt::Simplex::Simple
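
Even without Simplex, a brute-force sweep gets you most of the way 
there once each run is automated.  A rough sketch (the device paths, 
the fio job, and the candidate values are all assumptions; it wipes the 
cache device every round, so test-node only):

	#!/bin/bash
	CACHE_DEV=/dev/nvme0n1p1          # assumed cache device
	BDEV_QUEUE=/sys/block/sdb/queue   # assumed backing device queue

	for bucket in 128k 256k 512k 1M; do
	    for nr in 128 256 512 1024; do
	        # (detach/unregister the old cache first, as shown above)
	        make-bcache --wipe-bcache -w 4k --bucket $bucket -C $CACHE_DEV >/dev/null
	        echo $nr > $BDEV_QUEUE/nr_requests
	        # ... attach the cache, let it settle, then measure.  The jq
	        # path assumes a recent fio with nanosecond JSON output.
	        lat=$(fio --name=probe --filename=/dev/bcache0 --rw=randwrite \
	                  --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
	                  --runtime=30 --time_based --output-format=json |
	              jq '.jobs[0].write.clat_ns.mean')
	        echo "bucket=$bucket nr_requests=$nr clat_ns=$lat"
	    done
	done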
 
> Anyway, after attaching the cache again, I return the parameters to what 
> I have been using in production:
> 
>           echo writeback > /sys/block/$bcache_device/bcache/cache_mode
>           echo 1 > /sys/devices/virtual/block/$bcache_device/queue/rotational
>           echo 1 > /sys/fs/bcache/$cache_cset/internal/gc_after_writeback
>           echo 1 > /sys/block/$bcache_device/bcache/cache/internal/trigger_gc
>           echo 2 > /sys/block/$bcache_device/bcache/writeback_percent
>           echo 0 > /sys/fs/bcache/$cache_cset/cache0/discard
>           echo 0 > /sys/block/$bcache_device/bcache/sequential_cutoff
>           echo 0 > /sys/fs/bcache/$cache_cset/congested_read_threshold_us
>           echo 0 > /sys/fs/bcache/$cache_cset/congested_write_threshold_us
>
> 
> I created the scripts in a test environment and they seem to have worked as expected.

Looks good.  I might try those myself...
 
> My question: Would this be a reasonable way to work around the problem 
> temporarily, as a stopgap?

If it works, then I don't see a problem except that you are evicting your 
cache.  

> Is it safe to do it this way with a mounted file 
> system, with files in use by users and databases in working order? 

Yes (at least by design, it is safe to detach bcache cdevs).  We have done 
it many times in production.
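
One quick sanity check after detaching (the state file is documented in 
Documentation/admin-guide/bcache.rst; its values are "no cache", 
"clean", "dirty", "inconsistent"):

	~] cat /sys/block/bcache0/bcache/state   # "no cache" when detached, "clean"/"dirty" when attached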

> Are there greater risks involved in putting this into production? Do you 
> see any problems or anything that could be different?

Only that you are turning on options that have not been tested much, 
simply because they are not default.  You might hit a bug... but if so, 
then report it to be fixed!

-Eric

> 
> Thanks!
> 
> 
> 
> On Thursday, May 4, 2023 at 01:56:23 BRT, Coly Li <colyli@xxxxxxx> wrote:
> 
> > On May 3, 2023, at 04:34, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> > 
> > On Thu, 20 Apr 2023, Adriano Silva wrote:
> >> I continue to investigate the situation. There is actually a performance 
> >> gain when the bcache device is only half full versus completely full: 
> >> direct-write latency is lower and more stable, which improves my 
> >> scenario.
> > 
> > Hi Coly, have you been able to look at this?
> > 
> > This sounds like a great optimization and Adriano is in a place to test 
> > this now and report his findings.
> > 
> > I think you said this should be a simple hack to add early reclaim, so 
> > maybe you can throw a quick patch together (even a rough first-pass with 
> > hard-coded reclaim values)
> > 
> > If we can get back to Adriano quickly then he can test while he has an 
> > easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> > users.
> 
> My current to-do list is a bit long. Yes, I'd like to and plan to do it, but I cannot estimate the response time.
> 
> 
> Coly Li
> 
> 
> [snipped]
> 
