Re: Writeback cache all used.

Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> · Thu, 4 May 2023 14:34:33 +0000 (UTC)

Hi Coly,

If I can help you with anything, please let me know.

Thanks!

Guys, may I take this opportunity to ask one more question? If you prefer, I'll open another thread, but since it relates to the subject discussed here, I'll ask it here for now.

For now, I decided to write a bash script that changes the cache parameters, as a temporary manual workaround for the issue in at least one of my clusters.

So: in production I use cache_mode writeback and writeback_percent set to 2 (I think a low value is safer and makes a flush faster, and staying at 10 hasn't shown better performance in my case. Am I wrong?). I set discard to false, because discarding each modified bucket individually is slow (I believe discard would need to be done in large batches of free buckets). I set sequential_cutoff to 0 (zero) because, with Ceph's BlueStore, it seems to me that with any other value bcache treats everything as sequential and bypasses it to the backing disk. I also set congested_read_threshold_us and congested_write_threshold_us to 0 (zero), as that seems to give slightly better performance and lower latency. I always set rotational to 1 and never change it; people always say it works better for Ceph, and I've been using it ever since. I apply these parameters at system startup.

So I decided to run a bash script at 01:00 that changes these parameters in order to empty the cache, so that I can then back up my databases and other data. First I set writeback_percent to 0 (zero) so that all dirty data is flushed from the cache. Then I keep checking the state until it reads "clean". I then switch cache_mode to writethrough.
Next I confirm that the cache is still "clean". If it is, I change cache_mode to "none" and then run the following line:

echo $cache_cset > /sys/block/$bcache_device/bcache/cache/unregister

Here ends the script that runs at 01:00 am.
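
For reference, a minimal sketch of the 01:00 steps described above. The function names, the BCACHE_SYSFS override and the polling limits are hypothetical and only there for illustration; the sysfs attributes are the ones mentioned in this message. It also assumes that the kernel ignores the value written to unregister, so it falls back to 1 when cache_cset is unset.

```shell
#!/bin/sh
# Sketch of the 01:00 flush-and-detach steps. BCACHE_SYSFS is a hypothetical
# override so the logic can be dry-run against a scratch directory.
bcache_device=${bcache_device:-bcache0}
BCACHE_SYSFS=${BCACHE_SYSFS:-/sys/block/$bcache_device/bcache}

# Poll "state" until it reads "clean".
# $1 = maximum number of polls, $2 = seconds to sleep between polls.
wait_until_clean() {
    polls_left=${1:-3600}
    while [ "$polls_left" -gt 0 ]; do
        [ "$(cat "$BCACHE_SYSFS/state")" = "clean" ] && return 0
        polls_left=$((polls_left - 1))
        sleep "${2:-1}"
    done
    return 1   # still dirty: the caller should leave the cache attached
}

flush_and_detach() {
    echo 0 > "$BCACHE_SYSFS/writeback_percent"       # flush all dirty data
    wait_until_clean "$@" || return 1
    echo writethrough > "$BCACHE_SYSFS/cache_mode"   # stop creating dirty data
    wait_until_clean "$@" || return 1                # confirm it stayed clean
    echo none > "$BCACHE_SYSFS/cache_mode"
    # Same write as the line above (assumption: the value itself is ignored).
    echo "${cache_cset:-1}" > "$BCACHE_SYSFS/cache/unregister"
}
```

Calling `flush_and_detach 3600 1` would poll once per second for up to an hour and stop without detaching if the state never becomes "clean".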

Then I back up my data, without the backup reads passing through the cache and filling it. (Am I thinking correctly?)

Users will continue to use the system normally, but it will be slower because the Ceph OSD will be running on top of the bcache device without a cache. In my case, lower performance during that window is acceptable.

After the backup is complete, at 05:00 am I run the following sequence:

          wipefs -a /dev/nvme0n1p1
          sleep 1
          blkdiscard /dev/nvme0n1p1
          sleep 1
          makebcache=$(make-bcache --wipe-bcache -w 4k --bucket 256K -C /dev/$cache_device)
          sleep 1
          cache_cset=$(bcache-super-show /dev/$cache_device | grep cset | awk '{ print $2 }')
          echo $cache_cset > /sys/block/bcache0/bcache/attach

One thing to point out here is the bucket size I use (256K), which I chose based on my performance tests. Although I didn't notice any big performance differences during these tests, 256K seemed to be the smallest bucket size with the best performance on my NVMe device, which is an enterprise device (with non-volatile cache). However, I don't know the device's minimum erase block size. I couldn't find this information anywhere: the manufacturer didn't tell me, and the nvme-cli tool didn't show it either. Is 256K really a good value to use?

Anyway, after attaching the cache again, I return the parameters to what I have been using in production:

          echo writeback > /sys/block/$bcache_device/bcache/cache_mode
          echo 1 > /sys/devices/virtual/block/$bcache_device/queue/rotational
          echo 1 > /sys/fs/bcache/$cache_cset/internal/gc_after_writeback
          echo 1 > /sys/block/$bcache_device/bcache/cache/internal/trigger_gc
          echo 2 > /sys/block/$bcache_device/bcache/writeback_percent
          echo 0 > /sys/fs/bcache/$cache_cset/cache0/discard
          echo 0 > /sys/block/$bcache_device/bcache/sequential_cutoff
          echo 0 > /sys/fs/bcache/$cache_cset/congested_read_threshold_us
          echo 0 > /sys/fs/bcache/$cache_cset/congested_write_threshold_us

I created the scripts in a test environment and it seems to have worked as expected.
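
A small sketch of how the restored settings could be read back after the re-attach. The helper names are hypothetical. It relies on cache_mode listing every mode with the active one in square brackets; the plain comparison is only meant for attributes that read back exactly as written, such as writeback_percent.

```shell
# Print the active cache mode. The kernel lists every mode and brackets the
# active one, e.g. "writethrough [writeback] writearound none".
active_cache_mode() {   # $1 = path to the cache_mode attribute
    sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$1"
}

# Hypothetical helper: compare a sysfs attribute with the expected value.
check_param() {   # $1 = attribute file, $2 = expected value
    if [ "$(cat "$1")" = "$2" ]; then
        echo "ok   $1"
    else
        echo "FAIL $1: expected '$2', got '$(cat "$1")'"
        return 1
    fi
}
```

For example, `check_param /sys/block/$bcache_device/bcache/writeback_percent 2` and `[ "$(active_cache_mode /sys/block/$bcache_device/bcache/cache_mode)" = writeback ]` could run at the end of the 05:00 script.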

My question: is this a reasonable way to work around the problem temporarily? Is it safe to do this with a mounted file system, with files in use by users and databases running? Are there greater risks involved in putting this into production? Do you see any problems, or anything that should be done differently?

Thanks!

On Thursday, May 4, 2023 at 01:56:23 BRT, Coly Li <colyli@xxxxxxx> wrote:

> On May 3, 2023, at 04:34, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> 
> On Thu, 20 Apr 2023, Adriano Silva wrote:
>> I continue to investigate the situation. There is actually a performance 
>> gain when the bcache device is only half filled versus full. There is a 
>> reduction and greater stability in the latency of direct writes and this 
>> improves my scenario.
> 
> Hi Coly, have you been able to look at this?
> 
> This sounds like a great optimization and Adriano is in a place to test 
> this now and report his findings.
> 
> I think you said this should be a simple hack to add early reclaim, so 
> maybe you can throw a quick patch together (even a rough first-pass with 
> hard-coded reclaim values)
> 
> If we can get back to Adriano quickly then he can test while he has an 
> easy-to-reproduce environment.  Indeed, this could benefit all bcache 
> users.

My current to-do list is rather long. Yes, I would like to do it and plan to, but I cannot estimate when.

Coly Li

[snipped]