Re: Writeback cache all used.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Apr 6, 2023, at 03:31, Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> wrote:
> 
> Hello Coly.
> 
> Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there is no activity on the cache disk, the data disk, or the bcache device, as we can see:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|     time     
>   54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:51
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:52
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:53
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:54
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:55
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:56
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:57
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:58
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:45:59
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:00
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:01
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:02
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:03
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:04
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:05
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:06
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 14:46:07
> 
> It can stay like that for hours without showing any data flow at all, read or write, on any of the devices.
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
> clean
> root@pve-00-005:~#
> 
> But look how strange: another command (priority_stats) shows that there is still 1% dirty data in the cache, and 0% unused cache space, even after hours with the server on and completely idle:
> 
> root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
> Unused:         0%
> Clean:          98%
> Dirty:          1%
> Metadata:       0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> Why is this happening?
> 
>> Can you try to write 1 to the cache set sysfs file 
>> gc_after_writeback? 
>> When it is set, a gc will be woken up automatically after 
>> all writeback is accomplished. Then most of the clean 
>> cache might be shrunk and the B+tree nodes will be 
>> reduced quite a lot.
> 
> Is this the command you asked me for?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I can already report that it did not give the expected result. The cache remains 100% occupied. Nothing changed, despite the cache being clean and my having written the command you recommended. Let's see:
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> Unused:         0%
> Clean:          98%
> Dirty:          1%
> Metadata:       0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> But if there was any movement on the disks after the command, I couldn't detect it:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|     time     
>   54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:58
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:59
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:00
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:01
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:02
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:03
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:04^C
> root@pve-00-005:~#
> 
> Why were there no changes?

Thanks for the above information. The result is unexpected to me. Let me check whether the B+tree nodes are not being shrunk; this is something that should be improved. As for write-erase time mattering for write requests: that is normally the condition under heavy write loads. In such a situation, the LBAs of the collected buckets may be allocated again very soon, even before the SSD controller finishes the internal write-erase hinted by discard/trim. Therefore issuing discard/trim right before writing to such an LBA doesn't help write performance at all, and adds extra unnecessary workload to the SSD controller.

And for today's SATA/NVMe SSDs, with the workload you describe above, the write performance drawback is almost negligible.
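While I check the B+tree shrink behavior, a garbage-collection run can also be forced by hand through the cache set's internal sysfs directory. This is only a sketch under my assumptions: the trigger_gc attribute and the UUID path below just mirror the commands earlier in this thread, so adjust them to your system.

```shell
# Sketch (assumed sysfs attribute, not verified on your kernel):
# force a garbage-collection run by hand.  The cache set UUID is the
# one from earlier in this thread; replace it with your own.
CSET=/sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd
if [ -w "$CSET/internal/trigger_gc" ]; then
    echo 1 > "$CSET/internal/trigger_gc"
    RESULT="gc triggered"
else
    RESULT="trigger_gc not available at $CSET/internal"
fi
echo "$RESULT"
```

If the attribute exists on your kernel, re-reading priority_stats afterwards should show whether gc changed the Unused/Clean split.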

> 
>> Currently there is no such option to limit bcache's 
>> in-memory B+tree node cache occupation, but when the 
>> I/O load reduces, such memory consumption may drop 
>> very fast via the reaper in the system memory 
>> management code. So it won't be a problem. Bcache will 
>> try to use any possible memory for the B+tree node 
>> cache if necessary, and throttle I/O performance to 
>> return that memory to the memory management code 
>> when available system memory is low. By default it 
>> should work well and nothing needs to be done by the user.
> 
> I've been watching the server's operation closely and I've never seen less than 50 GB of free RAM. Let's see:
> 
> root@pve-00-005:~# free
>                total        used        free      shared  buff/cache   available
> Mem:       131980688    72670448    19088648       76780    40221592    57335704
> Swap:              0           0           0
> root@pve-00-005:~#
> 
> There is always plenty of free RAM, which makes me ask: could there really be a problem related to a lack of RAM?

No, this is not caused by insufficient memory. From your information, the memory is sufficient.
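If you want to double-check how much memory the B+tree node cache is actually holding, the cache set exports a counter in its internal sysfs directory. A sketch, with the btree_cache_size attribute name and the UUID path assumed from your earlier commands:

```shell
# Sketch: read the current in-memory B+tree node cache size.
# Attribute and path are assumptions based on bcache's sysfs layout;
# replace the UUID with your own cache set's.
CSET=/sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd
if [ -r "$CSET/internal/btree_cache_size" ]; then
    BTREE=$(cat "$CSET/internal/btree_cache_size")
else
    BTREE="unavailable"
fi
echo "btree cache size: $BTREE"
```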

> 
>> Bcache doesn’t issue trim request proactively. 
>> [...]
>> At run time, bcache only forwards trim requests to the backing device (not the cache device).
> 
> Wouldn't it be advantageous if bcache periodically sent TRIM (discard) to the cache? I believe the flash drives (SSD or NVMe) typically used as bcache caches need TRIM to maintain top performance. So if the TRIM command were issued regularly by bcache in the background (only for clean and free buckets), at a controlled frequency, or even by a background task manually triggered by the user (again, only for clean and free buckets), it could help reduce the cache's write latency. I believe it would help writeback efficiency a lot. What do you think?

There was such an attempt, but it did not help at all. The reason is that bcache can only know which buckets can be discarded when they are handled by garbage collection.


> 
> Anyway, this issue of free buckets not appearing is keeping me awake at night. Could it be a problem with my kernel version (Linux 5.15)?
> 
> As I mentioned before, the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) describes a variable (freelist_percent) that was supposed to enforce a minimum rate of free buckets. Could it be a solution? I don't know. But in practice, I did not find this variable on my system (could that be because of the OS version?)

Let me look into this…

Thanks.

Coly Li


