Re: Writeback cache all used.

Hello Eric,

> The `gc_after_writeback=1` setting might not trigger until writeback
> finishes, but if writeback is already finished and there is no new IO then
> it may never trigger unless it is forced via `trigger_gc`

Yes, I tried both commands, but I didn't get the expected result: the cache remained (almost) without free space.

After executing the two commands, some dirty data was cleaned and a little free space appeared in the cache, but the amount was almost insignificant.

It stopped there, however: no further space was freed. I ran both commands again and, once more, nothing changed.
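For the record, these are the two commands in question, exactly as given further down in this thread (the cache-set UUID under /sys/fs/bcache/ is specific to each host):

```shell
# Arm gc to run automatically once writeback completes:
echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
# Force a gc run immediately:
echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
```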

Note that on all of the servers, total disk usage ranges from 185 GB to at most 203 GB on a 5.6 TB bcache device.

root@pve-00-005:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  5.57269   1.00000  5.6 TiB  185 GiB   68 GiB    6 KiB  1.3 GiB  5.4 TiB  3.25  0.95   24      up
 1    hdd  5.57269   1.00000  5.6 TiB  197 GiB   80 GiB  2.8 MiB  1.4 GiB  5.4 TiB  3.46  1.01   31      up
 2    hdd  5.57269   1.00000  5.6 TiB  203 GiB   86 GiB  2.8 MiB  1.6 GiB  5.4 TiB  3.56  1.04   30      up
 3    hdd  5.57269   1.00000  5.6 TiB  197 GiB   80 GiB  2.8 MiB  1.5 GiB  5.4 TiB  3.45  1.01   31      up
 4    hdd  5.57269   1.00000  5.6 TiB  194 GiB   76 GiB    5 KiB  361 MiB  5.4 TiB  3.39  0.99   26      up
 5    hdd  5.57269   1.00000  5.6 TiB  187 GiB   69 GiB    5 KiB  1.1 GiB  5.4 TiB  3.27  0.96   25      up
 6    hdd  5.57269   1.00000  5.6 TiB  202 GiB   84 GiB    5 KiB  1.5 GiB  5.4 TiB  3.54  1.04   28      up
                       TOTAL   39 TiB  1.3 TiB  543 GiB  8.4 MiB  8.8 GiB   38 TiB  3.42                   
MIN/MAX VAR: 0.95/1.04  STDDEV: 0.11
root@pve-00-005:~#

But when I look inside the bcache devices, the caches are all practically full, with at most 5% unused (at best). This is after many hours of idle time and after running the commands mentioned above.

root@pve-00-001:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1145
Sectors per Q:  36244576
Quantiles:      [8 24 39 56 84 112 155 256 392 476 605 714 825 902 988 1070 1184 1273 1369 1475 1568 1686 1775 1890 1994 2088 2212 2323 2441 2553 2693]
root@pve-00-001:~#

root@pve-00-002:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1143
Sectors per Q:  36245072
Quantiles:      [10 25 42 78 107 147 201 221 304 444 529 654 757 863 962 1057 1146 1264 1355 1469 1568 1664 1773 1885 2001 2111 2241 2368 2490 2613 2779]
root@pve-00-002:~#

root@pve-00-003:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         2%
Clean:          97%
Dirty:          0%
Metadata:       0%
Average:        971
Sectors per Q:  36244400
Quantiles:      [8 21 36 51 87 127 161 181 217 278 435 535 627 741 825 919 993 1080 1165 1239 1340 1428 1503 1611 1716 1815 1945 2037 2129 2248 2357]
root@pve-00-003:~#

root@pve-00-004:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         5%
Clean:          94%
Dirty:          0%
Metadata:       0%
Average:        1133
Sectors per Q:  36243024
Quantiles:      [10 26 41 57 92 121 152 192 289 440 550 645 806 913 989 1068 1170 1243 1371 1455 1567 1656 1746 1887 1996 2107 2201 2318 2448 2588 2729]
root@pve-00-004:~#

root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         2%
Clean:          97%
Dirty:          0%
Metadata:       0%
Average:        1076
Sectors per Q:  36245312
Quantiles:      [10 25 42 59 93 115 139 218 276 368 478 568 676 770 862 944 1090 1178 1284 1371 1453 1589 1700 1814 1904 1990 2147 2264 2386 2509 2679]
root@pve-00-005:~#

root@pve-00-006:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1085
Sectors per Q:  36244688
Quantiles:      [10 27 45 68 101 137 175 234 365 448 547 651 757 834 921 1001 1098 1185 1283 1379 1470 1575 1673 1781 1892 1994 2102 2216 2336 2461 2606]
root@pve-00-006:~#

root@pve-00-007:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%
Average:        1061
Sectors per Q:  36244160
Quantiles:      [10 24 40 56 94 132 177 233 275 326 495 602 704 846 928 1014 1091 1180 1276 1355 1471 1572 1665 1759 1862 1952 2087 2179 2292 2417 2537]
root@pve-00-007:~#

As you can see, across the 7 servers the unused cache space ranges from 2% to at most 5%, even though none of them has even 4% of its backing (mass) disk space occupied.
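For anyone collecting these figures across many hosts, here is a small sketch (my own helper, not part of bcache-tools) that pulls the "Unused" percentage out of priority_stats with awk. It parses a sample string identical to the output above; on a live node you would read the sysfs file instead:

```shell
# Extract the "Unused" percentage from bcache priority_stats output.
# On a live node, replace the sample below with:
#   stats=$(cat /sys/block/bcache0/bcache/cache/cache0/priority_stats)
stats='Unused:         4%
Clean:          95%
Dirty:          0%
Metadata:       0%'
unused=$(printf '%s\n' "$stats" | awk '/^Unused:/ { gsub("%", "", $2); print $2 }')
echo "unused=${unused}%"
```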

Little has changed after many hours with the system powered on but with no new reads or writes to the bcache devices. This time I did turn the virtual machine on, but only for a short while. Even so, as you can see, almost nothing changed, and there are still disks with no cache space available. I would bet that, in this situation, a few minutes of use would leave the caches 100% full again.

It is even a curious situation: the usable space actually in use (real data) on the bcache device doesn't reach half of what is occupied in the cache. It is as if the cache were retaining even data that has already been deleted from the device.
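A back-of-the-envelope check of that claim, using my own rounded numbers taken from figures quoted elsewhere in this thread (a 594 GB cache per node, roughly 96% occupied per priority_stats, versus the ~80 GiB DATA column in `ceph osd df`):

```shell
# Rough arithmetic: occupied cache space vs. real data stored per OSD.
cache_gb=594      # cache device size, as stated later in this thread
occupied_pct=96   # approx. Clean + Dirty from priority_stats above
data_gb=80        # typical DATA column from `ceph osd df` above
occupied_gb=$(( cache_gb * occupied_pct / 100 ))
echo "cache occupied: ~${occupied_gb} GB vs data stored: ~${data_gb} GB"
```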

Is there a solution?

Grateful,

On Thursday, April 6, 2023, at 18:21:20 BRT, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote: 



On Wed, 5 Apr 2023, Adriano Silva wrote:
> > Can you try to write 1 to the cache set sysfs file 
> > gc_after_writeback? 
> > When it is set, a gc will be woken up automatically after 
> > all writeback is accomplished. Then most of the clean 
> > cache might be shrunk and the B+tree nodes will be 
> > reduced quite a lot.
> 
> Would this be the command you ask me for?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I already advance that it did not give the 
> expected result. The Cache continues with 100% of the occupied space. 
> Nothing has changed despite the cache being cleaned and having written 
> the command you recommended. Let's see:

Did you try to trigger gc after setting gc_after_writeback=1?

        echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

The `gc_after_writeback=1` setting might not trigger until writeback 
finishes, but if writeback is already finished and there is no new IO then 
it may never trigger unless it is forced via `trigger_gc`

-Eric


> root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> Unused:         0%
> Clean:          98%
> Dirty:          1%
> Metadata:       0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> But if there was any movement on the disks after the command, I couldn't detect it:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|     time     
>   54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:58
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:28:59
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:00
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:01
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:02
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:03
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |05-04 15:29:04^C
> root@pve-00-005:~#
> 
> Why were there no changes?
> 
> > Currently there is no such option to limit bcache's 
> > in-memory B+tree node cache occupation, but when I/O 
> > load reduces, such memory consumption may drop very 
> > fast via the reaper in the system memory-management 
> > code. So it won’t be a problem. Bcache will try to use any 
> > possible memory for the B+tree node cache if it is 
> > necessary, and throttle I/O performance to return this 
> > memory to the memory-management code when the 
> > available system memory is low. By default, it should 
> > work well and nothing needs to be done by the user.
> 
> I've been monitoring the server's operation closely and I've never seen less than 50 GB of free RAM. Let's see: 
> 
> root@pve-00-005:~# free
>                total        used        free      shared  buff/cache   available
> Mem:       131980688    72670448    19088648       76780    40221592    57335704
> Swap:              0           0           0
> root@pve-00-005:~#
> 
> There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?
> 
> > Bcache doesn’t issue trim requests proactively. 
> > [...]
> > At run time, bcache only forwards trim requests to the backing device (not the cache device).
> 
> Wouldn't it be advantageous if bcache periodically sent TRIM (discard) to the cache? I believe the flash drives (SSD or NVMe) typically used as bcache caches need TRIM to maintain peak performance. So I think that if bcache issued TRIM regularly in the background (only for clean and free buckets), at a controlled frequency, or even as a background task manually triggered by the user (again, only for clean and free buckets), it could help reduce the cache's write latency. I believe it would also help writeback efficiency a lot. What do you think about this?
> 
> Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?
> 
> As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a variable (freelist_percent) that is supposed to enforce a minimum rate of free buckets. Could it be a solution? I don't know. In practice, though, I couldn't find this variable on my system (could that be because of the kernel version?)
> 
> Thank you very much!
> 
> 
> 
> On Wednesday, April 5, 2023, at 10:57:58 BRT, Coly Li <colyli@xxxxxxx> wrote: 
> 
> 
> 
> 
> 
> 
> 
> > On April 5, 2023, at 04:29, Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> wrote:
> > 
> > Hello,
> > 
> >> It sounds like a large cache size with limited memory 
> >> for the B+tree node cache?
> > 
> >> If the memory is limited and all B+tree nodes in the hot I/O 
> >> paths cannot stay in memory, it is possible for such 
> >> behavior to happen. In this case, shrinking the cached data 
> >> may reduce the metadata and the consequent in-memory 
> >> B+tree nodes as well. Yes, it may be helpful for such a 
> >> scenario.
> > 
> > There are several servers (TEN), all with 128 GB of RAM, of which around 100 GB (on average) is reported as free by the OS. The cache is 594 GB on enterprise NVMe; mass storage is 6 TB. The configuration is the same on all of them. They run Ceph OSDs serving a pool of disks accessed by servers (others, including themselves).
> > 
> > All show the same behavior.
> > 
> > When they were installed, they did not occupy the entire cache. Over time, the cache gradually filled up and never decreased in size. I have another five servers in another cluster going the same way. During the night their workload is reduced.
> 
> Copied.
> 
> > 
> >> But what is the I/O pattern here? If all the cache space 
> >> occupied by clean data for read request, and write 
> >> performance is cared about then. Is this a write tended, 
> >> or read tended workload, or mixed?
> > 
> The workload is write-heavy. Both read and write are important, but write latency is critical. These are virtual machine disks stored on Ceph. Inside them we have mixed loads: Windows with terminal services, Linux, and a database where write latency is critical.
> 
> 
> Copied.
> 
> > 
> >> As I explained, the re-reclaim is already in place. 
> >> But it cannot help much if busy I/O requests keep 
> >> coming and the writeback and gc threads have no 
> >> spare time to run.
> > 
> >> If incoming I/O exceeds the service capacity of the 
> >> cache service window, disappointed requesters can 
> >> be expected.
> > 
> Today, the ten servers have been without I/O for at least 24 hours. Nothing has changed; they remain at 100% cache occupancy. I believe that should have given the GC enough time, no?
> 
> This is nice. Now we have the maximum writeback throughput after I/O has been idle for a while, so after 24 hours all dirty data should be written back and the whole cache should be clean.
> 
> I guess just a gc is needed here.
> 
> Can you try to write 1 to the cache set sysfs file gc_after_writeback? When it is set, a gc will be woken up automatically after all writeback is accomplished. Then most of the clean cache might be shrunk and the B+tree nodes will be reduced quite a lot.
> 
> 
> > 
> >> Let’s check whether it is just because of insufficient 
> >> memory to hold the hot B+tree nodes in memory.
> > 
> > Does bcache have any RAM reservation options? Or would 100 GB of RAM be insufficient for the 594 GB NVMe cache? For that amount of cache, how much RAM should I reserve for bcache? Is there any command or parameter I should use to tell bcache to reserve that RAM? I haven't done anything about this. How would I do it?
> > 
> 
> Currently there is no such option to limit bcache's in-memory B+tree node cache occupation, but when I/O load reduces, such memory consumption may drop very fast via the reaper in the system memory-management code. So it won’t be a problem. Bcache will try to use any possible memory for the B+tree node cache if it is necessary, and throttle I/O performance to return this memory to the memory-management code when the available system memory is low. By default, it should work well and nothing needs to be done by the user. 
> 
> > Another question: How do I know if I should trigger a TRIM (discard) for my NVMe with bcache?
> 
> Bcache doesn’t issue trim requests proactively. The bcache program from bcache-tools may issue a discard request when you run,
>     bcache make -C <cache device path>
> to create a cache device.
> 
> At run time, bcache only forwards trim requests to the backing device (not the cache device).
> 
> 
> 
> Thanks.
> 
> Coly Li
> 
> 
> 
> > 
> [snipped]
> 
> 
> 



