Re: Writeback cache all used.

Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> · Sun, 9 Apr 2023 20:14:57 +0000 (UTC)

Hello Eric !

> Did you try to trigger gc after setting gc_after_writeback=1?
> 
>         echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc
> 
> The `gc_after_writeback=1` setting might not trigger until writeback
> finishes, but if writeback is already finished and there is no new IO then
> it may never trigger unless it is forced via `trigger_gc`.
> 
> -Eric

Yes, I ran both commands several times: one after the other, then in reverse order, repeatedly, after hours of zero disk reads and writes. I tested this on more than one server, in fact on all of my servers. The results are similar everywhere: no significant amount of cache space is freed.

To make matters worse, in other performance tests I found that, depending on the block size used, the difference in performance is alarming. With 4 MB blocks I can write at 691 MB/s with a freshly formatted cache.
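
The procedure described above can be sketched as a small shell helper that records the cache occupancy before and after forcing gc. This is only a sketch under assumptions: the function names `stats_summary` and `force_gc` are hypothetical, the 10-second wait is arbitrary, and it assumes that `gc_after_writeback` and `trigger_gc` both live in the same `internal` directory of the cache set, as the paths quoted in this thread suggest.

```shell
# Print the Unused/Clean/Dirty percentages from a priority_stats file.
stats_summary() {
    awk -F: '
        /^(Unused|Clean|Dirty):/ { gsub(/[ \t%]/, "", $2); v[$1] = $2 }
        END { printf "unused=%s clean=%s dirty=%s\n", v["Unused"], v["Clean"], v["Dirty"] }
    ' "$1"
}

# Write 1 to gc_after_writeback, then to trigger_gc, wait, and show the
# stats before and after.
#   $1: the cache set "internal" directory (assumed to hold both files)
#   $2: the priority_stats file of the cache device
#   $3: seconds to wait for gc to run (arbitrary default of 10)
force_gc() {
    set_dir=$1
    stats=$2
    wait_s=${3:-10}
    echo "before: $(stats_summary "$stats")"
    echo 1 > "$set_dir/gc_after_writeback"
    echo 1 > "$set_dir/trigger_gc"
    sleep "$wait_s"
    echo "after:  $(stats_summary "$stats")"
}

# Example (as root, with the paths quoted in this thread):
# force_gc /sys/block/bcache0/bcache/cache/internal \
#          /sys/block/bcache0/bcache/cache/cache0/priority_stats
```

If the "before" and "after" lines are identical, gc did not change the bucket accounting, which matches what is reported here.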

root@pve-01-007:~# ceph tell osd.0 bench
{
    bytes_written: 1073741824,
    blocksize: 4194304,
    elapsed_sec: 1.5536911500000001,
    bytes_per_sec: 691090905.67967761,
    iops: 164.76891176216068
}
root@pve-01-007:~#

In the same test I get only 142 MB/s when the cache is fully occupied.

root@pve-00-005:~# ceph tell osd.0 bench
{
    bytes_written: 1073741824,
    blocksize: 4194304,
    elapsed_sec: 7.5302066820000002,
    bytes_per_sec: 142591281.93209398,
    iops: 33.996410830520148
}
root@pve-00-005:~#

That is, once the cache is fully occupied, bcache writes at only about 21% of the throughput obtained with a freshly formatted cache. It hardly looks like the same hardware, yet it is: same NVMe, same processors, same RAM, same server, same OS, same bcache settings. If I format the cache, it returns to the original performance.
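
The percentage comes straight from the two `bytes_per_sec` values in the bench outputs above. A minimal sketch of the arithmetic, using a hypothetical helper name `throughput_pct`:

```shell
# Print the first throughput as a percentage of the second, one decimal place.
#   $1: bytes_per_sec with the cache fully occupied
#   $2: bytes_per_sec with a freshly formatted cache
throughput_pct() {
    awk -v full="$1" -v fresh="$2" 'BEGIN { printf "%.1f\n", 100 * full / fresh }'
}

throughput_pct 142591281.93209398 691090905.67967761
```

The result rounds to the roughly 21% mentioned above.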

I'm looking at the bcache source code to see if I can pick up anything that might be useful to me. But the code is big and complex, and I confess that it is not quick to understand.

I wrote a little C program to try to call a built-in bcache function for testing, but I spent all of Sunday on it and couldn't even get it to compile. It's almost funny.

But what would count as garbage for the collector in this case? As I understand it, the "garbage" would be parts of buckets (blocks) that were not reused and were "lost" outside both the c->free list and the free_inc list. I think collecting them would help, but perhaps only in a very limited way. Is this the condition of most of the buckets that are in use?

As it seems to me (I could be talking nonsense), what would solve the problem is getting bcache to keep an adequate number of buckets in the c->free list. I see this mentioned in bcache/alloc.c.

Would that happen through invalidate_buckets(ca), called from the bch_allocator_thread(void *arg) thread? I don't know. What is limiting the action of this thread? I could not work it out.

In my anxious ignorance, I'm left thinking that this might be the way: some means of calling this function to invalidate many clean buckets in LRU order and discard them. So I looked for an external interface that calls it, but I didn't find one.
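
One way to look for such an interface is to list the sysfs attributes under the cache set that accept writes. The sketch below only shows which knobs the running kernel exposes; it does not imply that any of them ends up calling invalidate_buckets(). The helper name `writable_attrs` is hypothetical, and the search depth of 2 is an assumption chosen to reach files such as `internal/trigger_gc`.

```shell
# List regular files under a directory whose owner-write bit is set,
# looking at most two levels deep.
writable_attrs() {
    find "$1" -maxdepth 2 -type f -perm -200 | sort
}

# Example (needs a real cache set; the UUID is the one quoted in this thread):
# writable_attrs /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd
```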

Thank you very much!

On Sunday, April 9, 2023 at 13:37:32 BRT, Coly Li <colyli@xxxxxxx> wrote:

> On April 6, 2023, at 03:31, Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> wrote:
> 
> Hello Coly.
> 
> Yes, the server is always on. I allowed it to stay on for more than 24 hours with zero disk I/O to the bcache device. The result is that there are no movements on the cache or data disks, nor on the bcache device as we can see:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  154k: 301k  221k: 223k  169k|0.67  0.54 :6.99  20.5 :6.77  12.3 |05-04 14:45:50
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:51
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:52
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:53
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:54
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:55
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:56
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:45:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:04
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:05
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:06
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 14:46:07
> 
> It can stay like that for hours without showing any data flow at all, either read or write, on any of the devices.
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/state
> clean
> root@pve-00-005:~#
> 
> But look how strange: another command (priority_stats) shows that there is still 1% dirty data in the cache, and 0% unused cache space, even after the server has been on and completely idle for hours:
> 
> root@pve-00-005:~# cat /sys/devices/pci0000:80/0000:80:01.1/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/bcache/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> Why is this happening?
> 
>> Can you try to write 1 to the cache set sysfs file 
>> gc_after_writeback? 
>> When it is set, gc will be woken up automatically after 
>> all writeback has completed. Then most of the clean cache 
>> might be shrunk and the number of B+tree nodes reduced 
>> quite a lot.
> 
> Is this the command you are asking me to run?
> 
> root@pve-00-005:~# echo 1 > /sys/fs/bcache/a18394d8-186e-44f9-979a-8c07cb3fbbcd/internal/gc_after_writeback
> 
> If this command is correct, I can already say that it did not give the expected result. The cache still shows 100% of its space occupied. Nothing has changed, even though the cache is clean and I ran the command you recommended. Let's see:
> 
> root@pve-00-005:~# cat /sys/block/bcache0/bcache/cache/cache0/priority_stats
> Unused:        0%
> Clean:          98%
> Dirty:          1%
> Metadata:      0%
> Average:        1137
> Sectors per Q:  36245232
> Quantiles:      [12 26 42 60 80 127 164 237 322 426 552 651 765 859 948 1030 1176 1264 1370 1457 1539 1674 1786 1899 1989 2076 2232 2350 2471 2594 2764]
> 
> But if there was any movement on the disks after the command, I couldn't detect it:
> 
> root@pve-00-005:~# dstat -drt -D sdc,nvme0n1,bcache0
> --dsk/sdc---dsk/nvme0n1-dsk/bcache0 ---io/sdc----io/nvme0n1--io/bcache0 ----system----
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ|    time    
>  54k  153k: 300k  221k: 222k  169k|0.67  0.53 :6.97  20.4 :6.76  12.3 |05-04 15:28:57
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:58
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:28:59
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:00
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:01
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:02
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:03
>    0    0 :  0    0 :  0    0 |  0    0 :  0    0 :  0    0 |05-04 15:29:04^C
> root@pve-00-005:~#
> 
> Why were there no changes?

Thanks for the above information. The result is unexpected to me. Let me check whether the B+tree nodes are not being shrunk; this is something that should be improved.

When the write-erase time matters for write requests, it is normally because a heavy write load is coming in. In such a situation, the LBAs of the collected buckets might be allocated again very soon, even before the SSD controller finishes the internal erase hinted by discard/trim. Therefore issuing discard/trim right before writing to those LBAs doesn’t help write performance at all and only adds unnecessary workload to the SSD controller.

And for today's SATA/NVMe SSDs, with the workload you described above, the write performance drawback is almost negligible.

> 
>> Currently there is no option to limit how much memory the 
>> in-memory B+tree node cache occupies, but when the I/O 
>> load drops, that memory may be reclaimed very quickly by 
>> the reaper in the system memory management code, so it 
>> won’t be a problem. Bcache will use any available memory 
>> for the B+tree node cache if necessary, and throttle I/O 
>> performance to return that memory to the memory 
>> management code when available system memory is low. 
>> By default it should work well, and the user should not 
>> need to do anything.
> 
> I've been watching the server's operation closely and I've never seen less than 50 GB of free RAM. Let's see:
> 
> root@pve-00-005:~# free
>               total        used        free      shared  buff/cache  available
> Mem:      131980688    72670448    19088648      76780    40221592    57335704
> Swap:              0          0          0
> root@pve-00-005:~#
> 
> There is always plenty of free RAM, which makes me ask: Could there really be a problem related to a lack of RAM?

No, this is not because of insufficient memory. From your information the memory is enough.

> 
>> Bcache doesn’t issue trim requests proactively. 
>> [...]
>> At run time, bcache only forwards trim requests to the backing device (not the cache device).
> 
> Wouldn't it be advantageous if bcache occasionally sent TRIM (discard) to the cache device? I believe the flash drives (SSD or NVMe) typically used as a bcache cache need TRIM to maintain top performance. So I think that if bcache issued TRIM regularly in the background (only for clean and free buckets) at a controlled frequency, or even from a background task triggered manually by the user (again only for clean and free buckets), it could help reduce the write latency of the cache. I believe it would help writeback efficiency a lot. What do you think about this?

There was such an attempt, but it did not help at all. The reason is that bcache only knows which buckets can be discarded once garbage collection has handled them.

> 
> Anyway, this issue of the free buckets not appearing is keeping me awake at night. Could it be a problem with my Kernel version (Linux 5.15)?
> 
> As I mentioned before, I saw in the bcache documentation (https://docs.kernel.org/admin-guide/bcache.html) a variable (freelist_percent) that was supposed to control a minimum rate of free buckets. Could it be a solution? I don't know. But in practice, I didn't find this variable in my system (could it be because of the OS version?)

Let me look into this…

Thanks.

Coly Li