Re: Writeback cache all used.

> On Apr 5, 2023, at 04:29, Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> wrote:
> 
> Hello,
> 
>> It sounds like a large cache size with limited memory for 
>> caching B+tree nodes?
> 
>> If memory is limited and the B+tree nodes on the hot I/O 
>> paths cannot all stay in memory, such behavior is possible. 
>> In this case, shrinking the cached data may reduce the 
>> metadata and, consequently, the in-memory B+tree nodes as 
>> well. Yes, it may be helpful for such a scenario.
> 
> There are several servers (ten) all with 128 GB of RAM, of which around 100 GB (on average) is reported by the OS as free. The cache is 594 GB on enterprise NVMe, and the mass storage is 6 TB. The configuration is the same on all of them. They run Ceph OSD to serve a pool of disks accessed by servers (others, including themselves).
> 
> All show the same behavior.
> 
> When they were installed, they did not occupy the entire cache. Throughout use, the cache gradually filled up and never decreased in size. I have another five servers in another cluster going the same way. During the night their workload is reduced.

Copied.

> 
>> But what is the I/O pattern here? Is all the cache space 
>> occupied by clean data from read requests, while write 
>> performance is what you care about? Is this a write-heavy 
>> workload, a read-heavy one, or mixed?
> 
> The workload is heavier on writes. Both reads and writes are important, but write latency is critical. These are virtual machine disks stored on Ceph. Inside we have mixed loads: Windows with terminal services, Linux, and a database where direct write latency is critical.


Copied.

> 
>> As I explained, the re-reclaim mechanism is already there. 
>> But it cannot help much if busy I/O requests keep coming 
>> and the writeback and gc threads have no spare time to 
>> run.
> 
>> If the incoming I/Os exceed the service capacity of the 
>> cache service window, disappointed requesters can be 
>> expected.
> 
> Today, the ten servers have been without I/O operations for at least 24 hours. Nothing has changed; they continue at 100% cache occupancy. That should have given the GC enough time, no?

This is nice. Writeback runs at its maximum throughput once I/O has been idle for a while, so after 24 hours all dirty data should be written back and the whole cache might be clean.
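Before triggering anything, you can double check that writeback really finished. A minimal sketch, assuming the default sysfs layout, a backing device exposed as /dev/bcache0, and a placeholder cache set UUID (substitute your own from /sys/fs/bcache/):

	# Dirty data still held in the cache for this backing device;
	# it should read 0 once writeback is complete.
	cat /sys/block/bcache0/bcache/dirty_data

	# Percentage of the cache device that does not contain dirty data.
	cat /sys/fs/bcache/<cache-set-uuid>/cache_available_percent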

I guess just a gc is needed here.

Can you try writing 1 to the cache set sysfs file gc_after_writeback? When it is set, a gc will be woken up automatically after all writeback is accomplished. Then most of the clean cache might be shrunk and the number of B+tree nodes reduced quite a lot.
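In case it helps, this is roughly what I mean. The cache set UUID is a placeholder, and depending on the kernel version the attribute may sit either directly in the cache set directory or in its internal/ subdirectory:

	# Substitute the real cache set UUID from /sys/fs/bcache/.
	CSET=/sys/fs/bcache/<cache-set-uuid>

	# Arm the automatic gc that runs once writeback finishes.
	echo 1 > $CSET/gc_after_writeback
	# (on some kernels: echo 1 > $CSET/internal/gc_after_writeback)

	# If your kernel exposes it, internal/trigger_gc forces a gc run right away.
	echo 1 > $CSET/internal/trigger_gc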


> 
>> Let’s check whether it is just because of insufficient 
>> memory to hold the hot B+tree nodes in memory.
> 
> Does the bcache configuration have any RAM reservation options? Or would the 100 GB of RAM be insufficient for the 594 GB of NVMe cache? For that amount of cache, how much RAM should I have reserved for bcache? Is there any command or parameter I should use to tell bcache to reserve this RAM? I didn't do anything about this. How would I do it?
> 

Currently there is no option to limit how much memory bcache's in-memory B+tree node cache may occupy, but when the I/O load drops, this memory consumption can shrink very quickly via the reaper invoked from the system memory management code. So it won't be a problem. Bcache will try to use any available memory for the B+tree node cache when necessary, and it will throttle I/O performance to return this memory to the memory management code when available system memory is low. By default it should work well and nothing needs to be done by the user.
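If you want to watch this in practice, the cache set exports its current B+tree cache footprint through sysfs. A small sketch, with the UUID again a placeholder:

	# Memory currently used by the in-memory B+tree node cache; expect it
	# to grow under load and shrink again when memory pressure kicks in.
	cat /sys/fs/bcache/<cache-set-uuid>/btree_cache_size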

> Another question: How do I know if I should trigger a TRIM (discard) for my NVMe with bcache?

Bcache doesn’t issue trim requests proactively. The bcache program from bcache-tools may issue a discard request when you run,
	bcache make -C <cache device path>
to create a cache device.

At run time, the bcache code only forwards trim requests to the backing device (not the cache device).
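So if you want discards to reach the underlying storage during normal operation, issue them from above the bcache device, for example with fstrim on a mounted filesystem. A sketch with placeholder names (/dev/bcache0 mounted at /mnt/data):

	# fstrim sends discard requests down through the bcache device;
	# bcache forwards them to the backing device only, not to the NVMe cache.
	fstrim -v /mnt/data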


Thanks.

Coly Li
 

> 
[snipped]




