Re: Writeback cache all used.

Hello,

> It sounds like a large cache size with limited memory for 
> caching B+tree nodes?

> If memory is limited and not all B+tree nodes on the hot 
> I/O paths can stay in memory, such behavior is possible. 
> In that case, shrinking the cached data may also reduce 
> the metadata and, consequently, the in-memory B+tree 
> nodes. Yes, it may be helpful for such a scenario.

There are ten servers, all with 128 GB of RAM, of which around 100 GB (on average) is reported by the OS as free. The cache is 594 GB on enterprise NVMe, and the mass storage is 6 TB. The configuration is the same on all of them. They run Ceph OSDs to serve a pool of disks accessed by servers (others, including themselves).

All show the same behavior.

When they were installed, they did not occupy the entire cache. Throughout use, the cache gradually filled up and never decreased in size. I have another five servers in another cluster going the same way. During the night their workload is reduced.

> But what is the I/O pattern here? Is all the cache space 
> occupied by clean data from read requests, while write 
> performance is what matters? Is this a write-heavy, 
> read-heavy, or mixed workload?

The workload is write-heavy. Both reads and writes matter, but write latency is critical. These are virtual machine disks stored on Ceph. Inside we have mixed loads: Windows with terminal services, Linux, and a database where direct write latency is critical.
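For reference, the latency and throughput figures quoted further down in this thread came from fio and ioping runs; a sketch of that kind of test is below (the device path and parameters are assumptions, not the exact commands used):

  # Careful: writing to the raw device is destructive; use a test device.
  # /dev/bcache0 is a placeholder for the real bcache device.
  fio --name=randwrite-4k --filename=/dev/bcache0 --rw=randwrite \
      --bs=4k --ioengine=libaio --iodepth=1 --direct=1 \
      --runtime=60 --time_based --group_reporting

  # Per-request latency probe (direct I/O, 4 KiB requests).
  ioping -D -s 4k -c 20 /dev/bcache0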

> As I explained, the re-reclaim is already there. But it 
> cannot help much if busy I/O requests keep coming and 
> the writeback and gc threads have no spare time to run.

> If incoming I/O exceeds the service capacity of the 
> cache service window, disappointed requesters can 
> be expected.

As of today, the ten servers have been without I/O for at least 24 hours. Nothing has changed; they still show 100% cache occupancy. I believe that should have given the GC enough time, no?
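If it helps, a garbage collection pass can also be forced by hand through sysfs (trigger_gc is documented under the cache set's internal directory); a minimal sketch, with the cache-set UUID left as a placeholder:

  CSET="/sys/fs/bcache/<cset-uuid>"     # substitute the real UUID

  cat $CSET/cache_available_percent     # clean, reclaimable portion of the cache
  echo 1 > $CSET/internal/trigger_gc    # force a gc run
  cat $CSET/cache_available_percent     # check again afterwards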

> Let’s check whether it is just because of insufficient 
> memory to hold the hot B+tree nodes in memory.

Does bcache have any option to reserve RAM? Or would the 100 GB of RAM be insufficient for the 594 GB of NVMe cache? For that amount of cache, how much RAM should I have reserved for bcache? Is there any command or parameter I should use to tell bcache to reserve that RAM? I didn't do anything about this. How would I do it?
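In case it is useful for that check, the RAM currently used by the B+tree cache is visible in sysfs (btree_cache_size at the cache set level, plus node counts under internal/); a minimal sketch, again with the UUID as a placeholder:

  CSET="/sys/fs/bcache/<cset-uuid>"     # substitute the real UUID

  cat $CSET/btree_cache_size            # memory used by cached B+tree nodes
  cat $CSET/internal/btree_nodes        # total B+tree nodes
  cat $CSET/internal/btree_used_percent # average fraction of the btree in use
  free -h                               # compare against free memory on the host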

Another question: How do I know if I should trigger a TRIM (discard) for my NVMe with bcache?
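From the bcache documentation, when the per-cache-device discard attribute is enabled, bcache itself issues a discard to each bucket before reusing it, so no manual TRIM should be needed in that case; it defaults to off. A minimal sketch to check and enable it (the paths are placeholders):

  CACHE="/sys/fs/bcache/<cset-uuid>/cache0"

  cat $CACHE/discard       # 0 = off (default), 1 = discard buckets before reuse
  echo 1 > $CACHE/discard  # enable only if the NVMe handles discards quickly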


Best regards,





On Tuesday, April 4, 2023, at 05:20:06 BRT, Coly Li <colyli@xxxxxxx> wrote: 




> On Apr 4, 2023, at 03:27, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 3 Apr 2023, Coly Li wrote:
>>> On Apr 2, 2023, at 08:01, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Fri, 31 Mar 2023, Adriano Silva wrote:
>>>> Thank you very much!
>>>> 
>>>>> I don't know for sure, but I'd think that since 91% of the cache is
>>>>> evictable, writing would just evict some data from the cache (without
>>>>> writing to the HDD, since it's not dirty data) and write to that area of
>>>>> the cache, *not* to the HDD. It wouldn't make sense in many cases to
>>>>> actually remove data from the cache, because then any reads of that data
>>>>> would have to read from the HDD; leaving it in the cache has very little
>>>>> cost and would speed up any reads of that data.
>>>> 
>>>> Maybe you're right, it seems to be writing to the cache, despite it 
>>>> indicating that the cache is at 100% full.
>>>> 
>>>> I noticed that it has excellent reading performance, but the writing 
>>>> performance dropped a lot when the cache was full. It's still a higher 
>>>> performance than the HDD, but much lower than it is when it's half full 
>>>> or empty.
>>>> 
>>>> Sequential writing tests with fio now show me 240MB/s of writing, 
>>>> which was already 900MB/s when the cache was still half full. Write 
>>>> latency has also increased. IOPS on random 4K writes are now in the 5K 
>>>> range. It was 16K with a half-used cache. At random 4K with ioping, 
>>>> latency went up. With a half-full cache it was 500us. It is now 945us.
>>>> 
>>>> For reading, nothing has changed.
>>>> 
>>>> However, for systems where writing time is critical, it makes a 
>>>> significant difference. If possible I would like to always keep it with 
>>>> a reasonable amount of empty space, to improve writing responses. Reduce 
>>>> 4K latency, mostly. Even if it meant programming a script in crontab or 
>>>> something similar, so that during the night the system executes a command 
>>>> to clear a percentage of the cache (about 30%, for example) that has been 
>>>> unused for the longest time. This would possibly make the cache more 
>>>> efficient on writes as well.
>>> 
>>> That is an interesting idea since it saves latency. Keeping a few unused 
>>> buckets ready to go would prevent GC during a cached write. 
>>> 
>> 
>> Currently around 10% is reserved already; if dirty data exceeds 
>> the threshold, further writes will go to the backing device directly.
> 
> It doesn't sound like he is referring to dirty data.  If I understand 
> correctly, he means that when the cache is 100% allocated (whether or not 
> anything is dirty) that latency is 2x what it could be compared to when 
> there are unallocated buckets ready for writing (ie, after formatting the 
> cache, but before it fully allocates).  His sequential throughput was 4x 
> slower with a 100% allocated cache: 900MB/s at 50% full after a format, 
> but only 240MB/s when the cache buckets are 100% allocated.
> 

It sounds like a large cache size with limited memory for caching B+tree nodes?

If memory is limited and not all B+tree nodes on the hot I/O paths can stay in memory, such behavior is possible. In that case, shrinking the cached data may also reduce the metadata and, consequently, the in-memory B+tree nodes. Yes, it may be helpful for such a scenario.

But what is the I/O pattern here? Is all the cache space occupied by clean data from read requests, while write performance is what matters? Is this a write-heavy, read-heavy, or mixed workload?

I suspect that the metadata I/O needed to read in missing B+tree nodes contributes to the I/O performance degradation. But this is only a guess on my part.


>> Reserving more space doesn’t change much if busy write 
>> requests keep arriving. For occupied clean cache space, I tested years 
>> ago that the space can be shrunk very fast and it won’t be a performance 
>> bottleneck. If the situation has changed now, please let me know.
> 
> His performance specs above indicate that a 100% occupied but clean cache 
> increases latency (due to release/re-allocate overhead).  The increased 
> latency reduces effective throughput.
> 

Maybe increasing memory a bit would help a lot. But if this is not a server machine, e.g. a laptop, increasing the DIMM size might be unfeasible on current models.

Let’s first check whether the memory is insufficient in this case.


>>> Coly, would this be an easy feature to add?
>> 
>> To make it, the change wouldn’t be complex. But I don’t think it would solve 
>> the original write performance issue when space is almost full. In the 
>> code we already have similar lists to hold available buckets for future 
>> data/metadata allocation.
> 
> Questions: 
> 
> - Right after a cache is formatted and attached to a bcache volume, which 
>  list contains the buckets that have never been used?

All buckets on the cache are allocated in a sequential-like order; there is no dedicated list to track never-used buckets. There are arrays to track the buckets' dirty state and generation numbers, but not for this purpose. Maintaining such a list is expensive when the cache size is large.
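That said, the per-cache-device priority_stats file does expose an aggregate view of bucket usage, including an "Unused" percentage for buckets that hold no data; a minimal sketch, with placeholder paths:

  cat "/sys/fs/bcache/<cset-uuid>/cache0/priority_stats"
  # "Unused" is the percentage of the cache holding no data;
  # "Metadata" is bcache's own metadata overhead.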

> 
> - Where does bcache insert back onto that list?  Does it?

Even for synchronous bucket invalidation by fifo/lru/random, the selection is decided at run time by a heap sort, which is still quite slow. But this is on the very slow code path, so it doesn’t hurt too much.


> 
>> But if the lists are empty, there is still time required to do the dirty 
>> writeback and garbage collection if necessary.
> 
> True, that code would always remain, no change there.
> 
>>> Bcache would need a `cache_min_free` tunable that would (asynchronously) 
>>> free the least recently used buckets that are not dirty.
>> 
>> For clean cache space, it has been already.
> 
> I'm not sure what you mean by "it has been already" - do you mean this is 
> already implemented?  If so, what/where is the sysfs tunable (or 
> hard-coded min-free-buckets value) ?

For read requests, the gc thread is still woken up from time to time. See the code path:

cached_dev_read_done_bh() ==> cached_dev_read_done() ==> bch_data_insert() ==> bch_data_insert_start() ==> wake_up_gc()

When a read misses in the cache, writing the clean data from the backing device into the cache device still occupies cache space, and by default, every time 1/16 of the space is allocated/occupied, the gc thread is woken up asynchronously.


> 
>> Shrinking clean cache space is very fast. I did a test 2 years ago; 
>> it took no more than 10 seconds to reclaim around 1TB+ of clean cache 
>> space. I guess the time might even be less, because reading the 
>> information from the priorities file also takes time.
> 
> Reclaiming large chunks of cache is probably fast in one shot, but 
> reclaiming one "clean but allocated" bucket (or even a few buckets) with 
> each WRITE has latency overhead associated with it.  Early reclaim to some 
> reasonable (configurable) minimum free-space value could hide that latency 
> in many workloads.
> 

As I explained, the re-reclaim is already there. But it cannot help much if busy I/O requests keep coming and the writeback and gc threads have no spare time to run.

If incoming I/O exceeds the service capacity of the cache service window, disappointed requesters can be expected.


> Long-running bcache volumes are usually 100% allocated, and if freeing 
> batches of clean buckets is fast, then doing it early would save metadata 
> handling during clean bucket re-allocation for new writes (and maybe 
> read-promotion, too).

Let’s check whether it is just because of insufficient memory to hold the hot B+tree nodes in memory.

Thanks.


Coly Li



