> On Apr 4, 2023, at 03:27, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 3 Apr 2023, Coly Li wrote:
>>> On Apr 2, 2023, at 08:01, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Fri, 31 Mar 2023, Adriano Silva wrote:
>>>> Thank you very much!
>>>> 
>>>>> I don't know for sure, but I'd think that since 91% of the cache is
>>>>> evictable, writing would just evict some data from the cache (without
>>>>> writing to the HDD, since it's not dirty data) and write to that area
>>>>> of the cache, *not* to the HDD. It wouldn't make sense in many cases
>>>>> to actually remove data from the cache, because then any reads of
>>>>> that data would have to read from the HDD; leaving it in the cache
>>>>> has very little cost and would speed up any reads of that data.
>>>> 
>>>> Maybe you're right, it seems to be writing to the cache, despite it
>>>> indicating that the cache is at 100% full.
>>>> 
>>>> I noticed that it has excellent read performance, but the write
>>>> performance dropped a lot when the cache was full. It is still higher
>>>> performance than the HDD, but much lower than when the cache is half
>>>> full or empty.
>>>> 
>>>> Sequential write tests with fio now show me 240MB/s of writing, which
>>>> was 900MB/s when the cache was still half full. Write latency has
>>>> also increased. IOPS on random 4K writes are now in the 5K range; it
>>>> was 16K with a half-used cache. On random 4K with ioping, latency
>>>> went up: with a half-full cache it was 500us, it is now 945us.
>>>> 
>>>> For reading, nothing has changed.
>>>> 
>>>> However, for systems where write time is critical, it makes a
>>>> significant difference. If possible I would like to always keep a
>>>> reasonable amount of empty space, to improve write response times and
>>>> reduce 4K latency, mostly. Even if I had to program a script in
>>>> crontab or something like that, so that during the night the system
>>>> executes a command to clear a percentage of the cache (about 30%, for
>>>> example) that has been unused for the longest time. This would
>>>> possibly make the cache more efficient on writes as well.
>>> 
>>> That is an interesting idea since it saves latency. Keeping a few
>>> unused buckets ready to go would prevent GC during a cached write.
>>> 
>> 
>> Currently around 10% is reserved already; if dirty data exceeds the
>> threshold, further writes will go directly to the backing device.
> 
> It doesn't sound like he is referring to dirty data. If I understand
> correctly, he means that when the cache is 100% allocated (whether or
> not anything is dirty), latency is 2x what it could be compared to when
> there are unallocated buckets ready for writing (i.e., after formatting
> the cache, but before it fully allocates). His sequential throughput
> was 4x slower with a 100% allocated cache: 900MB/s at 50% full after a
> format, but only 240MB/s when the cache buckets are 100% allocated.
> 

This sounds like a large cache with limited memory for caching B+tree
nodes? If memory is constrained and the B+tree nodes on the hot I/O
paths cannot all stay in memory, such behavior is possible. In that
case, shrinking the cached data would reduce the metadata, and
consequently the number of in-memory B+tree nodes, as well. Yes, it may
be helpful for such a scenario.

But what is the I/O pattern here? Is all the cache space occupied by
clean data from read requests, with write performance being what
matters? Is this a write-heavy, read-heavy, or mixed workload?
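To illustrate why limited memory for B+tree nodes could produce this
kind of cliff, here is a toy user-space model. It is not bcache code;
the 1024-node budget, the LRU policy and the uniform access pattern are
all made up for illustration:

#include <stdio.h>
#include <stdlib.h>

#define CACHE_NODES 1024          /* in-memory node budget (assumed) */

static int cache[CACHE_NODES];    /* node ids, index 0 = most recent */
static int cached;                /* nodes currently held in memory  */

/* Returns 1 on a cache hit, 0 on a miss (i.e. a metadata read). */
static int lookup(int node)
{
	for (int i = 0; i < cached; i++) {
		if (cache[i] == node) {
			/* Hit: move to front to keep LRU order. */
			for (; i > 0; i--)
				cache[i] = cache[i - 1];
			cache[0] = node;
			return 1;
		}
	}
	/* Miss: insert at front, dropping the LRU tail when full. */
	if (cached < CACHE_NODES)
		cached++;
	for (int i = cached - 1; i > 0; i--)
		cache[i] = cache[i - 1];
	cache[0] = node;
	return 0;
}

int main(void)
{
	srand(1);
	for (int hot = 512; hot <= 4096; hot *= 2) {
		long miss = 0, total = 100000;

		cached = 0;
		for (long n = 0; n < total; n++)
			if (!lookup(rand() % hot))
				miss++;
		printf("hot set %4d nodes: miss rate %.1f%%\n",
		       hot, 100.0 * miss / total);
	}
	return 0;
}

While the hot set fits within the budget, the only misses are the
initial cold ones; once the hot set is twice the budget, about half of
the lookups turn into extra metadata reads on the I/O path. That is the
shape of degradation I would expect if this guess is right.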
I suspect that the metadata I/O needed to read in missing B+tree nodes
contributes to the I/O performance degradation. But this is only a
guess on my part.

>> Reserving more space doesn't change much if busy write requests keep
>> arriving. For occupied clean cache space, I tested years ago: the
>> space can be shrunk very fast and it won't be a performance
>> bottleneck. If the situation has changed now, please let me know.
> 
> His performance specs above indicate that a 100% occupied but clean
> cache increases latency (due to release/re-allocate overhead). The
> increased latency reduces effective throughput.
> 

Maybe adding a bit more memory would help a lot. But if this is not a
server machine, e.g. a laptop, increasing the DIMM size might be
unfeasible for today's models. Let's first check whether memory is
insufficient in this case.

>>> Coly, would this be an easy feature to add?
>> 
>> To make it, the change won't be complex. But I don't feel it would
>> solve the original write performance issue when space is almost full.
>> In the code we already have similar lists to hold available buckets
>> for future data/metadata allocation.
> 
> Questions:
> 
> - Right after a cache is formatted and attached to a bcache volume,
>   which list contains the buckets that have never been used?

All buckets on the cache are allocated in a sequential-like order;
there is no dedicated list to track never-used buckets. There are
arrays to track the buckets' dirty state and gen number, but not for
this purpose. Maintaining such a list is expensive when the cache size
is large.

> - Where does bcache insert back onto that list? Does it?

Even for synchronous bucket invalidation by fifo/lru/random, the
selection is decided at run time by a heap sort, which is still quite
slow. But this is on a very slow code path, so it doesn't hurt too
much.

>> But if the lists are empty, there is still time required to do the
>> dirty writeback and garbage collection if necessary.
> 
> True, that code would always remain, no change there.
> 
>>> Bcache would need a `cache_min_free` tunable that would
>>> (asynchronously) free the least recently used buckets that are not
>>> dirty.
>> 
>> For clean cache space, it has been already.
> 
> I'm not sure what you mean by "it has been already" - do you mean this
> is already implemented? If so, what/where is the sysfs tunable (or
> hard-coded min-free-buckets value)?

For read requests, the gc thread is still woken up from time to time.
See the code path:

  cached_dev_read_done_bh() ==>
    cached_dev_read_done() ==>
      bch_data_insert() ==>
        bch_data_insert_start() ==>
          wake_up_gc()

When a read misses in the cache, writing the clean data from the
backing device into the cache device still occupies cache space, and by
default the gc thread is woken up asynchronously for every 1/16 of the
space allocated/occupied.

>> It is very fast to shrink clean cache space. I did a test 2 years
>> ago, and it took no more than 10 seconds to reclaim around 1TB+ of
>> clean cache space. I guess the actual time might be much less,
>> because reading the information from the priorities file also takes
>> time.
> 
> Reclaiming large chunks of cache is probably fast in one shot, but
> reclaiming one "clean but allocated" bucket (or even a few buckets)
> with each WRITE has latency overhead associated with it. Early reclaim
> to some reasonable (configurable) minimum free-space value could hide
> that latency in many workloads.
> 

As I explained, this early reclaim is already in place.
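For reference, the trigger I am referring to looks roughly like the
following. This is simplified from my memory of
drivers/md/bcache/btree.h and request.c, so please double-check it
against your tree:

/* Re-armed after each gc run: a cache-set-wide countdown of 1/16 of
 * the cache size, in sectors.
 */
static inline void set_gc_sectors(struct cache_set *c)
{
	atomic_set(&c->sectors_to_gc,
		   c->sb.bucket_size * c->nbuckets / 16);
}

/* On the data insert path, in bch_data_insert_start(): every insert
 * subtracts its size from the countdown, and when the countdown goes
 * negative the gc thread is woken up asynchronously.
 */
if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) {
	set_gc_sectors(op->c);
	wake_up_gc(op->c);
}

So the reclaim is not driven by a free-space watermark like the
proposed cache_min_free; it is driven by the amount of data inserted
since the last gc run.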
But it cannot help much if busy I/O requests keep coming and the
writeback and gc threads get no spare time to run. If the incoming I/Os
exceed the service capacity of the cache service window, disappointed
requesters can be expected.

> Long-running bcache volumes are usually 100% allocated, and if freeing
> batches of clean buckets is fast, then doing it early would save
> metadata handling during clean bucket re-allocation for new writes
> (and maybe read-promotion, too).

Let's check whether it is just because of insufficient memory to hold
the hot B+tree nodes in memory [1].

Thanks.

Coly Li
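[1] If I remember correctly, the current footprint of the in-memory
B+tree node cache can be read from sysfs at
/sys/fs/bcache/<cset-uuid>/btree_cache_size (attribute name from
memory; please verify it exists on your kernel version). Comparing it
with the machine's available memory should tell us whether the hot
btree working set fits.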