> On Apr 4, 2023, at 03:27, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 3 Apr 2023, Coly Li wrote:
>>> On Apr 2, 2023, at 08:01, Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Fri, 31 Mar 2023, Adriano Silva wrote:
>>>> Thank you very much!
>>>> 
>>>>> I don't know for sure, but I'd think that since 91% of the cache is
>>>>> evictable, writing would just evict some data from the cache (without
>>>>> writing to the HDD, since it's not dirty data) and write to that area
>>>>> of the cache, *not* to the HDD. It wouldn't make sense in many cases
>>>>> to actually remove data from the cache, because then any reads of
>>>>> that data would have to read from the HDD; leaving it in the cache
>>>>> has very little cost and would speed up any reads of that data.
>>>> 
>>>> Maybe you're right, it seems to be writing to the cache, despite it
>>>> indicating that the cache is at 100% full.
>>>> 
>>>> I noticed that it has excellent read performance, but the write
>>>> performance dropped a lot when the cache was full. It is still higher
>>>> performance than the HDD, but much lower than when the cache is half
>>>> full or empty.
>>>> 
>>>> Sequential write tests with fio now show me 240MB/s of writing, which
>>>> was 900MB/s when the cache was still half full. Write latency has
>>>> also increased. IOPS on random 4K writes are now in the 5K range; it
>>>> was 16K with a half-used cache. On random 4K with ioping, latency
>>>> went up: with a half-full cache it was 500us, it is now 945us.
>>>> 
>>>> For reading, nothing has changed.
>>>> 
>>>> However, for systems where write time is critical, it makes a
>>>> significant difference. If possible I would like to always keep a
>>>> reasonable amount of empty space, to improve write response times and
>>>> reduce 4K latency, mostly. Even if I had to program a script in
>>>> crontab or something like that, so that during the night the system
>>>> executes a command to clear a percentage of the cache (about 30%, for
>>>> example) that has been unused for the longest time. This would
>>>> possibly make the cache more efficient on writes as well.
>>> 
>>> That is an interesting idea since it saves latency. Keeping a few
>>> unused buckets ready to go would prevent GC during a cached write.
>>> 
>> 
>> Currently around 10% is reserved already; if dirty data exceeds the
>> threshold, further writes will go directly to the backing device.
> 
> It doesn't sound like he is referring to dirty data. If I understand
> correctly, he means that when the cache is 100% allocated (whether or
> not anything is dirty), latency is 2x what it could be compared to when
> there are unallocated buckets ready for writing (i.e., after formatting
> the cache, but before it fully allocates). His sequential throughput
> was 4x slower with a 100% allocated cache: 900MB/s at 50% full after a
> format, but only 240MB/s when the cache buckets are 100% allocated.
> 

This sounds like a large cache with limited memory for caching B+tree
nodes? If memory is constrained and the B+tree nodes on the hot I/O
paths cannot all stay in memory, such behavior is possible. In that
case, shrinking the cached data would reduce the metadata, and
consequently the number of in-memory B+tree nodes, as well. Yes, it may
be helpful for such a scenario.

But what is the I/O pattern here? Is all the cache space occupied by
clean data from read requests, with write performance being what
matters? Is this a write-heavy, read-heavy, or mixed workload?
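To illustrate why limited memory for B+tree nodes could produce this
kind of cliff, here is a toy user-space model. It is not bcache code;
the 1024-node budget, the LRU policy and the uniform access pattern are
all made up for illustration:

#include <stdio.h>
#include <stdlib.h>

#define CACHE_NODES 1024          /* in-memory node budget (assumed) */

static int cache[CACHE_NODES];    /* node ids, index 0 = most recent */
static int cached;                /* nodes currently held in memory  */

/* Returns 1 on a cache hit, 0 on a miss (i.e. a metadata read). */
static int lookup(int node)
{
	for (int i = 0; i < cached; i++) {
		if (cache[i] == node) {
			/* Hit: move to front to keep LRU order. */
			for (; i > 0; i--)
				cache[i] = cache[i - 1];
			cache[0] = node;
			return 1;
		}
	}
	/* Miss: insert at front, dropping the LRU tail when full. */
	if (cached < CACHE_NODES)
		cached++;
	for (int i = cached - 1; i > 0; i--)
		cache[i] = cache[i - 1];
	cache[0] = node;
	return 0;
}

int main(void)
{
	srand(1);
	for (int hot = 512; hot <= 4096; hot *= 2) {
		long miss = 0, total = 100000;

		cached = 0;
		for (long n = 0; n < total; n++)
			if (!lookup(rand() % hot))
				miss++;
		printf("hot set %4d nodes: miss rate %.1f%%\n",
		       hot, 100.0 * miss / total);
	}
	return 0;
}

While the hot set fits within the budget, the only misses are the
initial cold ones; once the hot set is twice the budget, about half of
the lookups turn into extra metadata reads on the I/O path. That is the
shape of degradation I would expect if this guess is right.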
I suspect that the metadata I/O needed to read in missing B+tree nodes
contributes to the I/O performance degradation. But this is only a
guess on my part.

>> Reserving more space doesn't change much if busy write requests keep
>> arriving. For occupied clean cache space, I tested years ago: the
>> space can be shrunk very fast and it won't be a performance
>> bottleneck. If the situation has changed now, please let me know.
> 
> His performance specs above indicate that a 100% occupied but clean
> cache increases latency (due to release/re-allocate overhead). The
> increased latency reduces effective throughput.
> 

Maybe adding a bit more memory would help a lot. But if this is not a
server machine, e.g. a laptop, increasing the DIMM size might be
unfeasible for today's models. Let's first check whether memory is
insufficient in this case.

>>> Coly, would this be an easy feature to add?
>> 
>> To make it, the change won't be complex. But I don't feel it would
>> solve the original write performance issue when space is almost full.
>> In the code we already have similar lists to hold available buckets
>> for future data/metadata allocation.
> 
> Questions:
> 
> - Right after a cache is formatted and attached to a bcache volume,
>   which list contains the buckets that have never been used?

All buckets on the cache are allocated in a sequential-like order;
there is no dedicated list to track never-used buckets. There are
arrays to track the buckets' dirty state and gen number, but not for
this purpose. Maintaining such a list is expensive when the cache size
is large.

> - Where does bcache insert back onto that list? Does it?

Even for synchronous bucket invalidation by fifo/lru/random, the
selection is decided at run time by a heap sort, which is still quite
slow. But this is on a very slow code path, so it doesn't hurt too
much.

>> But if the lists are empty, there is still time required to do the
>> dirty writeback and garbage collection if necessary.
> 
> True, that code would always remain, no change there.
> 
>>> Bcache would need a `cache_min_free` tunable that would
>>> (asynchronously) free the least recently used buckets that are not
>>> dirty.
>> 
>> For clean cache space, it has been already.
> 
> I'm not sure what you mean by "it has been already" - do you mean this
> is already implemented? If so, what/where is the sysfs tunable (or
> hard-coded min-free-buckets value)?

For read requests, the gc thread is still woken up from time to time.
See the code path:

  cached_dev_read_done_bh() ==>
    cached_dev_read_done() ==>
      bch_data_insert() ==>
        bch_data_insert_start() ==>
          wake_up_gc()

When a read misses in the cache, writing the clean data from the
backing device into the cache device still occupies cache space, and by
default the gc thread is woken up asynchronously for every 1/16 of the
space allocated/occupied.

>> It is very fast to shrink clean cache space. I did a test 2 years
>> ago, and it took no more than 10 seconds to reclaim around 1TB+ of
>> clean cache space. I guess the actual time might be much less,
>> because reading the information from the priorities file also takes
>> time.
> 
> Reclaiming large chunks of cache is probably fast in one shot, but
> reclaiming one "clean but allocated" bucket (or even a few buckets)
> with each WRITE has latency overhead associated with it. Early reclaim
> to some reasonable (configurable) minimum free-space value could hide
> that latency in many workloads.
> 

As I explained, this early reclaim is already in place.
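For reference, the trigger I am referring to looks roughly like the
following. This is simplified from my memory of
drivers/md/bcache/btree.h and request.c, so please double-check it
against your tree:

/* Re-armed after each gc run: a cache-set-wide countdown of 1/16 of
 * the cache size, in sectors.
 */
static inline void set_gc_sectors(struct cache_set *c)
{
	atomic_set(&c->sectors_to_gc,
		   c->sb.bucket_size * c->nbuckets / 16);
}

/* On the data insert path, in bch_data_insert_start(): every insert
 * subtracts its size from the countdown, and when the countdown goes
 * negative the gc thread is woken up asynchronously.
 */
if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) {
	set_gc_sectors(op->c);
	wake_up_gc(op->c);
}

So the reclaim is not driven by a free-space watermark like the
proposed cache_min_free; it is driven by the amount of data inserted
since the last gc run.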
But it cannot help much if busy I/O requests keep coming and the
writeback and gc threads get no spare time to run. If the incoming I/Os
exceed the service capacity of the cache service window, disappointed
requesters can be expected.

> Long-running bcache volumes are usually 100% allocated, and if freeing
> batches of clean buckets is fast, then doing it early would save
> metadata handling during clean bucket re-allocation for new writes
> (and maybe read-promotion, too).

Let's check whether it is just because of insufficient memory to hold
the hot B+tree nodes in memory [1].

Thanks.

Coly Li
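[1] If I remember correctly, the current footprint of the in-memory
B+tree node cache can be read from sysfs at
/sys/fs/bcache/<cset-uuid>/btree_cache_size (attribute name from
memory; please verify it exists on your kernel version). Comparing it
with the machine's available memory should tell us whether the hot
btree working set fits.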