On Wed, 5 Apr 2023, Coly Li wrote:
>
> On Apr 5, 2023, at 04:29, Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> wrote:
> >
> > Hello,
> >
> >> It sounds like a large cache size with limited memory for caching
> >> B+tree nodes?
> >>
> >> If the memory is limited and all B+tree nodes in the hot I/O paths
> >> cannot stay in memory, it is possible for such behavior to happen.
> >> In this case, shrinking the cached data may reduce the metadata and
> >> the consequent in-memory B+tree nodes as well. Yes, it may be
> >> helpful for such a scenario.
> >
> > There are several servers (TEN), all with 128 GB of RAM, of which
> > around 100 GB (on average) are presented by the OS as free. The cache
> > is 594 GB in size on enterprise NVMe; mass storage is 6 TB. The
> > configuration is the same on all of them. They run Ceph OSDs to serve
> > a pool of disks accessed by servers (others, including themselves).
> >
> > All show the same behavior.
> >
> > When they were installed, they did not occupy the entire cache.
> > Throughout use, the cache gradually filled up and never decreased in
> > size. I have another five servers in another cluster going the same
> > way. During the night their workload is reduced.
>
> Copied.
>
> >> But what is the I/O pattern here? Is all the cache space occupied by
> >> clean data from read requests, while write performance is what you
> >> care about? Is this a write-heavy or read-heavy workload, or mixed?
> >
> > The workload is heavier on writes. Both matter, read and write, but
> > write latency is critical. These are virtual machine disks that are
> > stored on Ceph. Inside we have mixed loads: Windows with terminal
> > services, Linux, including a database where direct write latency is
> > critical.
>
> Copied.
>
> >> As I explained, the re-reclaim is already there. But it cannot help
> >> too much if busy I/O requests keep coming and the writeback and gc
> >> threads have no spare time to run.
> >>
> >> If incoming I/Os exceed the service capacity of the cache service
> >> window, disappointed requesters can be expected.
> >
> > Today, the ten servers have been without I/O operations for at least
> > 24 hours. Nothing has changed; they continue with 100% cache
> > occupancy. I believe that should have given the GC enough time, no?
>
> This is nice. Now we have the maximum writeback throughput after I/O has
> been idle for a while, so after 24 hours all dirty data should be written
> back and the whole cache might be clean.
>
> I guess just a gc is needed here.
>
> Can you try to write 1 to the cache set sysfs file gc_after_writeback?
> When it is set, a gc will be woken up automatically after all writeback
> is accomplished. Then most of the clean cache might be shrunk and the
> B+tree nodes reduced quite a lot.

If writeback is done, then you might need this to trigger it, too:

  echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

Question for Coly: Will `gc_after_writeback` evict read-promoted pages,
too, making this effectively a writeback-only cache?

Here is the commit:
https://patchwork.kernel.org/project/linux-block/patch/20181213145357.38528-9-colyli@xxxxxxx/
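Putting those two knobs together, a minimal sketch of the sequence on one
of these hosts could look like this. Only the trigger_gc path is taken
from above; the gc_after_writeback and btree_cache_size locations are my
assumptions (they are cache-set attributes, so they may sit under
internal/ on some kernels):

  # How much memory the B+tree node cache currently uses
  cat /sys/block/bcache0/bcache/cache/btree_cache_size

  # Ask for an automatic gc once writeback finishes (assumed location)
  echo 1 > /sys/block/bcache0/bcache/cache/gc_after_writeback

  # Or, if writeback is already done, kick a gc by hand
  echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

  # Re-check after the gc completes; the value should have dropped
  cat /sys/block/bcache0/bcache/cache/btree_cache_size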
> >> Let's check whether it is just because of insufficient memory to
> >> hold the hot B+tree nodes in memory.
> >
> > Does the bcache configuration have any RAM reservation options? Or
> > would the 100 GB of RAM be insufficient for the 594 GB of NVMe cache?
> > For that amount of cache, how much RAM should I have reserved for
> > bcache? Is there any command or parameter I should use to signal to
> > bcache that it should reserve this RAM? I didn't do anything about
> > this. How would I do it?
>
> Currently there is no option to limit how much memory bcache's in-memory
> B+tree node cache occupies, but when the I/O load drops, that memory
> consumption may shrink very quickly via the reaper in the system memory
> management code. So it won't be a problem. Bcache will try to use any
> available memory for the B+tree node cache if necessary, and throttle
> I/O performance to return that memory to the memory management code when
> available system memory is low. By default it should work well, and
> nothing needs to be done by the user.

Does bcache intentionally throttle I/O under memory pressure, or is the
I/O throttling just a side effect of increased memory pressure caused by
fewer B+tree nodes in the cache?

> > Another question: How do I know if I should trigger a TRIM (discard)
> > for my NVMe with bcache?
>
> Bcache doesn't issue trim requests proactively.

Are you sure? Maybe I misunderstood the code here, but it looks like
buckets get a discard during allocation:

https://elixir.bootlin.com/linux/v6.3-rc5/source/drivers/md/bcache/alloc.c#L335

static int bch_allocator_thread(void *arg)
{
...
        while (1) {
                long bucket;

                if (!fifo_pop(&ca->free_inc, bucket))
                        break;

                if (ca->discard) {
                        mutex_unlock(&ca->set->bucket_lock);
                        blkdev_issue_discard(ca->bdev,          <<<<<<<<<<<<<
                                bucket_to_sector(ca->set, bucket),
                                ca->sb.bucket_size, GFP_KERNEL);
                        mutex_lock(&ca->set->bucket_lock);
                }
...
        }
...
}

-Eric

> The bcache program from bcache-tools may issue a discard request when
> you run
>
> 	bcache make -C <cache device path>
>
> to create a cache device.
>
> At run time, the bcache code only forwards trim requests to the backing
> device (not the cache device).
>
> Thanks.
>
> Coly Li
>
> [snipped]
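A footnote on the discard point above: whether those per-bucket discards
actually happen depends on the per-cache `discard` switch (the
ca->discard test in the allocator code), which defaults to off. As far as
I can tell it can be read and flipped through sysfs; the exact path below
is an assumption based on the documented cache-device attributes, with
cache0 standing in for the single cache in the set:

  # 0 = buckets are reused without a discard, 1 = discard before reuse
  cat /sys/fs/bcache/<cset-uuid>/cache0/discard

  # Enable per-bucket discards on this cache device (assumed path; the
  # same attribute should also appear under /sys/block/<nvme>/bcache/)
  echo 1 > /sys/fs/bcache/<cset-uuid>/cache0/discard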