Re: Writeback cache all used.



On Wed, 5 Apr 2023, Coly Li wrote:
> > 2023年4月5日 04:29,Adriano Silva <adriano_da_silva@xxxxxxxxxxxx> 写道:
> > 
> > Hello,
> > 
> >> It sounds like a large cache with limited memory for caching 
> >> B+tree nodes?
> > 
> >> If memory is limited and all the B+tree nodes on the hot I/O 
> >> paths cannot stay in memory, such behavior is possible. In 
> >> this case, shrinking the cached data may reduce the metadata 
> >> and, consequently, the in-memory B+tree nodes as well. Yes, 
> >> it may be helpful in such a scenario.
> > 
> > There are several servers (TEN) all with 128 GB of RAM, of which 
> > around 100GB (on average) are presented by the OS as free. Cache is 
> > 594GB in size on enterprise NVMe, mass storage is 6TB. The 
> > configuration on all is the same. They run Ceph OSD to service a pool 
> > of disks accessed by servers (others including themselves).
> > 
> > All show the same behavior.
> > 
> > When they were installed, they did not occupy the entire cache. 
> > Throughout use, the cache gradually filled up and never decreased in 
> > size. I have another five servers in another cluster going the same 
> > way. During the night their workload is reduced.
> Copied.
> >> But what is the I/O pattern here? If all the cache space is 
> >> occupied by clean data from read requests, then is write 
> >> performance the concern? Is this a write-heavy, read-heavy, 
> >> or mixed workload?
> > 
> > The workload is write-heavy, though both reads and writes matter. 
> > Write latency is critical. These are virtual machine disks that 
> > are stored on Ceph. Inside we have mixed loads: Windows with 
> > terminal services, Linux, and a database where direct write 
> > latency is critical.
> Copied.
> >> As I explained, the re-reclaim is already in place. But it cannot 
> >> help much if busy I/O requests keep coming and the writeback and 
> >> gc threads have no spare time to run.
> >>
> >> If incoming I/Os exceed the service capacity of the cache, 
> >> disappointed requesters are to be expected.
> > 
> > Today, the ten servers have been without I/O operation for at least 24 
> > hours. Nothing has changed, they continue with 100% cache occupancy. I 
> > believe I should have given time for the GC, no?
> This is nice. Now we have the maximum writeback throughput after I/O 
> has been idle for a while, so after 24 hours all dirty data should be 
> written back and the whole cache might be clean.
> I guess just a gc is needed here.
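Whether writeback has actually finished can be checked from sysfs before triggering anything; a quick sketch, assuming the backing device shows up as bcache0 (adjust to your layout):

	# State of the backing device: "clean" once writeback has finished,
	# "dirty" while dirty data remains in the cache.
	cat /sys/block/bcache0/bcache/state

	# Amount of dirty data for this backing device still held in the cache.
	cat /sys/block/bcache0/bcache/dirty_data
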
> Can you try writing 1 to the cache set sysfs file gc_after_writeback? 
> When it is set, a gc will be woken up automatically after all 
> writeback is accomplished. Then most of the clean cache might be 
> shrunk and the number of cached B+tree nodes reduced quite a lot.
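In sysfs terms, Coly's suggestion looks something like the following sketch. The `<cset-uuid>` path component is a placeholder for the cache set's UUID, and the exact location of the attribute may vary by kernel version:

	# Arrange for a gc run to be triggered automatically once all
	# writeback completes (gc_after_writeback is a cache set attribute).
	echo 1 > /sys/fs/bcache/<cset-uuid>/gc_after_writeback
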

If writeback is done then you might need this to trigger it, too:
	echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

Question for Coly: Will `gc_after_writeback` evict read-promoted pages, 
too, making this effectively a writeback-only cache?

Here is the commit:

> >> Let’s check whether it is just because of insufficient 
> >> memory to hold the hot B+tree nodes in memory.
> > 
> > Does the bcache configuration have any RAM memory reservation options? 
> > Or would the 100GB of RAM be insufficient for the 594GB of NVMe Cache? 
> > For that amount of Cache, how much RAM should I have reserved for 
> > bcache? Is there any command or parameter I should use to signal 
> > bcache that it should reserve this RAM memory? I didn't do anything 
> > about this matter. How would I do it?
> > 
> Currently there is no option to limit bcache's in-memory B+tree node 
> cache occupation, but when the I/O load drops, such memory 
> consumption may fall very quickly via the reaper in the system memory 
> management code. So it won’t be a problem. Bcache will try to use any 
> available memory for the B+tree node cache if necessary, and throttle 
> I/O performance to return this memory to the memory

Does bcache intentionally throttle I/O under memory pressure, or is the 
I/O throttling just a side-effect of increased memory pressure caused by 
fewer B+tree nodes in cache?

> management code when the available system memory is low. By default 
> it should work well, and nothing needs to be done by the user.
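For what it's worth, the B+tree node cache's current footprint can be observed from sysfs; a sketch, assuming the usual cache-set layout (`<cset-uuid>` is a placeholder):

	# Size of the in-memory B+tree node cache for this cache set.
	cat /sys/fs/bcache/<cset-uuid>/btree_cache_size
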
> > Another question: How do I know if I should trigger a TRIM (discard) 
> > for my NVMe with bcache?
> Bcache doesn’t issue trim requests proactively. 

Are you sure?  Maybe I misunderstood the code here, but it looks like 
buckets get a discard during allocation:

	static int bch_allocator_thread(void *arg)
	{
		struct cache *ca = arg;
		...
		while (1) {
			long bucket;

			if (!fifo_pop(&ca->free_inc, bucket))
				break;

			if (ca->discard) {
				...
				blkdev_issue_discard(ca->bdev,	<<<<<<<<<<<<<
					bucket_to_sector(ca->set, bucket),
					ca->sb.bucket_size, GFP_KERNEL);
				...
			}
			...
		}
		...
	}

> The bcache program from bcache-tools may issue a discard request when 
> you run
> 	bcache make -C <cache device path>
> to create a cache device.
> At run time, the bcache code only forwards trim requests to the 
> backing device (not the cache device).
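That said, the ca->discard branch in the allocator excerpt above only runs when discard is enabled on the cache device. A sketch of checking and enabling it; the paths assume the usual sysfs layout, with `<cset-uuid>` as a placeholder for the cache set's UUID:

	# Whether discard-on-bucket-allocation is enabled for cache device 0.
	cat /sys/fs/bcache/<cset-uuid>/cache0/discard

	# Enable it: buckets are then discarded as the allocator recycles them.
	echo 1 > /sys/fs/bcache/<cset-uuid>/cache0/discard
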

> Thanks.
> Coly Li
> > 
> [snipped]
