Re: [RFC PATCH 5/7] bcache: limit bcache btree node cache memory consumption by I/O throttle

On 2020/1/7 7:08 AM, Eric Wheeler wrote:
> On Tue, 7 Jan 2020, Coly Li wrote:
> 
>> Now that most of the obvious deadlock and panic problems in bcache are
>> fixed, bcache can survive under very high I/O load until ... it consumes
>> all system memory for the bcache btree node cache, and the system hangs
>> or panics.
>>
>> The bcache btree node cache keeps bcache btree nodes from the SSD in
>> memory. When bcache indexes the btree to determine whether a data block
>> is cached or not, an in-memory search is much faster.
>>
>> Before this patch, there is no limit on btree node cache memory, just
>> a slab shrinker callback registered with the slab memory manager. If I/O
>> requests arrive faster than the kernel memory management code can shrink
>> the bcache btree node cache, bcache can consume all available system
>> memory for its btree node cache and make the whole system hang or panic.
>> On high-performance machines with many CPU cores, large memory and large
>> SSD capacity, the whole system is often observed to hang or panic after
>> 12+ hours of heavy small random I/O load (e.g. 30K IOPS of 512-byte
>> random reads and writes).
>>
>> This patch tries to limit bcache btree node cache memory consumption by
>> I/O throttle. The idea is,
>> 1) Add kernel thread c->btree_cache_shrink_thread to shrink in-memory
>>    cached btree nodes.
>> 2) Add a threshold c->btree_cache_threshold to limit the number of
>>    in-memory cached btree nodes. If c->btree_cache_used reaches the
>>    threshold, wake up the shrink kernel thread to shrink in-memory
>>    cached btree nodes.
>> 3) In the shrink kernel thread, call __bch_mca_scan() with the
>>    reap_flush parameter set to true, so that all candidate clean and
>>    dirty (flushed to SSD before being reaped) btree nodes are shrunk.
>> 4) In the shrink kernel thread main loop, try to shrink 100 btree nodes
>>    per call to __bch_mca_scan(). Inside __bch_mca_scan() c->bucket_lock
>>    is held while all 100 btree nodes are shrunk. The shrinking does not
>>    stop until the number of in-memory cached btree nodes is less than
>> 	c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
>>    MCA_SHRINK_DEFAULT_HYSTERESIS keeps the shrink kernel thread from
>>    being woken up too frequently; its default value is 1000.
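
For illustration only, steps 2) and 4) can be modelled in userspace as a
wake-up test plus a batched loop with a low-water mark. This is a sketch,
not the patch code: the function names are hypothetical, and it assumes
every scan frees exactly one full batch, which the real __bch_mca_scan()
does not guarantee.

```python
# Userspace model only; constants are taken from the patch description.
THRESHOLD = 15000    # default c->btree_cache_threshold
HYSTERESIS = 1000    # default MCA_SHRINK_DEFAULT_HYSTERESIS
BATCH = 100          # nodes targeted per __bch_mca_scan() call


def should_wake(used, threshold=THRESHOLD):
    """Step 2: wake the shrink thread once usage reaches the threshold."""
    return used >= threshold


def shrink_until_low(used, threshold=THRESHOLD,
                     hysteresis=HYSTERESIS, batch=BATCH):
    """Step 4: shrink in batches until usage is below threshold - hysteresis.

    Returns (remaining_nodes, batches_run). Assumption: each batch frees
    exactly `batch` nodes (or whatever is left, if fewer remain).
    """
    low_water = threshold - hysteresis
    batches = 0
    # Stop when below the low-water mark, or when nothing is left to free.
    while used >= low_water and used > 0:
        used -= min(batch, used)
        batches += 1
    return used, batches
```

With the defaults, a cache that has just reached 15000 nodes is shrunk in
11 batches down to 13900 nodes, so the thread is not woken again until
another 1100 nodes have been cached.
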
>>
>> The I/O throttle happens when __bch_mca_scan() is called in the while-
>> loop inside the shrink kernel thread main loop.
>> - Every time __bch_mca_scan() is called, it tries to shrink 100 btree
>>   nodes, and c->bucket_lock is held while all of them are shrunk.
>> - When a write request comes in, bcache needs to allocate SSD space for
>>   the cached data with c->bucket_lock held. If __bch_mca_scan() is
>>   executing to shrink btree node memory, the allocation has to wait for
>>   c->bucket_lock.
>> - Allocating an in-memory btree node for a new I/O request also
>>   requires the mutex c->bucket_lock. If __bch_mca_scan() is running,
>>   the new I/O request has to wait until the above 100 btree nodes are
>>   shrunk; only then does it get a chance to compete for c->bucket_lock.
>> Once c->bucket_lock is acquired inside the shrink kernel thread, all
>> other I/Os have to wait until the 100 in-memory cached btree nodes are
>> shrunk. The I/O requests are thus throttled by the shrink kernel thread
>> until the number of in-memory cached btree nodes is less than
>> 	c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
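
The throttling effect of holding the lock across a whole batch can be
sketched in userspace as follows. This is only a model under stated
assumptions: bucket_lock stands in for c->bucket_lock, try_allocate() and
shrink_batch() are hypothetical names, and the I/O path uses a non-blocking
try-lock so the contention is visible, whereas the kernel mutex would make
the I/O path sleep.

```python
import threading

bucket_lock = threading.Lock()   # stand-in for c->bucket_lock


def try_allocate():
    """Model an I/O path that needs bucket_lock.

    Returns False when the lock is busy, i.e. when the real I/O path
    would have to wait and is therefore throttled.
    """
    if not bucket_lock.acquire(blocking=False):
        return False
    try:
        return True              # allocation work would happen here
    finally:
        bucket_lock.release()


def shrink_batch(cache, batch=100, probe=None):
    """Free up to `batch` entries while holding bucket_lock throughout.

    `probe`, if given, is called with the lock still held to show what a
    concurrent allocation attempt would observe during the batch.
    Returns (nodes_freed, probe_was_blocked).
    """
    with bucket_lock:
        freed = min(batch, len(cache))
        del cache[:freed]
        blocked = probe is not None and not probe()
    return freed, blocked
```

Calling shrink_batch(cache, probe=try_allocate) reports the probe as
blocked for the duration of the batch, while try_allocate() succeeds again
as soon as the batch is done.
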
>>
>> This is a simple but working method to limit bcache btree node cache
>> memory consumption by I/O throttle.
>>
>> By default c->btree_cache_threshold is 15000. If the btree node size is
>> 2MB (the default bucket size), that is around 30GB. It means that once
>> the bcache btree node cache occupies around 30GB of memory, the shrink
>> kernel thread wakes up and starts to shrink the bcache in-memory btree
>> node cache.
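
The arithmetic behind the 30GB figure can be checked with a few lines. The
helper name is hypothetical, and the sketch assumes "2MB" means 2 MiB per
btree node.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def btree_cache_bytes(nodes=15000, node_size_bytes=2 * MIB):
    """Memory taken by `nodes` cached btree nodes of the given size."""
    return nodes * node_size_bytes


at_threshold = btree_cache_bytes()         # 15000 nodes at the threshold
at_low_water = btree_cache_bytes(14000)    # threshold minus hysteresis of 1000
```

Here at_threshold / GIB is about 29.3 and at_low_water / GIB is about 27.3,
so "around 30GB" is the point where shrinking starts, and roughly 2GB is
released before the shrink thread goes back to sleep.
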
>>
>> A 30GB in-memory btree node cache is already big enough, so it is a
>> reasonable default threshold. If there are reports in the future that
>> people do want to give much more memory to the bcache in-memory btree
>> node cache, a sysfs interface for this configuration will be added later.
> 
> Is there already a sysfs entry that shows what is currently used by the cache?

Yes, it is /sys/fs/bcache/<UUID>/btree_cache_size; it shows the amount of
memory occupied by the in-memory btree node cache.
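
A minimal sketch for collecting that attribute from every cache set on a
machine is shown below. The function name is hypothetical, and the value is
returned as raw text because its exact format is not stated in this thread.

```python
from pathlib import Path


def read_btree_cache_sizes(base="/sys/fs/bcache"):
    """Return {cache-set directory name: raw btree_cache_size text}.

    Directories without a btree_cache_size attribute are skipped, and a
    missing base directory yields an empty dict.
    """
    sizes = {}
    root = Path(base)
    if not root.is_dir():
        return sizes
    for entry in sorted(root.iterdir()):
        attr = entry / "btree_cache_size"
        if attr.is_file():
            sizes[entry.name] = attr.read_text().strip()
    return sizes
```

Polling this during a long random I/O run would show whether the btree node
cache keeps growing toward the threshold.
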

So far I have not seen out-of-memory reports from others; it seems I am
the only one who has noticed and observed this issue. Therefore I am not
sure whether an extra interface for the btree cache size threshold is
necessary (we already have a lot of them).


[snipped]

-- 

Coly Li


