On Tue, 7 Jan 2020, Coly Li wrote:
> On 2020/1/7 7:08 AM, Eric Wheeler wrote:
> > On Tue, 7 Jan 2020, Coly Li wrote:
> >
> >> Now most of the obvious deadlock and panic problems in bcache are
> >> fixed, so bcache can now survive under very high I/O load until ...
> >> it consumes all system memory for the bcache btree node cache, and
> >> the system hangs or panics.
> >>
> >> The bcache btree node cache is used to cache bcache btree nodes from
> >> SSD in memory. When determining whether a data block is cached or
> >> not by indexing the btree, an in-memory search is much faster.
> >>
> >> Before this patch, there is no btree node cache memory limitation,
> >> just a slab shrinker callback registered to the slab memory manager.
> >> If I/O requests arrive faster than the kernel memory management code
> >> can shrink the bcache btree node cache, it is possible for bcache to
> >> consume all available system memory for its btree node cache, and
> >> make the whole system hang or panic. On high performance machines
> >> with many CPU cores, large memory size and SSD capacity, it is often
> >> observed that the whole system hangs or panics after 12+ hours of
> >> heavy small random I/O load (e.g. 30K IOPS with 512 byte random
> >> reads and writes).
> >>
> >> This patch tries to limit bcache btree node cache memory consumption
> >> by I/O throttling. The idea is,
> >> 1) Add kernel thread c->btree_cache_shrink_thread to shrink in-memory
> >>    cached btree nodes.
> >> 2) Add a threshold c->btree_cache_threshold to limit the number of
> >>    in-memory cached btree nodes. If c->btree_cache_used reaches the
> >>    threshold, wake up the shrink kernel thread to shrink in-memory
> >>    cached btree nodes.
> >> 3) In the shrink kernel thread, call __bch_mca_scan() with the
> >>    reap_flush parameter set to true, so all candidate clean and dirty
> >>    (flushed to SSD before reaping) btree nodes will be shrunk.
> >> 4) In the shrink kernel thread main loop, try to shrink 100 btree
> >>    nodes by calling __bch_mca_scan(). Inside __bch_mca_scan()
> >>    c->bucket_lock is held while shrinking all 100 btree nodes. The
> >>    shrinking does not stop until the number of in-memory cached btree
> >>    nodes is less than
> >>        c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
> >>    MCA_SHRINK_DEFAULT_HYSTERESIS is used to avoid waking up the
> >>    shrink kernel thread too frequently; by default its value is 1000.
> >>
> >> The I/O throttle happens when __bch_mca_scan() is called in the
> >> while-loop inside the shrink kernel thread main loop.
> >> - Every time __bch_mca_scan() is called, 100 btree nodes are about
> >>   to be shrunk. While __bch_mca_scan() is shrinking the btree nodes,
> >>   c->bucket_lock is held.
> >> - When a write request comes in, bcache needs to allocate SSD space
> >>   for the cached data with c->bucket_lock held. If __bch_mca_scan()
> >>   is executing to shrink btree node memory, the allocation operation
> >>   has to wait for c->bucket_lock.
> >> - When allocating an in-memory btree node for a new I/O request,
> >>   mutex c->bucket_lock is also required. If __bch_mca_scan() is
> >>   running, the new I/O request has to wait until the above 100 btree
> >>   nodes are shrunk; then the new I/O request has a chance to compete
> >>   for c->bucket_lock.
> >> Once c->bucket_lock is acquired inside the shrink kernel thread, all
> >> other I/Os have to wait until the 100 in-memory btree nodes are
> >> shrunk. The I/O requests are thus throttled by the shrink kernel
> >> thread until the number of in-memory cached btree nodes is less than
> >>     c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
> >>
> >> This is a simple but working method to limit bcache btree node cache
> >> memory consumption by I/O throttling.
> >>
> >> By default c->btree_cache_threshold is 15000. If the btree node size
> >> is 2MB (default bucket size), that is around 30GB.
> >> It means if the bcache btree node cache occupies around 30GB of
> >> memory, the shrink kernel thread will wake up and start to shrink
> >> the bcache in-memory btree node cache.
> >>
> >> A 30GB in-memory btree node cache is already big enough, so it is a
> >> reasonable default threshold. If there are reports in the future
> >> that people do want to offer much more memory for the bcache
> >> in-memory btree node cache, there will be a sysfs interface later
> >> for such configuration.
> >
> > Is there already a sysfs that shows what is currently used by the cache?
>
> Yes, it is /sys/fs/bcache/<UUID>/btree_cache_size, it shows the amount of
> memory occupied by the in-memory btree node cache.
>
> So far I don't see out-of-memory reports from others; it seems only I
> have noticed and observed this issue. Therefore I am not sure whether
> an extra interface for the btree cache size threshold is necessary (we
> already have a lot).

FYI, these are some of our btree_cache_size's:

~]# cat /sys/fs/bcache/*/btree_cache_size
263.6M
~]# cat /sys/fs/bcache/*/btree_cache_size
174.2M
~]# cat /sys/fs/bcache/*/btree_cache_size
523.5M
~]# cat /sys/fs/bcache/*/btree_cache_size
464.2M
~]# cat /sys/fs/bcache/*/btree_cache_size
847.2M
~]# cat /sys/fs/bcache/*/btree_cache_size
864.0M
~]# cat /sys/fs/bcache/*/btree_cache_size
441.2M
~]# cat /sys/fs/bcache/*/btree_cache_size
1.2G

--
Eric Wheeler

>
> > [snipped]
>
> --
>
> Coly Li
>
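
For anyone following the thread, the threshold and hysteresis behaviour
described in the quoted patch can be modelled roughly as below. This is a toy
Python sketch, not kernel code. The constants come from the patch description
(15000 nodes, hysteresis 1000, batches of 100, 2MB nodes); the function names
are hypothetical, and the model assumes that every __bch_mca_scan() call reaps
a full batch and that no new nodes are allocated while shrinking, which the
real code does not guarantee.

```python
# Toy model of the threshold + hysteresis scheme from the quoted patch.
# Constants mirror the patch description; function names are hypothetical.

BTREE_CACHE_THRESHOLD = 15000          # default c->btree_cache_threshold
MCA_SHRINK_DEFAULT_HYSTERESIS = 1000   # default hysteresis
SHRINK_BATCH = 100                     # nodes per __bch_mca_scan() call
NODE_SIZE_MB = 2                       # btree node size assumed in the patch


def threshold_memory_mb(threshold=BTREE_CACHE_THRESHOLD, node_mb=NODE_SIZE_MB):
    """Memory held by the node cache when the threshold is reached, in MB."""
    return threshold * node_mb


def should_wake_shrinker(used, threshold=BTREE_CACHE_THRESHOLD):
    """The shrink thread is woken once the cached node count reaches the threshold."""
    return used >= threshold


def shrink_until_low_water(used,
                           threshold=BTREE_CACHE_THRESHOLD,
                           hysteresis=MCA_SHRINK_DEFAULT_HYSTERESIS,
                           batch=SHRINK_BATCH):
    """Reap nodes in batches until used < threshold - hysteresis.

    Each loop iteration stands in for one __bch_mca_scan() call made with
    c->bucket_lock held, i.e. one window in which other I/O has to wait.
    Assumes each call reaps a full batch (a simplification).
    Returns (remaining_nodes, number_of_batches).
    """
    low_water = threshold - hysteresis
    batches = 0
    while used >= low_water and used > 0:
        used -= min(batch, used)
        batches += 1
    return used, batches


if __name__ == "__main__":
    print("threshold in MB:", threshold_memory_mb())
    print("wake at 15000 nodes:", should_wake_shrinker(15000))
    print("after shrinking:", shrink_until_low_water(15000))
```

In this model the number of batches is also the number of times
c->bucket_lock is taken by the shrink thread, which is where the throttling of
other I/O would come from.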