On 2020/1/7 7:08 AM, Eric Wheeler wrote:
> On Tue, 7 Jan 2020, Coly Li wrote:
>
>> Now that most of the obvious deadlock and panic problems in bcache are
>> fixed, bcache can survive very high I/O load until ... it consumes
>> all system memory for the bcache btree node cache, and the system hangs
>> or panics.
>>
>> The bcache btree node cache caches bcache btree nodes from the SSD in
>> memory. When determining whether a data block is cached or not by
>> indexing the btree, an in-memory search is much faster.
>>
>> Before this patch, there is no limit on btree node cache memory, just
>> a slab shrinker callback registered with the slab memory manager. If I/O
>> requests arrive faster than the kernel memory management code can shrink
>> the bcache btree node cache, bcache can consume all available system
>> memory for its btree node cache and make the whole system hang or
>> panic. On high performance machines with many CPU cores, large memory
>> size and large SSD capacity, the whole system is often observed to hang
>> or panic after 12+ hours of heavy small random I/O load (e.g. 30K IOPS
>> with 512-byte random reads and writes).
>>
>> This patch tries to limit bcache btree node cache memory consumption by
>> I/O throttling. The idea is,
>> 1) Add kernel thread c->btree_cache_shrink_thread to shrink in-memory
>>    cached btree nodes.
>> 2) Add a threshold c->btree_cache_threshold to limit the number of
>>    in-memory cached btree nodes. If c->btree_cache_used reaches the
>>    threshold, wake up the shrink kernel thread to shrink in-memory
>>    cached btree nodes.
>> 3) In the shrink kernel thread, call __bch_mca_scan() with the
>>    reap_flush parameter set to true, so that all candidate clean and
>>    dirty (flushed to SSD before reaping) btree nodes are shrunk.
>> 4) In the shrink kernel thread main loop, try to shrink 100 btree nodes
>>    by calling __bch_mca_scan().
>>    Inside __bch_mca_scan(), c->bucket_lock is held while all 100 btree
>>    nodes are shrunk. The shrinking does not stop until the number of
>>    in-memory cached btree nodes is less than
>>        c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
>>    MCA_SHRINK_DEFAULT_HYSTERESIS is used to avoid waking up the shrink
>>    kernel thread too frequently; by default its value is 1000.
>>
>> The I/O throttle happens when __bch_mca_scan() is called in the
>> while-loop inside the shrink kernel thread main loop.
>> - Every time __bch_mca_scan() is called, 100 btree nodes are about
>>   to be shrunk. While __bch_mca_scan() shrinks these btree nodes,
>>   c->bucket_lock is held.
>> - When a write request arrives, bcache needs to allocate SSD space for
>>   the cached data with c->bucket_lock held. If __bch_mca_scan() is
>>   executing to shrink btree node memory, the allocation has to
>>   wait for c->bucket_lock.
>> - Allocating an in-memory btree node for a new I/O request also
>>   requires the mutex c->bucket_lock. If __bch_mca_scan() is running,
>>   the new I/O request has to wait until the above 100 btree nodes are
>>   shrunk; only then does it have a chance to compete for c->bucket_lock.
>> Once c->bucket_lock is acquired inside the shrink kernel thread, all
>> other I/Os have to wait until the 100 in-memory btree nodes are shrunk.
>> The I/O requests are thus throttled by the shrink kernel thread until
>> the number of in-memory cached btree nodes is less than
>>     c->btree_cache_threshold - MCA_SHRINK_DEFAULT_HYSTERESIS
>>
>> This is a simple but working method to limit bcache btree node cache
>> memory consumption by I/O throttling.
>>
>> By default c->btree_cache_threshold is 15000. If the btree node size is
>> 2MB (default bucket size), that is around 30GB. It means that if the
>> bcache btree node cache occupies around 30GB of memory, the shrink
>> kernel thread will wake up and start to shrink the bcache in-memory
>> btree node cache.
>>
>> A 30GB in-memory btree node cache is already big enough, so it is a
>> reasonable default threshold. If there are reports in the future that
>> people want to give much more memory to the bcache in-memory btree node
>> cache, a sysfs interface for such configuration will be added later.
>
> Is there already a sysfs that shows what is currently used by the cache?

Yes, it is /sys/fs/bcache/<UUID>/btree_cache_size; it shows the amount of
memory occupied by the in-memory btree node cache.

So far I have not seen out-of-memory reports from others; it seems I am
the only one who has noticed and observed this issue. Therefore I am not
sure whether an extra interface for the btree cache size threshold is
necessary (we have a lot already).

[snipped]

-- 
Coly Li