Hi folks, Please hold on, this patch fails on a large size read test case this morning. I will post a new version after the issue fixed. Thanks. Coly Li On 5/18/21 12:22 AM, colyli@xxxxxxx wrote: > From: Coly Li <colyli@xxxxxxx> > > In the cache missing code path of cached device, if a proper location > from the internal B+ tree is matched for a cache miss range, function > cached_dev_cache_miss() will be called in cache_lookup_fn() in the > following code block, > [code block 1] > 526 unsigned int sectors = KEY_INODE(k) == s->iop.inode > 527 ? min_t(uint64_t, INT_MAX, > 528 KEY_START(k) - bio->bi_iter.bi_sector) > 529 : INT_MAX; > 530 int ret = s->d->cache_miss(b, s, bio, sectors); > > Here s->d->cache_miss() is the call backfunction pointer initialized as > cached_dev_cache_miss(), the last parameter 'sectors' is an important > hint to calculate the size of read request to backing device of the > missing cache data. > > Current calculation in above code block may generate oversized value of > 'sectors', which consequently may trigger 2 different potential kernel > panics by BUG() or BUG_ON() as listed below, > > 1) BUG_ON() inside bch_btree_insert_key(), > [code block 2] > 886 BUG_ON(b->ops->is_extents && !KEY_SIZE(k)); > 2) BUG() inside biovec_slab(), > [code block 3] > 51 default: > 52 BUG(); > 53 return NULL; > > All the above panics are original from cached_dev_cache_miss() by the > oversized parameter 'sectors'. > > Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate > the size of data read from backing device for the cache missing. This > size is stored in s->insert_bio_sectors by the following lines of code, > [code block 4] > 909 s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada); > > Then the actual key inserting to the internal B+ tree is generated and > stored in s->iop.replace_key by the following lines of code, > [code block 5] > 911 s->iop.replace_key = KEY(s->iop.inode, > 912 bio->bi_iter.bi_sector + s->insert_bio_sectors, > 913 s->insert_bio_sectors); > The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from > the above code block. > > And the bio sending to backing device for the missing data is allocated > with hint from s->insert_bio_sectors by the following lines of code, > [code block 6] > 926 cache_bio = bio_alloc_bioset(GFP_NOWAIT, > 927 DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS), > 928 &dc->disk.bio_split); > The oversized parameter 'sectors' may trigger panic 2) by BUG() from the > agove code block. > > Now let me explain how the panics happen with the oversized 'sectors'. > In code block 5, replace_key is generated by macro KEY(). From the > definition of macro KEY(), > [code block 7] > 71 #define KEY(inode, offset, size) \ > 72 ((struct bkey) { \ > 73 .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode), \ > 74 .low = (offset) \ > 75 }) > > Here 'size' is 16bits width embedded in 64bits member 'high' of struct > bkey. But in code block 1, if "KEY_START(k) - bio->bi_iter.bi_sector" is > very probably to be larger than (1<<16) - 1, which makes the bkey size > calculation in code block 5 is overflowed. In one bug report the value > of parameter 'sectors' is 131072 (= 1 << 17), the overflowed 'sectors' > results the overflowed s->insert_bio_sectors in code block 4, then makes > size field of s->iop.replace_key to be 0 in code block 5. Then the 0- > sized s->iop.replace_key is inserted into the internal B+ tree as cache > missing check key (a special key to detect and avoid a racing between > normal write request and cache missing read request) as, > [code block 8] > 915 ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key); > > Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey > size check BUG_ON() in code block 2, and causes the kernel panic 1). > > Another kernel panic is from code block 6, is from the oversized value > s->insert_bio_sectors resulted by the oversized 'sectors'. From a bug > report the result of "DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS)" > from code block 6 can be 344, 282, 946, 342 and many other values which > larther than BIO_MAX_VECS (a.k.a 256). When calling bio_alloc_bioset() > with such larger-than-256 value as the 2nd parameter, this value will > eventually be sent to biovec_slab() as parameter 'nr_vecs' in following > code path, > bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab() > > Because parameter 'nr_vecs' is larger-than-256 value, the panic by BUG() > in code block 3 is triggered inside biovec_slab(). > > From the above analysis, it is obvious that in order to avoid the kernel > panics in code block 2 and code block 3, the calculated 'sectors' in > code block 1 should not generate overflowed value in code block 5 and > code block 6. > > To avoid overflow in code block 5, the maximum 'sectors' value should be > equal or less than (1 << KEY_SIZE_BITS) - 1. And to avoid overflow in > code block 6, the maximum 'sectors' value should be euqal or less than > BIO_MAX_VECS * PAGE_SECTORS. Considering the kernel page size can be > variable, a reasonable maximum limitation of 'sectors' in code block 1 > should be smaller on from "(1 << KEY_SIZE_BITS) - 1" and "BIO_MAX_VECS * > PAGE_SECTORS". > > In this patch, a local variable inside cache_lookup_fn() is added as, > max_cache_miss_size = min_t(uint32_t, > (1 << KEY_SIZE_BITS) - 1, BIO_MAX_VECS * PAGE_SECTORS); > Then code block 1 uses max_cache_miss_size to limit the maximum value of > 'sectors' calculation as, > 533 unsigned int sectors = KEY_INODE(k) == s->iop.inode > 534 ? min_t(uint64_t, max_cache_miss_size, > 535 KEY_START(k) - bio->bi_iter.bi_sector) > 536 : max_cache_miss_size; > > Now the calculated 'sectors' value sent into cached_dev_cache_miss() > won't trigger neither of the above kernel panics. > > Current problmatic code can be partially found since Linux v5.13-rc1, > therefore all maintained stable kernels should try to apply this fix. > > Reported-by: Diego Ercolani <diego.ercolani@xxxxxxxxx> > Reported-by: Jan Szubiak <jan.szubiak@xxxxxxxxxxxxxx> > Reported-by: Marco Rebhan <me@xxxxxxxxxxxx> > Reported-by: Matthias Ferdinand <bcache@xxxxxxxxx> > Reported-by: Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx> > Reported-by: Victor Westerhuis <victor@xxxxxxxxxxx> > Reported-by: Vojtech Pavlik <vojtech@xxxxxxx> > Signed-off-by: Coly Li <colyli@xxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > Cc: Takashi Iwai <tiwai@xxxxxxxx> > Cc: Kent Overstreet <kent.overstreet@xxxxxxxxx> > --- > drivers/md/bcache/request.c | 15 +++++++++++---- > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c > index 29c231758293..90fa9ac47661 100644 > --- a/drivers/md/bcache/request.c > +++ b/drivers/md/bcache/request.c > @@ -515,18 +515,25 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k) > struct search *s = container_of(op, struct search, op); > struct bio *n, *bio = &s->bio.bio; > struct bkey *bio_key; > - unsigned int ptr; > + unsigned int ptr, max_cache_miss_size; > > if (bkey_cmp(k, &KEY(s->iop.inode, bio->bi_iter.bi_sector, 0)) <= 0) > return MAP_CONTINUE; > > + /* > + * Make sure the cache missing size won't exceed the restrictions of > + * max bkey size and max bio's bi_max_vecs. > + */ > + max_cache_miss_size = min_t(uint32_t, > + (1 << KEY_SIZE_BITS) - 1, BIO_MAX_VECS * PAGE_SECTORS); > + > if (KEY_INODE(k) != s->iop.inode || > KEY_START(k) > bio->bi_iter.bi_sector) { > unsigned int bio_sectors = bio_sectors(bio); > unsigned int sectors = KEY_INODE(k) == s->iop.inode > - ? min_t(uint64_t, INT_MAX, > + ? min_t(uint64_t, max_cache_miss_size, > KEY_START(k) - bio->bi_iter.bi_sector) > - : INT_MAX; > + : max_cache_miss_size; > int ret = s->d->cache_miss(b, s, bio, sectors); > > if (ret != MAP_CONTINUE) > @@ -547,7 +554,7 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k) > if (KEY_DIRTY(k)) > s->read_dirty_data = true; > > - n = bio_next_split(bio, min_t(uint64_t, INT_MAX, > + n = bio_next_split(bio, min_t(uint64_t, max_cache_miss_size, > KEY_OFFSET(k) - bio->bi_iter.bi_sector), > GFP_NOIO, &s->d->bio_split); > >