On Wed, May 26, 2021 at 11:14:50PM +0800, Coly Li wrote:
> In the cache missing code path of cached device, if a proper location
> from the internal B+ tree is matched for a cache miss range, function
> cached_dev_cache_miss() will be called in cache_lookup_fn() in the
> following code block,
> [code block 1]
>   unsigned int sectors = KEY_INODE(k) == s->iop.inode
>           ? min_t(uint64_t, INT_MAX,
>                   KEY_START(k) - bio->bi_iter.bi_sector)
>           : INT_MAX;
>   int ret = s->d->cache_miss(b, s, bio, sectors);
>
> Here s->d->cache_miss() is the callback function pointer initialized as
> cached_dev_cache_miss(), the last parameter 'sectors' is an important
> hint to calculate the size of read request to backing device of the
> missing cache data.
>
> Current calculation in above code block may generate oversized value of
> 'sectors', which consequently may trigger 2 different potential kernel
> panics by BUG() or BUG_ON() as listed below,
>
> 1) BUG_ON() inside bch_btree_insert_key(),
> [code block 2]
>   BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
> 2) BUG() inside biovec_slab(),
> [code block 3]
>   default:
>           BUG();
>           return NULL;
>
> All the above panics originate from cached_dev_cache_miss() by the
> oversized parameter 'sectors'.
>
> Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
> the size of data read from backing device for the cache missing. This
> size is stored in s->insert_bio_sectors by the following lines of code,
> [code block 4]
>   s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
>
> Then the actual key inserting to the internal B+ tree is generated and
> stored in s->iop.replace_key by the following lines of code,
> [code block 5]
>   s->iop.replace_key = KEY(s->iop.inode,
>                   bio->bi_iter.bi_sector + s->insert_bio_sectors,
>                   s->insert_bio_sectors);
> The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
> the above code block.
>
> And the bio sending to backing device for the missing data is allocated
> with hint from s->insert_bio_sectors by the following lines of code,
> [code block 6]
>   cache_bio = bio_alloc_bioset(GFP_NOWAIT,
>                   DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
>                   &dc->disk.bio_split);
> The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
> above code block.
>
> Now let me explain how the panics happen with the oversized 'sectors'.
> In code block 5, replace_key is generated by macro KEY(). From the
> definition of macro KEY(),
> [code block 7]
>   #define KEY(inode, offset, size)                                  \
>   ((struct bkey) {                                                  \
>           .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode), \
>           .low = (offset)                                           \
>   })
>
> Here 'size' is 16 bits wide, embedded in the 64-bit member 'high' of
> struct bkey. But in code block 1, "KEY_START(k) - bio->bi_iter.bi_sector"
> may well be larger than (1<<16) - 1, which makes the bkey size
> calculation in code block 5 overflow. In one bug report the value of
> parameter 'sectors' is 131072 (= 1 << 17), the overflowed 'sectors'
> results in an overflowed s->insert_bio_sectors in code block 4, which
> then makes the size field of s->iop.replace_key 0 in code block 5. Then
> the 0-sized s->iop.replace_key is inserted into the internal B+ tree as
> cache missing check key (a special key to detect and avoid a race
> between normal write request and cache missing read request) as,
> [code block 8]
>   ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
>
> Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey
> size check BUG_ON() in code block 2, and causes the kernel panic 1).
>
> Another kernel panic is from code block 6, caused by an oversized bvecs
> number derived from s->insert_bio_sectors in code block 4,
>         min(sectors, bio_sectors(bio) + reada)
> There are two possibilities for an oversized result,
> - bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
> - sectors < bio_sectors(bio) + reada, but sectors is oversized.
>
> From a bug report the result of "DIV_ROUND_UP(s->insert_bio_sectors,
> PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many
> other values which are larger than BIO_MAX_VECS (a.k.a 256). When calling
> bio_alloc_bioset() with such larger-than-256 value as the 2nd parameter,
> this value will eventually be sent to biovec_slab() as parameter
> 'nr_vecs' in following code path,
>    bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
> Because parameter 'nr_vecs' is larger-than-256 value, the panic by BUG()
> in code block 3 is triggered inside biovec_slab().
>
> From the above analysis, we know that the 4th parameter 'sectors' sent
> into cached_dev_cache_miss() may cause overflow in code block 5 and 6,
> and finally cause kernel panic in code block 2 and 3. And if result of
> bio_sectors(bio) + reada exceeds valid bvecs number, it may also trigger
> kernel panic in code block 3 from code block 6.
>
> In this patch, the above two panics are avoided by the following
> changes,
> - If DIV_ROUND_UP(bio_sectors(bio) + reada, PAGE_SECTORS) exceeds the
>   maximum bvecs counter, reduce reada to make sure the DIV_ROUND_UP()
>   result won't generate an oversized s->insert_bio_sectors to cause
>   invalid bvecs number to cache_bio.
> - If sectors exceeds the maximum bkey size, then set sectors to the
>   maximum valid bkey size.
>
> By the above changes, in code block 5 the size value in KEY() macro will
> always be in valid range. As well in code block 6, the nr_iovecs
> parameter of bio_alloc_bioset() calculated by
> DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS) will always be a valid
> bvecs number. Now both panics won't happen anymore.
>
> Current problematic code can be partially found since Linux v5.13-rc1,
> therefore all maintained stable kernels should try to apply this fix.
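
For anyone following the arithmetic, here is a minimal user-space sketch
of the two failure modes described above. It is only a model, not kernel
code: the helper names (model_key_high, model_key_size, model_nr_vecs)
are made up for illustration, PAGE_SECTORS = 8 assumes 4 KiB pages with
512-byte sectors, MAX_VECS = 256 stands in for BIO_MAX_VECS as quoted in
the message, and reading the size back with a 16-bit mask at bit 20 is
an assumption based on code block 7 and the stated field width.

```c
#include <assert.h>
#include <stdint.h>

#define KEY_SIZE_BITS   16      /* width of the bkey size field (from the message) */
#define KEY_SIZE_SHIFT  20      /* shift used by KEY() in code block 7 */
#define KEY_SIZE_MAX    ((1U << KEY_SIZE_BITS) - 1)
#define PAGE_SECTORS    8U      /* assumption: 4 KiB pages, 512-byte sectors */
#define MAX_VECS        256U    /* stand-in for BIO_MAX_VECS */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static_assert(KEY_SIZE_MAX == 65535, "size field is expected to be 16 bits");

/* Model of the 'high' word built by KEY(): size is shifted in, not masked. */
static inline uint64_t model_key_high(uint64_t inode, uint64_t size)
{
	return (1ULL << 63) | (size << KEY_SIZE_SHIFT) | inode;
}

/* Model of reading the size back: only the 16-bit field is visible. */
static inline unsigned int model_key_size(uint64_t high)
{
	return (unsigned int)((high >> KEY_SIZE_SHIFT) & KEY_SIZE_MAX);
}

/* Number of bvecs that code block 6 would request for a sector count. */
static inline unsigned int model_nr_vecs(unsigned int sectors)
{
	return DIV_ROUND_UP(sectors, PAGE_SECTORS);
}
```

With these helpers, a size of 131072 reads back as 0 (the condition
checked by the BUG_ON() in code block 2), and a 2752-sector request maps
to 344 bvecs, which is above the 256 limit.
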
>
> Reported-by: Diego Ercolani <diego.ercolani@xxxxxxxxx>
> Reported-by: Jan Szubiak <jan.szubiak@xxxxxxxxxxxxxx>
> Reported-by: Marco Rebhan <me@xxxxxxxxxxxx>
> Reported-by: Matthias Ferdinand <bcache@xxxxxxxxx>
> Reported-by: Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx>
> Reported-by: Victor Westerhuis <victor@xxxxxxxxxxx>
> Reported-by: Vojtech Pavlik <vojtech@xxxxxxx>
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> Cc: Takashi Iwai <tiwai@xxxxxxxx>
> ---
> Changelog:
> v4, not directly access BIO_MAX_VECS and reduce reada value to avoid
>     oversized bvecs number, by hint from Christoph Hellwig.
> v3, fix typo in v2.
> v2, fix the bypass bio size calculation in v1.
> v1, the initial version
>
>  drivers/md/bcache/request.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 29c231758293..054948f037ed 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -883,6 +883,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  	unsigned int reada = 0;
>  	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
>  	struct bio *miss, *cache_bio;
> +	unsigned int nr_bvecs, max_segs;
>
>  	s->cache_missed = 1;
>
> @@ -899,6 +900,24 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  			      get_capacity(bio->bi_bdev->bd_disk) -
>  			      bio_end_sector(bio));
>
> +	/*
> +	 * If "bio_sectors(bio) + reada" may causes an oversized bio bvecs
> +	 * number, reada size must be deducted to make sure the following
> +	 * calculated s->insert_bio_sectors won't cause oversized bvecs number
> +	 * to cache_bio.
> +	 */
> +	nr_bvecs = DIV_ROUND_UP(bio_sectors(bio) + reada, PAGE_SECTORS);

Can't this overflow if bio_sectors(bio) is close to UINT_MAX already?
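
To make the concern concrete, a small user-space sketch of the
wraparound follows. The function names are hypothetical and this is not
a proposal for the actual fix; it only shows that the sum is evaluated
in unsigned int and can wrap, and that widening before the addition is
one possible way to avoid it. PAGE_SECTORS = 8 again assumes 4 KiB pages.

```c
#include <assert.h>
#include <limits.h>

#define PAGE_SECTORS 8U         /* assumption: 4 KiB pages, 512-byte sectors */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

/* Same shape as the hunk above: the sum is computed in unsigned int. */
static inline unsigned int nr_bvecs_naive(unsigned int bio_secs,
					  unsigned int reada)
{
	return DIV_ROUND_UP(bio_secs + reada, PAGE_SECTORS);
}

/* Hypothetical alternative: widen to 64 bits before adding. */
static inline unsigned long long nr_bvecs_wide(unsigned int bio_secs,
					       unsigned int reada)
{
	unsigned long long total = (unsigned long long)bio_secs + reada;

	return DIV_ROUND_UP(total, PAGE_SECTORS);
}
```

For bio_secs = UINT_MAX - 3 and reada = 16 the naive form wraps to a sum
of 12 and reports 2 bvecs, while the widened form reports 536870914.
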
> +	/*
> +	 * Make sure sectors won't exceed (1 << KEY_SIZE_BITS) - 1, which is
> +	 * the maximum bkey size in unit of sector. Then s->insert_bio_sectors
> +	 * will always be a valid bio in valid bkey size range.
> +	 */
> +	if (sectors > ((1 << KEY_SIZE_BITS) - 1))
> +		sectors = (1 << KEY_SIZE_BITS) - 1;

This should use min() or min_t().
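
Roughly what that could look like, as a user-space sketch. The min_t()
below is a simplified stand-in written for this example (it evaluates
its arguments twice, unlike the kernel macro), and clamp_to_bkey_size()
is a hypothetical helper name; in the patch itself the assignment would
simply be done in place on 'sectors'.

```c
#include <assert.h>

#define KEY_SIZE_BITS 16        /* width of the bkey size field */

static_assert(KEY_SIZE_BITS < 32, "shift below must fit in unsigned int");

/* Simplified stand-in for the kernel's min_t(); evaluates args twice. */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

/* Clamp a sector count to the largest size a bkey can hold. */
static inline unsigned int clamp_to_bkey_size(unsigned int sectors)
{
	return min_t(unsigned int, sectors, (1U << KEY_SIZE_BITS) - 1);
}
```

This replaces the open-coded if with a single expression: values above
65535 become 65535, anything else passes through unchanged.
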