Re: [PATCH v4] bcache: avoid oversized read request in cache missing code path

On Wed, May 26, 2021 at 11:14:50PM +0800, Coly Li wrote:
> In the cache missing code path of cached device, if a proper location
> from the internal B+ tree is matched for a cache miss range, function
> cached_dev_cache_miss() will be called in cache_lookup_fn() in the
> following code block,
> [code block 1]
>   526         unsigned int sectors = KEY_INODE(k) == s->iop.inode
>   527                 ? min_t(uint64_t, INT_MAX,
>   528                         KEY_START(k) - bio->bi_iter.bi_sector)
>   529                 : INT_MAX;
>   530         int ret = s->d->cache_miss(b, s, bio, sectors);
> 
> Here s->d->cache_miss() is the callback function pointer initialized as
> cached_dev_cache_miss(). The last parameter 'sectors' is an important
> hint for calculating the size of the read request sent to the backing
> device for the missing cache data.
> 
> The current calculation in the above code block may generate an
> oversized value of 'sectors', which may consequently trigger 2
> different potential kernel panics by BUG() or BUG_ON() as listed below,
> 
> 1) BUG_ON() inside bch_btree_insert_key(),
> [code block 2]
>    886         BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
> 2) BUG() inside biovec_slab(),
> [code block 3]
>    51         default:
>    52                 BUG();
>    53                 return NULL;
> 
> Both of the above panics originate from cached_dev_cache_miss() via the
> oversized parameter 'sectors'.
> 
> Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
> the size of data read from the backing device for the cache miss. This
> size is stored in s->insert_bio_sectors by the following lines of code,
> [code block 4]
>   909    s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
> 
> Then the actual key to insert into the internal B+ tree is generated
> and stored in s->iop.replace_key by the following lines of code,
> [code block 5]
>   911   s->iop.replace_key = KEY(s->iop.inode,
>   912                    bio->bi_iter.bi_sector + s->insert_bio_sectors,
>   913                    s->insert_bio_sectors);
> The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
> the above code block.
> 
> The bio sent to the backing device for the missing data is allocated
> with a hint from s->insert_bio_sectors by the following lines of code,
> [code block 6]
>   926    cache_bio = bio_alloc_bioset(GFP_NOWAIT,
>   927                 DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
>   928                 &dc->disk.bio_split);
> The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
> above code block.
> 
> Now let me explain how the panics happen with the oversized 'sectors'.
> In code block 5, replace_key is generated by macro KEY(). From the
> definition of macro KEY(),
> [code block 7]
>   71 #define KEY(inode, offset, size)                                  \
>   72 ((struct bkey) {                                                  \
>   73      .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),     \
>   74      .low = (offset)                                              \
>   75 })
> 
> Here 'size' is a 16-bit field embedded in the 64-bit member 'high' of
> struct bkey. But in code block 1, the value of
> "KEY_START(k) - bio->bi_iter.bi_sector" can quite easily be larger than
> (1<<16) - 1, which makes the bkey size calculation in code block 5
> overflow. In one bug report the value of parameter 'sectors' is 131072
> (= 1 << 17). The oversized 'sectors' results in an oversized
> s->insert_bio_sectors in code block 4, which then makes the size field
> of s->iop.replace_key 0 in code block 5. Then the 0-sized
> s->iop.replace_key is inserted into the internal B+ tree as the cache
> missing check key (a special key to detect and avoid a race between a
> normal write request and a cache missing read request) as,
> [code block 8]
>   915   ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
> 
> Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey
> size check BUG_ON() in code block 2, and causes the kernel panic 1).
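
To make the truncation concrete, here is a minimal userspace sketch of the packing described above. It is not the kernel code: pack_high() and unpack_size() are hypothetical helpers, and the only facts taken from the commit message are the 16-bit size field and the shift by 20 in KEY().

```c
#include <stdint.h>

#define KEY_SIZE_BITS 16  /* width of the size field, per the text above */
#define KEY_SIZE_SHIFT 20 /* shift used for 'size' in the quoted KEY() macro */

/* Hypothetical helper: build the 'high' word the same way KEY() does. */
static inline uint64_t pack_high(uint64_t inode, uint64_t size)
{
	return (1ULL << 63) | (size << KEY_SIZE_SHIFT) | inode;
}

/* Hypothetical helper: read back only the 16-bit size field. */
static inline uint64_t unpack_size(uint64_t high)
{
	return (high >> KEY_SIZE_SHIFT) & ((1ULL << KEY_SIZE_BITS) - 1);
}
```

With these helpers, unpack_size(pack_high(1, 131072)) reads back as 0, because bit 17 of the size lands outside the 16-bit field, while a size of 65535 survives the round trip.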
> 
> The other kernel panic comes from code block 6, and is caused by an
> oversized bvecs number derived from the value of s->insert_bio_sectors
> in code block 4,
> 	min(sectors, bio_sectors(bio) + reada)
> There are two possibilities for an oversized result,
> - bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
> - sectors < bio_sectors(bio) + reada, but sectors is oversized.
> 
> From a bug report, the result of "DIV_ROUND_UP(s->insert_bio_sectors,
> PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many
> other values which are larger than BIO_MAX_VECS (a.k.a. 256). When
> calling bio_alloc_bioset() with such a larger-than-256 value as the 2nd
> parameter, this value will eventually be sent to biovec_slab() as
> parameter 'nr_vecs' in the following code path,
>    bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
> Because parameter 'nr_vecs' is a larger-than-256 value, the panic by
> BUG() in code block 3 is triggered inside biovec_slab().
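
As a rough illustration of the numbers quoted above, the sketch below recomputes the bvec count in userspace. It assumes 4 KiB pages and 512-byte sectors (so PAGE_SECTORS is 8), and nr_bvecs_for() and bvecs_fit() are hypothetical names, not kernel functions.

```c
#define SECTOR_SIZE   512u
#define PAGE_SIZE_B   4096u /* assumption: 4 KiB pages */
#define PAGE_SECTORS  (PAGE_SIZE_B / SECTOR_SIZE)
#define BIO_MAX_VECS  256u  /* the limit named in the text above */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: number of bvecs requested for a given size. */
static inline unsigned int nr_bvecs_for(unsigned int insert_bio_sectors)
{
	return DIV_ROUND_UP(insert_bio_sectors, PAGE_SECTORS);
}

/* Hypothetical helper: nonzero if the request stays within the limit. */
static inline int bvecs_fit(unsigned int insert_bio_sectors)
{
	return nr_bvecs_for(insert_bio_sectors) <= BIO_MAX_VECS;
}
```

Under these assumptions an insert size of 2752 sectors asks for 344 bvecs, one of the reported values, and anything above 2048 sectors (1 MiB) no longer fits in 256 bvecs.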
> 
> From the above analysis, we know that the 4th parameter 'sectors' sent
> into cached_dev_cache_miss() may cause overflow in code blocks 5 and 6,
> and finally cause kernel panics in code blocks 2 and 3. And if the
> result of bio_sectors(bio) + reada exceeds the valid bvecs number, it
> may also trigger the kernel panic in code block 3 from code block 6.
> 
> In this patch, the above two panics are avoided by the following
> changes,
> - If DIV_ROUND_UP(bio_sectors(bio) + reada, PAGE_SECTORS) exceeds the
>   maximum bvecs count, reduce reada to make sure the DIV_ROUND_UP()
>   result won't generate an oversized s->insert_bio_sectors that causes
>   an invalid bvecs number for cache_bio.
> - If sectors exceeds the maximum bkey size, then set sectors to the
>   maximum valid bkey size.
> 
> With the above changes, in code block 5 the size value in the KEY()
> macro will always be in the valid range. Likewise, in code block 6 the
> nr_iovecs parameter of bio_alloc_bioset() calculated by
> DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS) will always be a valid
> bvecs number. Now neither panic can happen anymore.
> 
> The current problematic code can be partially found since Linux
> v5.13-rc1, therefore all maintained stable kernels should try to apply
> this fix.
> 
> Reported-by: Diego Ercolani <diego.ercolani@xxxxxxxxx>
> Reported-by: Jan Szubiak <jan.szubiak@xxxxxxxxxxxxxx>
> Reported-by: Marco Rebhan <me@xxxxxxxxxxxx>
> Reported-by: Matthias Ferdinand <bcache@xxxxxxxxx>
> Reported-by: Thorsten Knabe <linux@xxxxxxxxxxxxxxxxx>
> Reported-by: Victor Westerhuis <victor@xxxxxxxxxxx>
> Reported-by: Vojtech Pavlik <vojtech@xxxxxxx>
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> Cc: Takashi Iwai <tiwai@xxxxxxxx>
> ---
> Changelog:
> v4, do not directly access BIO_MAX_VECS, and reduce the reada value to
>     avoid an oversized bvecs number, following a hint from Christoph
>     Hellwig.
> v3, fix typo in v2.
> v2, fix the bypass bio size calculation in v1.
> v1, the initial version
> 
>  drivers/md/bcache/request.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 29c231758293..054948f037ed 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -883,6 +883,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  	unsigned int reada = 0;
>  	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
>  	struct bio *miss, *cache_bio;
> +	unsigned int nr_bvecs, max_segs;
>  
>  	s->cache_missed = 1;
>  
> @@ -899,6 +900,24 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>  			      get_capacity(bio->bi_bdev->bd_disk) -
>  			      bio_end_sector(bio));
>  
> +	/*
> +	 * If "bio_sectors(bio) + reada" may causes an oversized bio bvecs
> +	 * number, reada size must be deducted to make sure the following
> +	 * calculated s->insert_bio_sectors won't cause oversized bvecs number
> +	 * to cache_bio.
> +	 */
> +	nr_bvecs = DIV_ROUND_UP(bio_sectors(bio) + reada, PAGE_SECTORS);

Can't this overflow if bio_sectors(bio) is close to UINT_MAX already?
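
To spell out the concern, here is a minimal userspace sketch assuming 32-bit unsigned arithmetic and PAGE_SECTORS of 8. nr_bvecs_naive() mirrors the expression in the patch, and nr_bvecs_checked() is one hypothetical way to avoid the wraparound by widening before the addition. Whether bio_sectors(bio) can actually get that large in this path is a separate question.

```c
#include <stdint.h>

#define PAGE_SECTORS 8u /* assumption: 4 KiB pages, 512-byte sectors */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Same shape as the patch: the 32-bit sum can wrap before the division. */
static inline uint32_t nr_bvecs_naive(uint32_t bio_secs, uint32_t reada)
{
	return DIV_ROUND_UP(bio_secs + reada, PAGE_SECTORS);
}

/* Hypothetical alternative: do the addition in 64 bits so it cannot wrap. */
static inline uint32_t nr_bvecs_checked(uint32_t bio_secs, uint32_t reada)
{
	uint64_t total = (uint64_t)bio_secs + reada;

	return (uint32_t)((total + PAGE_SECTORS - 1) / PAGE_SECTORS);
}
```

For example, with bio_secs = UINT32_MAX - 4 and reada = 16 the naive version returns 2, which would pass any "too many bvecs" check, while the widened version returns 536870914.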

> +	/*
> +	 * Make sure sectors won't exceed (1 << KEY_SIZE_BITS) - 1, which is
> +	 * the maximum bkey size in unit of sector. Then s->insert_bio_sectors
> +	 * will always be a valid bio in valid bkey size range.
> +	 */
> +	if (sectors > ((1 << KEY_SIZE_BITS) - 1))
> +		sectors = (1 << KEY_SIZE_BITS) - 1;

This should use min() or min_t().
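
Something along these lines, shown as a userspace sketch. The min_t() below is a simplified stand-in for the kernel macro (it evaluates its arguments more than once), KEY_SIZE_BITS is taken as 16 per the commit message, and clamp_sectors() is a hypothetical wrapper used only to make the snippet testable.

```c
#define KEY_SIZE_BITS 16 /* per the commit message: 16-bit bkey size */

/* Simplified stand-in for the kernel's min_t(); not the real macro. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical wrapper: clamp 'sectors' to the largest valid bkey size. */
static inline unsigned int clamp_sectors(unsigned int sectors)
{
	return min_t(unsigned int, sectors, (1 << KEY_SIZE_BITS) - 1);
}
```

In the patch itself this would collapse the if statement into a single assignment to sectors.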


