Re: Unusual value of optimal_io_size prevents bcache initialization

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 26 Sep 2023, Coly Li wrote:
> > 2023年9月24日 22:06,Coly Li <colyli@xxxxxxx> 写道:
> > 
> 
> [snipped]
> 
> >>> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
> >>> it should be some value less than 1<<31 and aligned to optimal_io_size.
> >>> After the code got merged into kernel for 10+ years, it is time to improve
> >>> this calculation :-) >
> >> Yeah, one of the other doubts I had was exactly regarding this value and if it is still "actual" to calculate it that way. Unfortunately, I don't have the expertise to have an opinion on it. Would you be so kind to share your knowledge and let me understand why it is calculated this way and why is it using the optimal io size? Is it using it to "writeback" optimal-sized blokes?
> >> 
> > 
> > Most of the conditions when underlying hardware doesn’t declare its optimal io size, bcache uses 1<<31 as a default stripe size. It works fine for decade, so I will use it and make sure it is aligned to value of optimal io size. It should work fine. And I will compose a simple patch for this fix.
> > 
> >>>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
> >>>> arrays allocated using n, are being used just in writeback mode, is this
> >>>> correct? In my specific case, I'm not planning to use writeback mode so I
> >>>> would expect bcache to not even try to create those arrays. Or, at least, to
> >>>> not create them during initialization but just in case of a change in the
> >>>> working mode (i.e. write-through -> writeback).
> >>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
> >>> idea, but it is blocked by other depending patches which are not finished
> >>> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
> >>> and d->full_dirty_stripes when they are necessary. I hope I may help to make
> >>> the change go into upstream sooner.
> >>> I will post a patch for your testing.
> >> This would be great! Thank you very much! On the other side, if there's anything I can do to help I would be glad to contribute.
> > 
> > I will post a simple patch for the reported memory allocation failure for you to test soon.
> 
> 
> Hi Andrea,
> 
> Could you please try the attached patch and see whether it makes some difference? Thank you in advance.

> From: Coly Li <colyli@xxxxxxx>
> Date: Tue, 26 Sep 2023 20:13:19 +0800
> Subject: [PATCH] bcache: avoid oversize memory allocation by small stripe_size
> 
> Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
> used for dirty data writeback, their sizes are decided by backing device
> capacity and stripe size. Larger backing device capacity or smaller
> stripe size make these two arraies occupies more dynamic memory space.
> 
> Currently bcache->stripe_size is directly inherited from
> queue->limits.io_opt of underlying storage device. For normal hard
> drives, its limits.io_opt is 0, and bcache sets the corresponding
> stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for
> devices do declare value for queue->limits.io_opt, small stripe_size
> (comparing to 1TB) becomes an issue for oversize memory allocations of
> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while the
> capacity of hard drives gets much larger in recent decade.
> 
> For example a raid5 array assembled by three 20TB hardrives, the raid
> device capacity is 40TB with typical 512KB limits.io_opt. After the math
> calculation in bcache code, these two arraies will occupy 400MB dynamic
> memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt is
> declared on a new 2TB hard drive, then these two arraies request 2GB and
> 512MB dynamic memory from kzalloc(). The result is that bcache device
> always fails to initialize on his system.
> 
> To avoid the oversize memory allocation, bcache->stripe_size should not
> directly inherited by queue->limits.io_opt from the underlying device.
> This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe size
> and set bcache device's stripe size against the declared limits.io_opt
> value from the underlying storage device,
> - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will
>   set its stripe size directly by this limits.io_opt value.
> - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will
>   set its stripe size by a value multiplying limits.io_opt and euqal or
>   large than BCH_MIN_STRIPE_SZ.
> 
> Then the minimal stripe size of a bcache device will always be >= 4MB.
> For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
> in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
> by these two arraies will be 2.5MB in total.

This will create expensive unaligned writes on RAID5/6 arrays for most
cases.  For example, if the stripe size of 6 disks with 64 k chunks has
a size of 384 k, then when you round up to an even value of 4MB
you will introduce a read-copy-write behavior since 384KB
does not divide evenly into 4MB (4MB/384KB =~ 10.667).

The best way to handle this would be to Use 4 megabytes as a minimum,
but round up to a multiple of the value in limits.io_opt.

Here is a real-world example of a non-power-of-2 io_opt value:

	]# cat /sys/block/sdc/queue/optimal_io_size 
	196608

More below:

> Such mount of memory allocated for bcache->stripe_sectors_dirty and
> bcache->full_dirty_stripes is reasonable for most of storage devices.
> 
> Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> Cc: Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx>
> ---
>  drivers/md/bcache/bcache.h | 1 +
>  drivers/md/bcache/super.c  | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 5a79bb3c272f..83eb7f27db3d 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -265,6 +265,7 @@ struct bcache_device {
>  #define BCACHE_DEV_WB_RUNNING		3
>  #define BCACHE_DEV_RATE_DW_RUNNING	4
>  	int			nr_stripes;
> +#define BCH_MIN_STRIPE_SZ		((4 << 20) >> SECTOR_SHIFT)
>  	unsigned int		stripe_size;
>  	atomic_t		*stripe_sectors_dirty;
>  	unsigned long		*full_dirty_stripes;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 0ae2b3676293..0eb71543d773 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -905,6 +905,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>  
>  	if (!d->stripe_size)
>  		d->stripe_size = 1 << 31;
> +	else if (d->stripe_size < BCH_MIN_STRIPE_SZ)
> +		d->stripe_size = round_up(BCH_MIN_STRIPE_SZ, d->stripe_size);

I think you want "roundup" (not "round_up") to solve alignment problem:

+		d->stripe_size = roundup(BCH_MIN_STRIPE_SZ, d->stripe_size);
  
Both roundup() and round_up() are defined in math.h:

  * round_up - round up to next specified power of 2
  * roundup - round up to the next specified multiple 

	https://elixir.bootlin.com/linux/v6.0/source/include/linux/math.h#L17

-Eric

>  	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
>  	if (!n || n > max_stripes) {
> -- 
> 2.35.3
> 

--
Eric Wheeler


> 
> Coly Li
> 
> 

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux ARM Kernel]     [Linux Filesystem Development]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux