Re: [PATCH V2] mkfs: increase the minimum log size to 64MB when possible

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Mon, 4 Apr 2022 17:55:27 -0700

On Mon, Apr 04, 2022 at 06:08:28PM -0500, Eric Sandeen wrote:
> Recently, the upstream maintainers have been taking a lot of heat on
> account of writer threads encountering high latency when asking for log
> grant space when the log is small.  The reported use case is a heavily
> threaded indexing product logging trace information to a filesystem
> ranging in size between 20 and 250GB.  The meetings that result from the
> complaints about latency and stall warnings in dmesg both from this use
> case and also a large well known cloud product are now consuming 25% of
> the maintainer's weekly time and have been for months.
> 
> For small filesystems, the log is small by default because we have
> defaulted to a ratio of 1:2048 (or even less).  For grown filesystems,
> this is even worse, because big filesystems generate big metadata.
> However, the log size is still insufficient even if it is formatted at
> the larger size.
> 
> On a 220GB filesystem, the 99.95% latencies observed with a 200-writer
> file synchronous append workload running on a 44-AG filesystem (with 44
> CPUs) spread across 4 hard disks showed:
> 
> 	99.5%
> Log(MB)	Latency(ms)	BW (MB/s)	xlog_grant_head_wait
> 10	520		243		1875
> 20	220		308		540
> 40	140		360		6
> 80	92		363		0
> 160	86		364		0
> 
> For 4 NVME, the results were:
> 
> 10	201		409		898
> 20	177		488		144
> 40	122		550		0
> 80	120		549		0
> 160	121		545		0
> 
> This shows pretty clearly that we could reduce the amount of time that
> threads spend waiting on the XFS log by increasing the log size to at
> least 40MB regardless of size.  We then repeated the benchmark with a
> cloud system and an old machine to see if there were any ill effects on
> less stable hardware.
> 
> For cloudy iscsi block storage, the results were:
> 
> 10	390		176		2584
> 20	173		186		357
> 40	37		187		0
> 80	40		183		0
> 160	37		183		0
> 
> A decade-old machine w/ 24 CPUs and a giant spinning disk RAID6 array
> produced this:
> 
> 10	55		5.4		0
> 20	40		5.9		0
> 40	62		5.7		0
> 80	66		5.7		0
> 160	25		5.4		0
> 
> From the first three scenarios, it is clear that there are gains to be
> had by sizing the log somewhere between 40 and 80MB -- the long tail
> latency drops quite a bit, and programs are no longer blocking on the
> log's transaction space grant heads.  Split the difference and set the
> log size floor to 64MB.
> 
> Inspired-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> Commit-log-stolen-from: Darrick J. Wong <djwong@xxxxxxxxxx>
> Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx>
> ---
> 
> This is reworked, with dependencies on other patches removed; details in
> followup emails.
> 
> diff --git a/include/xfs_multidisk.h b/include/xfs_multidisk.h
> index a16a9fe2..ef4443b0 100644
> --- a/include/xfs_multidisk.h
> +++ b/include/xfs_multidisk.h
> @@ -17,8 +17,6 @@
>  #define	XFS_MIN_INODE_PERBLOCK	2		/* min inodes per block */
>  #define	XFS_DFL_IMAXIMUM_PCT	25		/* max % of space for inodes */
>  #define	XFS_MIN_REC_DIRSIZE	12		/* 4096 byte dirblocks (V2) */
> -#define	XFS_DFL_LOG_FACTOR	5		/* default log size, factor */
> -						/* with max trans reservation */
>  #define XFS_MAX_INODE_SIG_BITS	32		/* most significant bits in an
>  						 * inode number that we'll
>  						 * accept w/o warnings
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 96682f9a..e36c1083 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -18,6 +18,14 @@
>  #define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
>  #define MEGABYTES(count, blog)	((uint64_t)(count) << (20 - (blog)))
>  
> +/*
> + * Realistically, the log should never be smaller than 64MB.  Studies by the
> + * kernel maintainer in early 2022 have shown a dramatic reduction in long tail
> + * latency of the xlog grant head waitqueue when running a heavy metadata
> + * update workload when the log size is at least 64MB.
> + */
> +#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog)	(MEGABYTES(64, (blog)))
> +
>  /*
>   * Use this macro before we have superblock and mount structure to
>   * convert from basic blocks to filesystem blocks.
> @@ -3266,7 +3274,7 @@ calculate_log_size(
>  	struct xfs_mount	*mp)
>  {
>  	struct xfs_sb		*sbp = &mp->m_sb;
> -	int			min_logblocks;
> +	int			min_logblocks;	/* absolute minimum */
>  	struct xfs_mount	mount;
>  
>  	/* we need a temporary mount to calculate the minimum log size. */
> @@ -3308,28 +3316,17 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
>  
>  	/* internal log - if no size specified, calculate automatically */
>  	if (!cfg->logblocks) {
> -		if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
> -			/* tiny filesystems get minimum sized logs. */
> -			cfg->logblocks = min_logblocks;
> -		} else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
> +		/* Use a 2048:1 fs:log ratio for most filesystems */
> +		cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
> +		cfg->logblocks = cfg->logblocks >> cfg->blocklog;
>  
> -			/*
> -			 * For small filesystems, we want to use the
> -			 * XFS_MIN_LOG_BYTES for filesystems smaller than 16G if
> -			 * at all possible, ramping up to 128MB at 256GB.
> -			 */
> -			cfg->logblocks = min(XFS_MIN_LOG_BYTES >> cfg->blocklog,
> -					min_logblocks * XFS_DFL_LOG_FACTOR);
> -		} else {
> -			/*
> -			 * With a 2GB max log size, default to maximum size
> -			 * at 4TB. This keeps the same ratio from the older
> -			 * max log size of 128M at 256GB fs size. IOWs,
> -			 * the ratio of fs size to log size is 2048:1.
> -			 */
> -			cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
> -			cfg->logblocks = cfg->logblocks >> cfg->blocklog;
> -		}
> +		/* But don't go below a reasonable size */
> +		cfg->logblocks = max(cfg->logblocks,
> +				XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
> +
> +		/* And for a tiny filesystem, use the absolute minimum size */
> +		if (cfg->dblocks < MEGABYTES(512, cfg->blocklog))
> +			cfg->logblocks = min_logblocks;

Heh, I was going to apply this to any filesystem under 300MB (and then
cut everyone off at 300M) but I suppose if you'd rather set that at 512M
then I'm not going to complain... maybe we're better off not creating
absurd things like 20% of a tiny FS used for logs. :D

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

>  
>  		/* Ensure the chosen size meets minimum log size requirements */
>  		cfg->logblocks = max(min_logblocks, cfg->logblocks);
>