On Wed, Nov 07, 2018 at 07:58:51AM +1100, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 09:20:37AM -0800, Omar Sandoval wrote:
> > On Tue, Nov 06, 2018 at 10:49:48AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 02, 2018 at 12:38:00PM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@xxxxxx>
> > > >
> > > > The realtime summary is a two-dimensional array on disk, effectively:
> > > >
> > > > u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
> > > >
> > > > rsum[log][bbno] is the number of extents of size 2**log which start in
> > > > bitmap block bbno.
> > > >
> > > > xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
> > > > rsum[log][bbno] != 0 for any log level. However, the summary array is
> > > > stored in row-major order (i.e., like an array in C), so these entries
> > > > are not adjacent, but rather spread across the entire summary file. In
> > > > the worst case (a full bitmap block), xfs_rtany_summary() has to check
> > > > every level.
> > > >
> > > > This means that on a moderately-used realtime device, an allocation
> > > > will waste a lot of time finding, reading, and releasing buffers for
> > > > the realtime summary. In particular, one of our storage services
> > > > (which runs on servers with 8 very slow CPUs and 15 8 TB XFS realtime
> > > > filesystems) spends almost 5% of its CPU cycles in xfs_rtbuf_get()
> > > > and xfs_trans_brelse() called from xfs_rtany_summary().
> > >
> > > Yup, the RT allocator is showing that it was never intended for
> > > general purpose data storage workloads... :P
> >
> > Indeed. What they really want is the AG allocator with metadata on a
> > separate device, but that sounds like a much bigger project.
> >
> > > So how much memory would it require to keep an in-memory copy of
> > > the summary information? i.e. do an in-memory copy search, then once
> > > the block is found, pull in the buffer that needs modifying and log
> > > it? That gets rid of the buffer management overhead, and potentially
> > > allows for more efficient search algorithms to be used.
> >
> > Yeah, I considered that. We use 256kB realtime extents for the 8 TB
> > filesystems, so the summary is about 100kB. If we were using 4kB
> > realtime extents, it'd be about 7MB. So, it's doable but not the best.
>
> Quite frankly, that's a small amount compared to the amount of
> metadata we typically cache on an active filesystem. If we are
> really talking about 1MB of RAM per TB of disk space at the worst
> case, then that's an acceptable amount to spend on speeding up the
> allocator. This patch is a good stepping stone - would you be able
> to look into implementing a full in-memory summary cache?

We noticed some other issues with the realtime allocator (namely, using
the "near" allocation strategy for new files leads to bad fragmentation),
so it is on my todo list to make it more intelligent -- it's good to know
that it'd be acceptable to cache much more. v2 of this patch incoming.