On Tue, Nov 06, 2018 at 10:49:48AM +1100, Dave Chinner wrote:
> On Fri, Nov 02, 2018 at 12:38:00PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@xxxxxx>
> > 
> > The realtime summary is a two-dimensional array on disk, effectively:
> > 
> > u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
> > 
> > rsum[log][bbno] is the number of extents of size 2**log which start in
> > bitmap block bbno.
> > 
> > xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
> > rsum[log][bbno] != 0 for any log level. However, the summary array is
> > stored in row-major order (i.e., like an array in C), so all of these
> > entries are not adjacent, but rather spread across the entire summary
> > file. In the worst case (a full bitmap block), xfs_rtany_summary() has
> > to check every level.
> > 
> > This means that on a moderately-used realtime device, an allocation will
> > waste a lot of time finding, reading, and releasing buffers for the
> > realtime summary. In particular, one of our storage services (which runs
> > on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
> > spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
> > xfs_trans_brelse() called from xfs_rtany_summary().
> 
> Yup, the RT allocator is showing that it was never intended for
> general purpose data storage workloads... :P

Indeed. What they really want is the AG allocator with metadata on a
separate device, but that sounds like a much bigger project.

> So how much memory would it require to keep an in-memory copy of
> the summary information? i.e. do an in-memory copy search, then once
> the block is found, pull in the buffer that needs modifying and log
> it? That gets rid of the buffer management overhead, and potentially
> allows for more efficient search algorithms to be used.

Yeah, I considered that. We use 256kB realtime extents for the 8 TB
filesystem, so it's 100kB. If we were using 4kB realtime extents, it'd
be about 7MB. So, it's doable but not the best.

> > One solution would be to swap the dimensions of the summary array so
> > that different sizes on the same bitmap block are adjacent. However,
> > this would require a disk format change to a very old component of XFS,
> > and it would slow down xfs_rtallocate_extent_size().
> 
> Or maybe we should put a second summary on disk, ordered the other
> way similar to how the btree allocator freespace indexes are
> organised.

That's another thing I considered, but I wanted to avoid a disk format
change if possible, and this small cache is pretty much just as good.

> > Instead, we can cache the minimum size which contains any extents. We do
> > so lazily; rather than guaranteeing that the cache contains the precise
> > minimum, it always contains a loose lower bound which we tighten when we
> > read or update a summary block. This only uses a few kilobytes of memory
> > and is already serialized via the realtime bitmap and summary inode
> > locks, so the cost is minimal. With this change, the same workload only
> > spends 0.2% of its CPU cycles in the realtime allocator.
> 
> Good win.
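
To spell out what the lazy lower bound buys us, here's a throwaway
userspace model. This is not the actual patch -- the names and
structure are made up -- it just mirrors the idea:

/*
 * Toy model of the lazy lower-bound cache.
 *
 * rsum[log][bbno] counts free extents of size 2^log starting in bitmap
 * block bbno; in the kernel, each different log level lives in a
 * different summary block, and reading those blocks is the expensive
 * part this avoids.
 *
 * Invariant: rsum[log][bbno] == 0 for every log < rsum_cache[bbno], so
 * an all-zero cache is trivially valid.
 */
#include <stdio.h>

#define LEVELS	26	/* log2(number of rt extents) + 1 */
#define BBLOCKS	1024	/* number of blocks in the rt bitmap */

static unsigned int rsum[LEVELS][BBLOCKS];	/* stands in for the summary file */
static unsigned char rsum_cache[BBLOCKS];	/* one byte per bitmap block */

/* Smallest log <= high with a nonzero count for bbno, or -1. */
static int first_nonzero_level(unsigned int bbno, unsigned int high)
{
	unsigned int log = rsum_cache[bbno];	/* skip levels known to be zero */

	for (; log <= high; log++) {
		if (rsum[log][bbno]) {
			rsum_cache[bbno] = log;	/* tightest bound we now know */
			return log;
		}
	}
	rsum_cache[bbno] = log;		/* everything up to high was zero */
	return -1;
}

/* Freeing an extent can make a lower level nonzero again. */
static void note_extent_freed(unsigned int bbno, unsigned int log)
{
	rsum[log][bbno]++;
	if (log < rsum_cache[bbno])
		rsum_cache[bbno] = log;	/* keep the lower bound valid */
}

int main(void)
{
	note_extent_freed(3, 10);	/* a 2^10-extent starts in bitmap block 3 */
	printf("%d\n", first_nonzero_level(3, LEVELS - 1));	/* prints 10 */
	printf("%d\n", first_nonzero_level(7, LEVELS - 1));	/* prints -1 */
	return 0;
}

The search never has to look at a level below the cached bound,
allocations (which only decrement summary counts) can never invalidate
it, and a free lowers it again when needed -- that's the "tighten when
we read or update a summary block" behaviour from the commit message.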

> > @@ -1187,8 +1195,8 @@ xfs_rtmount_init(
> >  }
> >  
> >  /*
> > - * Get the bitmap and summary inodes into the mount structure
> > - * at mount time.
> > + * Get the bitmap and summary inodes and the summary cache into the mount
> > + * structure at mount time.
> >   */
> >  int				/* error */
> >  xfs_rtmount_inodes(
> > @@ -1211,6 +1219,16 @@ xfs_rtmount_inodes(
> >  		return error;
> >  	}
> >  	ASSERT(mp->m_rsumip != NULL);
> > +	/*
> > +	 * The rsum cache is initialized to all zeroes, which trivially
> > +	 * satisfies the invariant.
> 
> Satisfies what invariant? Please explain why this is OK and how it
> interacts with the allocation code so we're not left guessing what
> this means in the future.
> 
> Remember: comments are not for now - they are for 10 years time when
> no-one remembers what this code does or is actually used for.

It's the invariant I specified in the definition in xfs_mount_t, but
I'll clarify this.

> > +	 */
> > +	mp->m_rsum_cache = kvzalloc(sbp->sb_rbmblocks, GFP_KERNEL);
> 
> kmem_zalloc_large().

Will change.

> Which makes me ask, just how large is this allocation if you're
> using vmalloc() straight out of the box?

For our 256kB extent size, it's only ~1000 bytes. However, for the same
size filesystem with 4kB extents, it'd be ~60kB.

> > +	if (!mp->m_rsum_cache) {
> > +		xfs_irele(mp->m_rbmip);
> > +		xfs_irele(mp->m_rsumip);
> > +		return -ENOMEM;
> > +	}
> >  	return 0;
> >  }
> >  
> > @@ -1218,6 +1236,7 @@ void
> >  xfs_rtunmount_inodes(
> >  	struct xfs_mount *mp)
> >  {
> > +	kvfree(mp->m_rsum_cache);
> 
> kmem_free().

Will change.

Thanks for the review!
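
P.S. In case it helps anyone reading the archive later, the
back-of-the-envelope math behind the sizes above (assuming 4k
filesystem blocks, which is an assumption on my part) is roughly:

  rt extents    = rt device size / rt extent size
  bitmap blocks = rt extents / (block size * 8)       (1 bit per extent)
  summary size  = (log2(rt extents) + 1) * bitmap blocks * 4 bytes
  cache size    = bitmap blocks * 1 byte

For ~8 TB with 256kB extents that's ~2^25 extents and ~1000 bitmap
blocks, so ~100kB of summary and ~1kB of cache; with 4kB extents
everything is about 64 times bigger, give or take.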