On Tue, Nov 13, 2018 at 09:42:34AM -0800, Darrick J. Wong wrote: > On Sat, Nov 10, 2018 at 07:59:58PM -0800, Omar Sandoval wrote: > > From: Omar Sandoval <osandov@xxxxxx> > > > > The realtime summary is a two-dimensional array on disk, effectively: > > > > u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap] > > > > rsum[log][bbno] is the number of extents of size 2**log which start in > > bitmap block bbno. > > > > xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether > > rsum[log][bbno] != 0 for any log level. However, the summary array is > > stored in row-major order (i.e., like an array in C), so all of these > > entries are not adjacent, but rather spread across the entire summary > > file. In the worst case (a full bitmap block), xfs_rtany_summary() has > > to check every level. > > > > This means that on a moderately-used realtime device, an allocation will > > waste a lot of time finding, reading, and releasing buffers for the > > realtime summary. In particular, one of our storage services (which runs > > on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems) > > spends almost 5% of its CPU cycles in xfs_rtbuf_get() and > > xfs_trans_brelse() called from xfs_rtany_summary(). > > > > One solution would be to also store the summary with the dimensions > > swapped. However, this would require a disk format change to a very old > > component of XFS. > > > > Instead, we can cache the minimum size which contains any extents. We do > > so lazily; rather than guaranteeing that the cache contains the precise > > minimum, it always contains a loose lower bound which we tighten when we > > read or update a summary block. This only uses a few kilobytes of memory > > and is already serialized via the realtime bitmap and summary inode > > locks, so the cost is minimal. With this change, the same workload only > > spends 0.2% of its CPU cycles in the realtime allocator. > > > > Signed-off-by: Omar Sandoval <osandov@xxxxxx> > > --- [snip] > Hmmm, how much memory does this require? > > Let's say I had a 64TB realtime volume on a 4k block filesystem and 1 > block per rt extent. > > That's ... 2^(46 - 12) = 2^34 rt blocks. > > Each rtbitmap block tracks 2^(12+3) = 2^15 blocks, which means that > there are 2^(34-15) = 2^19 rtbitmap blocks. > > The cache requires 1 byte per rtbitmap block (2^19) which means it > requires ~512k of memory? And if I had 1EB that would be 8MB of RAM? > > (Granted you said you use 256K rt extents, which cuts down the memory > requirements by 64x, but I don't want to assume everyone will do this.) Yup, that math looks right. > > + if (!mp->m_rsum_cache) { > > + xfs_irele(mp->m_rbmip); > > + xfs_irele(mp->m_rsumip); > > + return -ENOMEM; > > + } > > That's a pretty high order memory allocation and it seem unfortunate to > fail the mount because we couldn't cobble together enough [kv]memory to > set up a rtsummary cache. > > One option would be to fall back to reading the rtsummary file if > m_rsum_cache == NULL (i.e. we couldn't get enough memory to set up the > cache). Good idea, I'll make this continue without the cache if it can't be allocated. Thanks for the review. > Another option could be to use the xfs big memory array that > I've been developing for the xfs online repair patchset which uses a > memfd to create a byte-addressable array whose pages can be swapped out. > The downside of xfbma is that array accesses are pretty heavyweight. > > --D