On Tue, Feb 23, 2021 at 03:46:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> We process the buf_log_item bitmap one set bit at a time with
> xfs_next_bit() so we can detect if a region crosses a memcpy
> discontinuity in the buffer data address. This has massive overhead
> on large buffers (e.g. 64k directory blocks) because we do a lot of
> unnecessary checks and xfs_buf_offset() calls.
> 
> For example, a 16-way concurrent create workload on a debug kernel
> running CPU bound has this at the top of the profile at ~120k
> creates/s on 64kB directory block size:
> 
>   20.66%  [kernel]  [k] xfs_dir3_leaf_check_int
>    7.10%  [kernel]  [k] memcpy
>    6.22%  [kernel]  [k] xfs_next_bit
>    3.55%  [kernel]  [k] xfs_buf_offset
>    3.53%  [kernel]  [k] xfs_buf_item_format
>    3.34%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    3.04%  [kernel]  [k] do_raw_spin_lock
>    2.84%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
>    2.31%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.36%  [kernel]  [k] xfs_log_commit_cil
> 
> (debug checks hurt large blocks)
> 
> The only buffers with discontinuities in the data address are
> unmapped buffers, and they are only used for inode cluster buffers
> and only for logging unlinked pointers. IOWs, it is -rare- that we
> even need to detect a discontinuity in the buffer item formatting
> code.
> 
> Optimise all this by using xfs_contig_bits() to find the size of
> each contiguous region, then test for a discontinuity inside it. If
> we find one, fall back to the slow "bit at a time" method we use
> now. If we don't, just copy the entire contiguous range in one go.
> 
> Profile now looks like:
> 
>   25.26%  [kernel]  [k] xfs_dir3_leaf_check_int
>    9.25%  [kernel]  [k] memcpy
>    5.01%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    2.84%  [kernel]  [k] do_raw_spin_lock
>    2.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.88%  [kernel]  [k] xfs_buf_find
>    1.53%  [kernel]  [k] memmove
>    1.47%  [kernel]  [k] xfs_log_commit_cil
>   ....
>    0.34%  [kernel]  [k] xfs_buf_item_format
>   ....
>    0.21%  [kernel]  [k] xfs_buf_offset
>   ....
>    0.16%  [kernel]  [k] xfs_contig_bits
>   ....
>    0.13%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
> 
> So the bit scanning overhead of the dirty region tracking for the
> buffer log items is basically gone. Debug overhead hurts even more
> now...
> 
> Perf comparison
> 
>             dir block     creates          unlink
>             size (kb)     time     rate    time
> 
> Original    4             4m08s    220k     5m13s
> Original    64            7m21s    115k    13m25s
> Patched     4             3m59s    230k     5m03s
> Patched     64            6m23s    143k    12m33s
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_buf_item.c | 102 +++++++++++++++++++++++++++++++++++-------
>  1 file changed, 87 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 91dc7d8c9739..14d1fefcbf4c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
...
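[ The snip above elides the new xfs_buf_item_straddle() helper. For
  anyone reading along without the full patch, I believe the check
  boils down to something like the sketch below (paraphrased, not the
  exact elided hunk): compare the mapped addresses of the first chunk
  and the one-past-last chunk of the dirty range, which can only
  differ from the expected distance on unmapped buffers.

	static inline bool
	xfs_buf_item_straddle(
		struct xfs_buf	*bp,
		uint		offset,
		int		first_bit,
		int		nbits)
	{
		void		*first, *last;

		/* map both ends of the dirty range to kernel addresses */
		first = xfs_buf_offset(bp,
				offset + (first_bit << XFS_BLF_SHIFT));
		last = xfs_buf_offset(bp,
				offset + ((first_bit + nbits) << XFS_BLF_SHIFT));

		/* a gap in the mapping means we can't memcpy it whole */
		return last - first != nbits << XFS_BLF_SHIFT;
	}
]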
> @@ -84,20 +90,51 @@ xfs_buf_item_size_segment(
>  	int			*nbytes)
>  {
>  	struct xfs_buf		*bp = bip->bli_buf;
> +	int			first_bit;
> +	int			nbits;
>  	int			next_bit;
>  	int			last_bit;
>  
> -	last_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> -	if (last_bit == -1)
> +	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
> +	if (first_bit == -1)
>  		return;
>  
> -	/*
> -	 * initial count for a dirty buffer is 2 vectors - the format structure
> -	 * and the first dirty region.
> -	 */
> -	*nvecs += 2;
> -	*nbytes += xfs_buf_log_format_size(blfp) + XFS_BLF_CHUNK;
> +	(*nvecs)++;
> +	*nbytes += xfs_buf_log_format_size(blfp);
> +
> +	do {
> +		nbits = xfs_contig_bits(blfp->blf_data_map,
> +					blfp->blf_map_size, first_bit);
> +		ASSERT(nbits > 0);
> +
> +		/*
> +		 * Straddling a page is rare because we don't log contiguous
> +		 * chunks of unmapped buffers anywhere.
> +		 */
> +		if (nbits > 1 &&
> +		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> +			goto slow_scan;
> +
> +		(*nvecs)++;
> +		*nbytes += nbits * XFS_BLF_CHUNK;
> +
> +		/*
> +		 * This takes the bit number to start looking from and
> +		 * returns the next set bit from there. It returns -1
> +		 * if there are no more bits set or the start bit is
> +		 * beyond the end of the bitmap.
> +		 */
> +		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> +					 (uint)first_bit + nbits + 1);

I think the range tracking logic would be a bit more robust if it
didn't +1 here, and thus didn't make assumptions about how the rest of
the loop is implemented, but regardless this looks OK to me:

Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>

> +	} while (first_bit != -1);
>  
> +	return;
> +
> +slow_scan:
> +	/* Count the first bit we jumped out of the above loop from */
> +	(*nvecs)++;
> +	*nbytes += XFS_BLF_CHUNK;
> +	last_bit = first_bit;
>  	while (last_bit != -1) {
>  		/*
>  		 * This takes the bit number to start looking from and
> @@ -115,11 +152,14 @@ xfs_buf_item_size_segment(
>  		if (next_bit == -1) {
>  			break;
>  		} else if (next_bit != last_bit + 1 ||
> -			   xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> +			   xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
>  			last_bit = next_bit;
> +			first_bit = next_bit;
>  			(*nvecs)++;
> +			nbits = 1;
>  		} else {
>  			last_bit++;
> +			nbits++;
>  		}
>  		*nbytes += XFS_BLF_CHUNK;
>  	}
> @@ -276,6 +316,38 @@ xfs_buf_item_format_segment(
>  	/*
>  	 * Fill in an iovec for each set of contiguous chunks.
>  	 */
> +	do {
> +		ASSERT(first_bit >= 0);
> +		nbits = xfs_contig_bits(blfp->blf_data_map,
> +					blfp->blf_map_size, first_bit);
> +		ASSERT(nbits > 0);
> +
> +		/*
> +		 * Straddling a page is rare because we don't log contiguous
> +		 * chunks of unmapped buffers anywhere.
> +		 */
> +		if (nbits > 1 &&
> +		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> +			goto slow_scan;
> +
> +		xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> +					first_bit, nbits);
> +		blfp->blf_size++;
> +
> +		/*
> +		 * This takes the bit number to start looking from and
> +		 * returns the next set bit from there. It returns -1
> +		 * if there are no more bits set or the start bit is
> +		 * beyond the end of the bitmap.
> +		 */
> +		first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> +					 (uint)first_bit + nbits + 1);
> +	} while (first_bit != -1);
> +
> +	return;
> +
> +slow_scan:
> +	ASSERT(bp->b_addr == NULL);
>  	last_bit = first_bit;
>  	nbits = 1;
>  	for (;;) {
> @@ -300,7 +372,7 @@ xfs_buf_item_format_segment(
>  			blfp->blf_size++;
>  			break;
>  		} else if (next_bit != last_bit + 1 ||
> -			   xfs_buf_item_straddle(bp, offset, next_bit, last_bit)) {
> +			   xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
>  			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
>  						first_bit, nbits);
>  			blfp->blf_size++;
> -- 
> 2.28.0
> 
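(To be concrete, the "not +1" suggestion above amounts to something
like the following, untested:

	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
				 (uint)first_bit + nbits);

i.e. start the next search at the first bit past the contiguous range
and let xfs_next_bit() skip the clear bit itself, rather than encoding
into the caller the assumption that the bit at first_bit + nbits must
be clear.)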