Re: [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 7 Apr 2020 08:57:58 +1000

On Mon, Apr 06, 2020 at 11:25:40AM -0400, Brian Foster wrote:
> On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> > Log space reservation for xattr insert operation is divided into two
> > parts,
> > 1. Mount time
> >    - Inode
> >    - Superblock for accounting space allocations
> >    - AGF for accounting space used by count, block number, rmap and refcnt
> >      btrees.
> > 
> > 2. The remaining log space can only be calculated at run time because,
> >    - A local xattr can be large enough to cause a double split of the da
> >      btree.
> >    - The value of the xattr can be large enough to be stored in remote
> >      blocks. The contents of the remote blocks are not logged.
> > 
> >    The log space reservation could be,
> >    - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is required in
> >      case xattr is large enough to cause another split of the da btree path.
> >    - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
> >      entries.
> >    - Space for logging blocks of count, block number, rmap and refcnt btrees.
> > 
> > At present, mount time log reservation includes block count required for a
> > single split of the dabtree. The dabtree block count is also taken into
> > account by xfs_attr_calc_size().
> > 
> > Also, AGF log space reservation isn't accounted for.
> > 
> > Due to the reasons mentioned above, log reservation calculation for xattr
> > insert operation gives an incorrect value.
> > 
> > Apart from the above, xfs_log_calc_max_attrsetm_res() passes byte count as
> > an argument to XFS_NEXTENTADD_SPACE_RES() instead of block count.
> > 
> > The above mentioned inconsistencies were discoverd when trying to mount a
> > modified XFS filesystem which uses a 32-bit value as xattr extent counter
> > caused the following warning messages to be printed on the console,
> > 
> > XFS (loop0): Mounting V4 Filesystem
> > XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
> > XFS (loop0): Log size out of supported range.
> > XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
> > XFS (loop0): Ending clean mount
> > 
> > To fix the inconsistencies described above, this commit replaces 'mount'
> > and 'runtime' components with just one static reservation. The new
> > reservation calculates the log space for the worst case possible i.e. it
> > considers,
> > 1. Double split of the da btree.
> >    This happens for large local xattrs.
> > 2. Bmbt blocks required for mapping the contents of a maximum
> >    sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.
> > 
> 
> Hmm.. so the last I recall looking at this, the change was more around
> refactoring the mount vs. runtime portions of the xattr reservation to
> be more accurate. This approach eliminates the runtime portion for a
> 100% mount time reservation calculation. Can you elaborate on why the
> change in approach? Also, it looks like at least one tradeoff here is
> reservation size (to the point where we increase min log size on small
> filesystems?).

What's not in this commit message is that this was actually my idea
that I had when Chandan contacted me off list about his refactoring
of the reservation blowing out reservations for attribute operations
by a factor of 10.

My fault, I should have pushed the discussion back to the mailing
list rather than answering directly.  I'll repeat a lot of my
analysis from that discussion below to get everyone up to speed.

[ Chandan, in future I'm going to insist that all your XFS questions
need to be on the list, so that everyone sees the discusions and
understands the reasons why things are suggested. It's also a good
idea to use "suggested-by" when presenting code based on other
people's ideas, just so that everyone knows that there were more
people involved than just yourself... ]

So, when I went through all the reservations changes that Chandan
had made I realised that the current code was wrong in lots of ways,
and when I looked at it from the fundamental changes being made the
mount vs runtime split made no sense at all.

Such as:

- the dabtree double split was a double _leaf block split only_. It
  is not a full tree split, and can only result in a single parent
  split because there is only one path update after the double leaf
  split has been done. Hence it can only do one full dabtree split
  and the code in xfs_attr_calc_size() that doubles the block count
  reservation for the double leaf split is wrong.  We only need on
  extra block, and that is just:

dgc>  #define XFS_DAENTER_DBS(mp,w)        \
dgc> -     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
dgc> +     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))

[ Note: The "+ 2" for the data fork reservation is for the dir data
block and a potential free space index block that get added in a
typical directory entry addition. ]

- remote attributes are not logged, so only the BMBT block
  reservation is needed for that extent allocation. i.e. we need to
  reserve the blocks for the xattr, but we don't need log space
  for them.  xfs_attr_calc_size() gets this right,
  xfs_log_calc_max_attrsetm_res() gets this wrong in that it does
  not take into account remote attr BMBT blocks at all.

- xfs_attr_calc_size() calculates the number of blocks we need to
  allocate for the attr operation, not the number of blocks we need
  to log, hence can't be used to replace
  xfs_log_calc_max_attrsetm_res().

- the runtime reservation is just a BMBT block reservation for a
  single block to be allocated. multiplying the number of blocks we
  need to allocate by M_RES(mp)->tr_attrsetrt.tr_logres to get the
  log reservation is wrong. We are not doing a full BMBT split for
  every block in the attribute we modify, so the log reservation is
  massively oversized by xfs_log_calc_max_attrsetm_res() and
  xfs_attr_set() by multipling the block count (including BMBT
  blocks we allocate) by a full bmbt split reservation.

IOWs, the code as it stands now is just wrong. It works because it
massively oversizes the runtime reservations, but that in itself is
a problem.  To quote myself again from that analysis:

dgc> The log reservation that covers both local and remote attributes:
dgc>
dgc> blks =  full dabtree split + 1 leaf block + bmbt blocks
dgc> blks += nextent_res(MAX_ATTR_LEN/block size) // bmbt blocks only
dgc> resv =	inode + sb + agf +
dgc>		xfs_calc_buf_res(blks) +
dgc>		allocfree_log_count(blks);

THe first line takes into account the blocks we modify in a local
attribute tree modification. The second line takes into account the
BMBT logging overhead of a remote attribute. The "resv" calculation
converts that modified block count into a log reservation and adds
the freespace tree logging overhead of allocating all those blocks.

The only thing that is variable at runtime now is the size of the
remote attribute, but we already have a log reservation for the
allocation and BMBT block modification side of that and so we only
need to physically reserve the block space (i.e. via
block count passed to xfs_trans_alloc()).

IOWs, the log reservation does not need to change at runtime now.

It also makes it clear that changing the attr fork extent count from
16 to 32 bits should only impact the BMBT reservations.  The dabtree
reservations already use XFS_DA_NODE_MAXDEPTH for the attr fork and
hence so they already are sized for max dabtree depth reservations.

As a result, the attr reservation itself should not grow excessively
for 32bit attribute fork extent counts. It should maybe 20-30 blocks
on a 4kb block size filesystem as we add 4-5 levels to the max depth
of the BMBT on the attribute fork. It should certainly not grow by
400 blocks as the original reworking resulted in...

Again, sorry for not getting this discussion out onto the mailing
list originally, it should have been there.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx