On Tuesday, April 7, 2020 4:27 AM Dave Chinner wrote:
> On Mon, Apr 06, 2020 at 11:25:40AM -0400, Brian Foster wrote:
> > On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> > > Log space reservation for the xattr insert operation is divided into
> > > two parts,
> > > 1. Mount time
> > >    - Inode
> > >    - Superblock for accounting space allocations
> > >    - AGF for accounting space used by the count, block number, rmap
> > >      and refcnt btrees.
> > >
> > > 2. The remaining log space can only be calculated at run time because,
> > >    - A local xattr can be large enough to cause a double split of the
> > >      da btree.
> > >    - The value of the xattr can be large enough to be stored in remote
> > >      blocks. The contents of the remote blocks are not logged.
> > >
> > > The log space reservation could be,
> > >    - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is
> > >      required in case the xattr is large enough to cause another
> > >      split of the da btree path.
> > >    - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
> > >      entries.
> > >    - Space for logging blocks of the count, block number, rmap and
> > >      refcnt btrees.
> > >
> > > At present, the mount time log reservation includes the block count
> > > required for a single split of the dabtree. The dabtree block count is
> > > also taken into account by xfs_attr_calc_size().
> > >
> > > Also, the AGF log space reservation isn't accounted for.
> > >
> > > Due to the reasons mentioned above, the log reservation calculation
> > > for the xattr insert operation gives an incorrect value.
> > >
> > > Apart from the above, xfs_log_calc_max_attrsetm_res() passes a byte
> > > count as an argument to XFS_NEXTENTADD_SPACE_RES() instead of a block
> > > count.
> > >
> > > The above mentioned inconsistencies were discovered when trying to
> > > mount a modified XFS filesystem which uses a 32-bit value as the xattr
> > > extent counter; the mount caused the following warning messages to be
> > > printed on the console,
> > >
> > > XFS (loop0): Mounting V4 Filesystem
> > > XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
> > > XFS (loop0): Log size out of supported range.
> > > XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
> > > XFS (loop0): Ending clean mount
> > >
> > > To fix the inconsistencies described above, this commit replaces the
> > > 'mount' and 'runtime' components with just one static reservation. The
> > > new reservation calculates the log space for the worst case possible,
> > > i.e. it considers,
> > > 1. A double split of the da btree.
> > >    This happens for large local xattrs.
> > > 2. Bmbt blocks required for mapping the contents of a maximum
> > >    sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.
> > >
> >
> > Hmm.. so the last I recall looking at this, the change was more around
> > refactoring the mount vs. runtime portions of the xattr reservation to
> > be more accurate. This approach eliminates the runtime portion for a
> > 100% mount time reservation calculation. Can you elaborate on why the
> > change in approach? Also, it looks like at least one tradeoff here is
> > reservation size (to the point where we increase min log size on small
> > filesystems?).
>
> What's not in this commit message is that this was actually my idea
> that I had when Chandan contacted me off list about his refactoring
> of the reservation blowing out reservations for attribute operations
> by a factor of 10.
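
[ For reference, the two-part reservation described in the commit message
  above currently plays out in xfs_attr_set() roughly as follows; this is
  paraphrased from the existing code with setup and error handling
  trimmed, so treat it as a sketch rather than an exact quote:

	/* block count the operation may need to allocate */
	args.total = xfs_attr_calc_size(&args, &local);

	/* static mount time part plus a per-block runtime part */
	tres.tr_logres = M_RES(mp)->tr_attrsetm.tr_logres +
			 M_RES(mp)->tr_attrsetrt.tr_logres * args.total;
	tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
	tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;

	/* args, tres and rsvd are the locals used by xfs_attr_set() */
	error = xfs_trans_alloc(mp, &tres, args.total, 0,
			rsvd ? XFS_TRANS_RESERVE : 0, &args.trans);

  i.e. the runtime component multiplies tr_attrsetrt by every block the
  operation may allocate, which is the oversizing discussed below. ]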
>
> My fault, I should have pushed the discussion back to the mailing
> list rather than answering directly. I'll repeat a lot of my
> analysis from that discussion below to get everyone up to speed.
>
> [ Chandan, in future I'm going to insist that all your XFS questions
> need to be on the list, so that everyone sees the discussions and
> understands the reasons why things are suggested. It's also a good
> idea to use "Suggested-by" when presenting code based on other
> people's ideas, just so that everyone knows that there were more
> people involved than just yourself... ]
>
> So, when I went through all the reservation changes that Chandan
> had made I realised that the current code was wrong in lots of ways,
> and when I looked at it from the fundamental changes being made, the
> mount vs runtime split made no sense at all.
>
> Such as:
>
> - the dabtree double split is a double _leaf block split only_. It
>   is not a full tree split, and can only result in a single parent
>   split because there is only one path update after the double leaf
>   split has been done. Hence it can only do one full dabtree split,
>   and the code in xfs_attr_calc_size() that doubles the block count
>   reservation for the double leaf split is wrong. We only need one
>   extra block, and that is just:
>
> dgc> #define XFS_DAENTER_DBS(mp,w)	\
> dgc> -	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
> dgc> +	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
>
>   [ Note: The "+ 2" for the data fork reservation is for the dir data
>   block and a potential free space index block that get added in a
>   typical directory entry addition. ]
>
> - remote attributes are not logged, so only the BMBT block
>   reservation is needed for that extent allocation. i.e. we need to
>   reserve the blocks for the xattr, but we don't need log space
>   for them. xfs_attr_calc_size() gets this right;
>   xfs_log_calc_max_attrsetm_res() gets this wrong in that it does
>   not take into account remote attr BMBT blocks at all.
>
> - xfs_attr_calc_size() calculates the number of blocks we need to
>   allocate for the attr operation, not the number of blocks we need
>   to log, hence it can't be used to replace
>   xfs_log_calc_max_attrsetm_res().
>
> - the runtime reservation is just a BMBT block reservation for a
>   single block to be allocated. Multiplying the number of blocks we
>   need to allocate by M_RES(mp)->tr_attrsetrt.tr_logres to get the
>   log reservation is wrong. We are not doing a full BMBT split for
>   every block in the attribute we modify, so the log reservation is
>   massively oversized by xfs_log_calc_max_attrsetm_res() and
>   xfs_attr_set() multiplying the block count (including the BMBT
>   blocks we allocate) by a full bmbt split reservation.
>
> IOWs, the code as it stands now is just wrong. It works because it
> massively oversizes the runtime reservations, but that in itself is
> a problem. To quote myself again from that analysis:
>
> dgc> The log reservation that covers both local and remote attributes:
> dgc>
> dgc> blks = full dabtree split + 1 leaf block + bmbt blocks
> dgc> blks += nextent_res(MAX_ATTR_LEN/block size) // bmbt blocks only
> dgc> resv = inode + sb + agf +
> dgc>        xfs_calc_buf_res(blks) +
> dgc>        allocfree_log_count(blks);
>
> The first line takes into account the blocks we modify in a local
> attribute tree modification. The second line takes into account the
> BMBT logging overhead of a remote attribute. The "resv" calculation
> converts that modified block count into a log reservation and adds
> the freespace tree logging overhead of allocating all those blocks.
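
[ To put the quoted formula in code form: below is a rough sketch of what
  a single static attr set log reservation could look like, built from
  the existing helpers in fs/xfs/libxfs/xfs_trans_resv.c. The function
  name xfs_calc_attr_set_reservation() is made up for illustration, and
  the exact way the terms are combined (in particular how many alloc/free
  operations get charged) is not meant to be the actual patch:

	STATIC uint
	xfs_calc_attr_set_reservation(
		struct xfs_mount	*mp)
	{
		/* full dabtree path split plus one extra leaf block */
		uint	dablks = XFS_DA_NODE_MAXDEPTH + 1;
		/* plus the BMBT blocks needed to map those dabtree blocks */
		uint	blks = dablks + XFS_NEXTENTADD_SPACE_RES(mp, dablks,
							XFS_ATTR_FORK);

		/*
		 * BMBT blocks only for a maximum sized remote attr; the
		 * remote value itself is never logged.
		 */
		blks += XFS_NEXTENTADD_SPACE_RES(mp,
				XFS_B_TO_FSB(mp, XATTR_SIZE_MAX),
				XFS_ATTR_FORK);

		/*
		 * Inode + superblock + AGF headers, the modified blocks,
		 * and the free space btree overhead of allocating them
		 * (shown here for a single allocation).
		 */
		return xfs_calc_inode_res(mp, 1) +
		       xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
		       xfs_calc_buf_res(blks, XFS_FSB_TO_B(mp, 1)) +
		       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
					XFS_FSB_TO_B(mp, 1));
	}

  Nothing here depends on the size of the attr being set at runtime; the
  variable inputs are bounded by XFS_DA_NODE_MAXDEPTH and XATTR_SIZE_MAX,
  which is the point of making the reservation static. ]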
>
> The only thing that is variable at runtime now is the size of the
> remote attribute, but we already have a log reservation for the
> allocation and BMBT block modification side of that, and so we only
> need to physically reserve the block space (i.e. via the block count
> passed to xfs_trans_alloc()).
>
> IOWs, the log reservation does not need to change at runtime now.
>
> It also makes it clear that changing the attr fork extent count from
> 16 to 32 bits should only impact the BMBT reservations. The dabtree
> reservations already use XFS_DA_NODE_MAXDEPTH for the attr fork and
> hence are already sized for max dabtree depth reservations.
>
> As a result, the attr reservation itself should not grow excessively
> for 32 bit attribute fork extent counts. It should grow by maybe
> 20-30 blocks on a 4kB block size filesystem as we add 4-5 levels to
> the max depth of the BMBT on the attribute fork. It should certainly
> not grow by 400 blocks as the original reworking resulted in...
>
> Again, sorry for not getting this discussion out onto the mailing
> list originally, it should have been there.

It is actually me at fault here. I am sorry for not having the
conversation on the mailing list in the first place.

--
chandan