On Wed, Nov 22, 2017 at 07:21:01AM -0500, Brian Foster wrote:
> On Wed, Nov 22, 2017 at 01:26:45PM +1100, Dave Chinner wrote:
> > On Tue, Nov 21, 2017 at 12:35:57PM -0500, Brian Foster wrote:
> > > cc linux-xfs
> > >
> > > On Tue, Nov 21, 2017 at 10:51:10AM -0600, Mark Tinguely wrote:
> > > > STATIC uint
> > > > xfs_calc_ifree_reservation(
> > > >         struct xfs_mount *mp)
> > > > {
> > > >         return XFS_DQUOT_LOGRES(mp) +
> > > >                 mp->m_sb.sb_inodesize +
> > > >                 mp->m_sb.sb_sectsize +
> > > >                 mp->m_sb.sb_sectsize +
> > > >                 XFS_FSB_TO_B(mp, 1) +
> > > >                 MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
> > > >                     XFS_INODE_CLUSTER_SIZE(mp)) +
> > > >                 128 * 5 +
> > > >                 XFS_ALLOCFREE_LOG_RES(mp, 1) +
> > > >                 128 * (2 + XFS_IALLOC_BLOCKS(mp) + mp->m_in_maxlevels +
> > > >                        XFS_ALLOCFREE_LOG_COUNT(mp, 1));
> > > > }
> > > > ...
> > >
> > > Indeed, this looks fairly similar to the line cited above from the
> > > current code (modulo the xfs_calc_buf_res() refactoring).
> >
> > Right, the refactoring should have been a straight conversion of the
> > code - and not change any of the reservations. If they were wrong
> > before the refactoring, they will still be wrong now.
> >
> > > I'm curious
> > > why we only consider the buffer overhead (i.e., 128 bytes *
> > > ->m_in_maxlevels) of the ->m_in_maxlevels value rather than the full
> > > buffer size.
> >
> > That just looks like a bug. Earlier on in the calculation we
> > take into account:
> >
> > * the inode btree entry: block size
> > ....
> >         xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> >
> > So we've accounted for the AGI btree block the record being modified
> > exists in. But we don't account for tree shape changes....
> >
> > .... and, as usual, knowing the history of the code tells me the
> > probable reason why the reservation is like this.
> >
> > i.e. inodes could not be removed from disk when the original
> > reservation was defined. Once allocated, they never got
> > removed, so an ifree transaction only ever touched the inode btree
> > block the chunk record was in. So, a quick dive into history from
> > the XFS archive with git blame:
> >
>
> Ah, I wasn't aware of that. That definitely helps explain the current
> reservation.
>
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 * In freeing an inode we can modify:
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 * the inode being freed: inode size
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 * the super block free inode counter: sector size
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 * the agi hash list and counters: sector size
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 * the inode btree entry: block size
> > cd235688e046e (Adam Sweeney 1995-08-31 * the on disk inode before ours in the agi hash list: inode cluster size
> > 254f6311ed1b7 (Steve Lord 2003-10-06 * the inode btree: max depth * blocksize
> > 254f6311ed1b7 (Steve Lord 2003-10-06 * the allocation btrees: 2 trees * (max depth - 1) * block size
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 */
> > d3c15dd08b409 (Russell Cattelan 2003-04-15 #define XFS_CALC_IFREE_LOG_RES(mp) \
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 ((mp)->m_sb.sb_inodesize + \
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 (mp)->m_sb.sb_sectsize + \
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 (mp)->m_sb.sb_sectsize + \
> > 814994f5c75f6 (Adam Sweeney 1994-09-23 XFS_FSB_TO_B((mp), 1) + \
> > 0139225509eb1 (Steve Lord 2001-09-11 MAX((__uint16_t)XFS_FSB_TO_B((mp), 1), XFS_INODE_CLUSTER_SIZE(mp)) + \
> > 254f6311ed1b7 (Steve Lord 2003-10-06 (128 * 5) + \
> > 254f6311ed1b7 (Steve Lord 2003-10-06 (128 * (2 + XFS_IALLOC_BLOCKS(mp) + XFS_IN_MAXLEVELS(mp) + \
> > 254f6311ed1b7 (Steve Lord 2003-10-06 XFS_ALLOCFREE_LOG_COUNT(mp, 1))))
> > 254f6311ed1b7 (Steve Lord 2003-10-06
> >
> > Without even looking I know what those 2003 changes were. For
> > everyone else:
> >
> > 254f6311ed1b Implement deletion of inode clusters in XFS.
> >
> > So, yeah, this looks like a bug in the original reservation
> > calculation for freeing inode clusters. I suspect it got messed up
> > because the buffers covering XFS_IALLOC_BLOCKS (the inode chunk) get
> > marked stale and so their contents are not logged. That means just
> > the header+blf is logged and so the reservation is correct for them.
> > That isn't the case for the inobt blocks themselves, however.
> >
>
> Yep, that's what I figured for ->m_ialloc_blks.
>
> > i.e. it should be calculating:
> >
> > ....
> > * the inode chunk blocks are stale, so only headers get logged
> > * the inode btree: max depth * blocksize
> > ....
> >         xfs_calc_buf_res(2 + mp->m_ialloc_blks, 0) +
> >         xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > ....
> >
>
> Sounds about right.. though I think we should also replace the old
> single block reservation ("the inode btree entry: block size") as well,
> since that is now a subset of the m_in_maxlevels res.
>
> > I'm still not 100% sure what the extra "2 +" is accounting for
> > there, though, so to be safe that might need to be accounted as full
> > blocks, not log headers.
> >
>
> Same.. there is still a bit of cruft in this calculation that really
> isn't clear. That +2 is one part, and then there's another
> xfs_calc_buf_res(1, 0) right before the m_ialloc_blks line. The fact
> that it's a separate line implies it is logically separate, but who
> knows.
>
> I suppose I'll keep that line as is, replace the inobt entry bit with
> the full inobt calculation (as above), keep the +2 associated with the
> inobt calculation and run some tests on that.
>
> > Good find, Brian!
> >
> > Of course, that doesn't mean it will fix the transaction overrun
> > being reported because it's overrunning the freespace tree
> > reservations rather than the inobt reservation. It certainly won't
> > hurt, though. :P
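
For reference, here is roughly the reworked reservation I'm planning to
test, following the plan above: keep the xfs_calc_buf_res(1, 0) line as
is, count the inode chunk buffers as headers only, and account the inobt
at max depth as full blocks, with the legacy "+2" kept with the inobt
term to be safe. This is only a sketch - untested, and the unchanged
terms are written out from the old comment block quoted above rather
than copied verbatim from the current code, so the helper and field
names used for them (xfs_calc_inode_res(), m_inode_cluster_size, etc.)
may not match the tree exactly:

STATIC uint
xfs_calc_ifree_reservation(
        struct xfs_mount        *mp)
{
        return XFS_DQUOT_LOGRES(mp) +
                /* the inode being freed: inode size */
                xfs_calc_inode_res(mp, 1) +
                /* the sb free inode counter and the agi: sector size each */
                xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
                /* the on disk inode before ours in the agi unlinked list */
                max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size) +
                /* the not-yet-understood single header res, kept as is */
                xfs_calc_buf_res(1, 0) +
                /* the inode chunk blocks are stale, only headers get logged */
                xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
                /*
                 * the inode btree: max depth * block size, replacing the old
                 * single block "inode btree entry" res; the legacy +2 is
                 * counted as full blocks here to be safe
                 */
                xfs_calc_buf_res(2 + mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
                /* the allocation btrees: 2 trees * (max depth - 1) * block size */
                xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
                                 XFS_FSB_TO_B(mp, 1));
}
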
> >
>
> Right.. technically this is a separate issue, but I actually think it may
> end up resolving the overrun in most cases. I tested a quick hack to
> change the 0 in the existing code to XFS_FSB_TO_B(mp, 1) before I
> realized that the m_ialloc_blks part still needs to be separate. That
> survived all my tests, but it's also larger than what we've settled on
> here. I think the overrun has been within m_in_maxlevels blocks worth of
> bytes in most cases. I do recall seeing occasional larger instances,
> but I don't recall the exact numbers. I'll give this some testing and
> see if we need to also bolt on some kind of agfl fixup limit.
>

Quick update... I ended up still hitting an overrun with this fix in
place (actually, the original bug report shows an overrun of ~32k, which
should have indicated that was going to be the case). In any event, that
has me thinking about tweaking how we manage the agfl, which brings up a
couple of other issues.

Firstly, this is a large fs, so we have 1TB AGs and the associated 16
entry agfls. The min agfl allocation is 8 at the time of the overrun,
which is the maximum case for such a filesystem (4 per space btree).
IIUC, as much as we need to keep the agfl populated with a minimum of 8
blocks in this case, we conversely also need to make sure we have 8 free
slots, yes? If so, I'm not sure that being lazy about freeing blocks is
an option, because I have at least one trace where it takes 3 agfl block
frees just to get the agfl from 9 down to the required 8.

Unrelated to the overrun issue, I also notice that we use the agfl for
the rmapbt. That adds another 5 blocks in the maximum case, which means
a 1TB AG can have a min agfl requirement of 4+4+5=13 blocks allocated of
16 slots. Unless I'm missing something in the reasoning above, aren't we
kind of playing with fire in that case? E.g., what prevents us from
doing a 4-level join on one of the trees with a 13/16 agfl and running
out of space to free btree blocks?

This has me wondering whether we need some kind of larger/dynamic agfl
that facilitates the additional use (and potentially lazier management
to resolve the "too many fixups" issue). Thoughts?

Brian

> Thanks for the feedback!
>
> Brian
>
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
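
P.S. To put rough numbers behind the "playing with fire" question above:
the snippet below just restates the values from this report (4 blocks
minimum per free space btree, 5 for the rmapbt in the maximum case, 16
agfl slots on this fs) rather than deriving them from the allocator
code, so treat it purely as an illustration of how little slack is left:

#include <stdio.h>

int main(void)
{
        const unsigned agfl_slots = 16; /* agfl entries on this fs */
        const unsigned bno_min = 4;     /* by-block free space btree, max case */
        const unsigned cnt_min = 4;     /* by-count free space btree, max case */
        const unsigned rmap_min = 5;    /* rmap btree, max case */
        const unsigned min_alloc = bno_min + cnt_min + rmap_min;

        /* 13 of 16 slots pinned leaves only 3 to absorb freed btree blocks */
        printf("min agfl blocks %u of %u slots, %u free\n",
               min_alloc, agfl_slots, agfl_slots - min_alloc);
        return 0;
}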