On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote: > Analysis of recent reports of log reservation overruns and code > inspection has uncovered that the reservations associated with inode > operations may not cover the worst case scenarios. In particular, > many cases only include one allocfree res. for a particular > operation even though said operations may also entail AGFL fixups > and inode btree block allocations in addition to the actual inode > chunk allocation. This can easily turn into two or three block > allocations (or frees) per operation. > > In theory, the only way to define the worst case reservation is to > include an allocfree res for each individual allocation in a > transaction. Since that is impractical (we can perform multiple agfl > fixups per tx and not every allocation is going to result in a full > tree operation), implement a reasonable compromise that addresses > the deficiency in practice without blowing out the size of the > transactions. > > Refactor the inode transaction reservation code to include one > allocfree res. per inode btree modification to cover allocations > required by the tree itself. This essentially separates the > reservation required to allocate the physical inode chunk from > additional reservation required to perform inobt record > insertion/removal. I think you missed the most important reason the inobt/finobt modifications need there own allocfree reservation - btree modifications that cause btree blocks to be freed do not use defered ops and so the freeing of blocks occurs directly within the primary transaction. Hence the primary transaction reservation must take this into account .... > Apply the same logic to the finobt reservation. > This results in killing off the finobt modify condition because we > no longer assume that the broader transaction reservation will cover > finobt block allocations. > > Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> Code looks fine. Comments below are for another follow-on patch. Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx> > @@ -387,8 +386,8 @@ xfs_calc_create_resv_modify( > * the agi and agf of the ag getting the new inodes: 2 * sectorsize > * the superblock for the nlink flag: sector size > * the inode blocks allocated: mp->m_ialloc_blks * blocksize > - * the inode btree: max depth * blocksize > * the allocation btrees: 2 trees * (max depth - 1) * block size > + * the inode btree (record insertion) > */ > STATIC uint > xfs_calc_create_resv_alloc( > @@ -397,9 +396,9 @@ xfs_calc_create_resv_alloc( > return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + > mp->m_sb.sb_sectsize + > xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) + > - xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) + > xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), > - XFS_FSB_TO_B(mp, 1)); > + XFS_FSB_TO_B(mp, 1)) + > + xfs_calc_inobt_res(mp); > } Looking at this, I'm wondering if there should also be a function for calculating the inode chunk reservation. Something like: static uint xfs_calc_inode_chunk_res( struct xfs-mount *mp, bool chunk_contents_logged) { uint res; if (chunk_contents_logged) { res = xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)); } else { /* take into account logged headers for freeing */ res = xfs_calc_buf_res(mp->m_ialloc_blks, 0); } res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), XFS_FSB_TO_B(mp, 1)); return res; } Then xfs_calc_create_resv_alloc() reads like: return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + mp->m_sb.sb_sectsize + xfs_calc_inode_chunk_res(mp, true) + xfs_calc_inobt_res(mp); > > STATIC uint > @@ -415,8 +414,8 @@ __xfs_calc_create_reservation( > * For icreate we can allocate some inodes giving: > * the agi and agf of the ag getting the new inodes: 2 * sectorsize > * the superblock for the nlink flag: sector size > - * the inode btree: max depth * blocksize > * the allocation btrees: 2 trees * (max depth - 1) * block size > + * the inobt (record insertion) > * the finobt (record insertion) > */ > STATIC uint > @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc( > { > return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + > mp->m_sb.sb_sectsize + > - xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) + > xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), > XFS_FSB_TO_B(mp, 1)) + > - xfs_calc_finobt_res(mp, 0, 0); > + xfs_calc_inobt_res(mp) + > + xfs_calc_finobt_res(mp); > } return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + mp->m_sb.sb_sectsize + xfs_calc_inode_chunk_res(mp, false) + xfs_calc_inobt_res(mp) + xfs_calc_finobt_res(mp); > > STATIC uint > @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation( > * the agi hash list and counters: sector size > * the on disk inode before ours in the agi hash list: inode cluster size > * the inode chunk is marked stale (headers only) > - * the inode btree: max depth * blocksize > - * the allocation btrees: 2 trees * (max depth - 1) * block size > + * the inode btree > * the finobt (record insertion, removal or modification) > + * > + * Note that the allocfree res. for the inode chunk itself is not included > + * because the extent free occurs after a transaction roll. We could take the > + * maximum of the pre/post roll operations, but the pre-roll reservation already > + * includes at least one allocfree res. for the inobt and is thus guaranteed to > + * be larger. > */ > STATIC uint > xfs_calc_ifree_reservation( > @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation( > xfs_calc_iunlink_remove_reservation(mp) + > xfs_calc_buf_res(1, 0) + > xfs_calc_buf_res(mp->m_ialloc_blks, 0) + > - xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) + > - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), > - XFS_FSB_TO_B(mp, 1)) + > - xfs_calc_finobt_res(mp, 0, 1); > + xfs_calc_inobt_res(mp) + > + xfs_calc_finobt_res(mp); > } ..... xfs_calc_iunlink_remove_reservation(mp) + xfs_calc_buf_res(1, 0) + xfs_calc_inode_chunk_res(mp, false) + xfs_calc_inobt_res(mp) + xfs_calc_finobt_res(mp); This seems to make more sense to me - it's clear what the individual components of the reservation are, and we can ensure that the individual components have the necessary reservation independently of the overall reservations that need them. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html