[patched in the extra case from your subsequent reply]

On Tue, Feb 18, 2014 at 12:10:16PM -0500, Brian Foster wrote:
> On 02/11/2014 01:46 AM, Dave Chinner wrote:
> > On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
> >> Create the xfs_calc_finobt_res() helper to calculate the finobt log
> >> reservation for inode allocation and free. Update
> >> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
> >> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
> >> reserve blocks for the potential finobt record insertion on inode
> >> free (i.e., if an inode chunk was previously fully allocated).
> >>
> >> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> >> ---
> >>  fs/xfs/xfs_inode.c       |  4 +++-
> >>  fs/xfs/xfs_trans_resv.c  | 47 +++++++++++++++++++++++++++++++++++++++++++----
> >>  fs/xfs/xfs_trans_space.h |  7 ++++++-
> >>  3 files changed, 52 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> >> index 001aa89..57c77ed 100644
> >> --- a/fs/xfs/xfs_inode.c
> >> +++ b/fs/xfs/xfs_inode.c
> >> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
> >>  	int			error;
> >>
> >>  	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> >> -	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
> >> +	tp->t_flags |= XFS_TRANS_RESERVE;
> >> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
> >> +				  XFS_IFREE_SPACE_RES(mp), 0);
> >
> > Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
> > used here, and why its use won't lead to accelerated reserve pool
> > depletion?
>
> So this aspect of things appears to be a bit more interesting than I
> originally anticipated. I "reserve enabled" this transaction to
> facilitate the ability to free up inodes under ENOSPC conditions.
> Without this, the problem of failing out of xfs_inactive_ifree() (and
> leaving an inode chained on the unlinked list) is easily reproducible
> with generic/083.

*nod*

> The basic argument for why this is reasonable is that releasing an inode
> releases used space (i.e., file blocks and potentially directory blocks
> and inode chunks over time). That said, I can manufacture situations
> where this is not the case. E.g., allocate a bunch of 0-sized files,
> consume remaining free space in some separate file, start removing
> inodes in a manner that removes a single inode per chunk or so. This
> creates a scenario where the inobt can be very large and the finobt very
> small (likely a single record). Removing the inodes in this manner
> reduces the likelihood of freeing up any space and thus rapidly grows
> the finobt towards the size of the inobt without any free space
> available. This might or might not qualify as sane use of the fs, but I
> don't think the failure scenario is acceptable as things currently stand.

Right, that can happen. But my question is this: how realistic is it
that someone hits ENOSPC with enough zero length files around to
trigger this?

I've never seen an application or user try to store any significant
number of zero length files, so I suspect this is a theoretical
problem, not a practical one. Indeed, the finobt only needs to grow a
block for every 250-odd records on a 4k block size filesystem. Hence,
IMO, the default reserve pool size of 8192 filesystem blocks is going
to be sufficient for most users; i.e. the case you are talking about
requires (ignoring node block usage for simplicity) 250 * 8192 * 64 =
131 million zero length inodes to be present in the filesystem to have
this "1 inode per chunk" freeing pattern exhaust the default reserve
pool with finobt tree allocations....
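[Editorial aside: spelling that arithmetic out with the same round
numbers used above (~250 finobt records per 4k leaf block, 64 inodes
per chunk, the default 8192 block reserve pool, node blocks ignored).
A standalone sketch, not code from the patch:]

#include <stdio.h>

int main(void)
{
	const long recs_per_block = 250;	/* finobt records per 4k leaf block */
	const long reserve_blocks = 8192;	/* default reserve pool size, in blocks */
	const long inodes_per_chunk = 64;	/* one finobt record covers one chunk */

	/*
	 * Worst case "1 free inode per chunk" pattern: every chunk carries
	 * its own finobt record, so exhausting the reserve pool purely on
	 * finobt leaf blocks requires this many inodes in the filesystem.
	 */
	printf("%ld\n", recs_per_block * reserve_blocks * inodes_per_chunk);
	/* 131072000, i.e. ~131 million zero length inodes */
	return 0;
}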
> I think there are several ways this can go from here. A couple ideas
> that have crossed my mind:
>
> - Find a way to variably reserve the number of blocks that would be
> required to grow the finobt to the finobt, based on current state. This

Not sure what "grow the finobt to the finobt" means. There's a typo
or a key word missing there ;)

> would require the total number of blocks (not just enough for a split),
> so this could get complex and somewhat overbearing (i.e., a lot of space
> could be quietly reserved, current tracking might not be sufficient and
> the allocation paths could get hairy).

Doesn't seem worth the complexity to me.

> - Work to push the ifree transaction allocation and reservation to the
> unlink codepath rather than the eviction codepath. Under normal
> circumstances, chain the tp to the xfs_inode such that the eviction code
> path can grab it and run. This prevents us going into the state where an
> inode is unlinked without having enough space to free up. On the flip
> side, ENOSPC on unlink isn't very forgiving behavior to the user.

That's the long term plan anyway - to move to background freeing of
the inodes once they are on the unlinked list and unreferenced by the
VFS. But, really, once the inode is on the unlinked list we can
probably ignore the ENOSPC problem because we know that it is
unlinked. Indeed, the long term plan (along with background freeing)
is to allow inode allocation direct from the unlinked lists, and that
means we could leave the inodes on the unlinked lists and not care
about the ENOSPC problem at all ;)

> - Add some state or flags bits to the finobt and the associated
> ability to kill/invalidate it at runtime. Print a warning with
> regard to the situation that indicates performance might be
> affected and a repair is required to re-enable.

We've already got that state through the unlinked lists. Again, go
back to the RFD series and look through the followup work....

> I think the former approach is probably overkill for something that
> might be a pathological situation. The latter approach is simpler,
> but it feels like a bit of a hack. I've experimented with it a bit, but
> I'm not quite sure yet if it introduces any transaction issues by
> allocating the unlink and ifree transactions at the same time.
>
> Perhaps another argument could be made that it's rather unlikely we run
> into an fs with as many 0-sized (or sub-inode chunk sized) files as
> required to deplete the reserve pool without freeing any space, and we
> should just touch up the failure handling. E.g.:
>
> 1.) Continue to reserve enable the ifree transaction. Consider expanding
> the reserve pool on finobt-enabled fs' if appropriate. Note that this is
> not guaranteed to provide enough resources to populate the finobt to the
> level of the inobt without freeing up more space.
> 2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
> If that fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
> 3.) Attempt the XFS_TRANS_RESERVE reservation. If that fails, xfs_notice()
> and shut down.

I don't think we need to shut down. Indeed, there's no point in doing
the !XFS_TRANS_RESERVE reservation in the first place because a
warning will just generate unnecessary noise in the logs.
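[Editorial aside: concretely, the "keep it simple" option is the
xfs_inactive_ifree() hunk quoted at the top plus the explanatory
comment that was asked for there. A rough sketch only -- the comment
wording and the error path below are illustrative, not taken from the
patch:]

	tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);

	/*
	 * Freeing the inode may require a record insert into the finobt,
	 * which may in turn need to allocate blocks for the tree. Mark the
	 * transaction XFS_TRANS_RESERVE so it can dip into the reserve block
	 * pool at ENOSPC: failing the reservation here would strand an
	 * unreferenced inode on the unlinked list, while completing the free
	 * generally gives space back, and the finobt grows so slowly (one
	 * block per ~250 records at 4k block size) that reserve pool
	 * depletion is not expected in practice.
	 */
	tp->t_flags |= XFS_TRANS_RESERVE;
	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
				  XFS_IFREE_SPACE_RES(mp), 0);
	if (error) {
		/* illustrative only -- see the patch for the real handling */
		xfs_trans_cancel(tp, 0);
		return error;
	}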
Realistically, we can leave inodes on the unlinked list indefinitely
without causing any significant problems, except for there being used
space that users can't account for from the namespace. Log recovery
cleans them up when it runs, or blows away the unlinked list when it
fails, and that results in leaked inodes. If we get to that point,
xfs_repair will clean it up just fine unless there's still not enough
space. At that point, it's not a problem we can solve with tools - the
user has to free up some space in the filesystem....

> And this could probably be made more intelligent to bail out sooner if
> we repeat XFS_TRANS_RESERVE reservations without freeing up any space,
> etc. Before going too far in one direction... thoughts?

Right now, I just don't think it is a case we need to be particularly
concerned with. There are plenty of theoretical issues that can occur
(including data loss) when the reserve pool is depleted because of
prolonged ENOSPC issues, but the reality is that the only place we see
this code being exercised is by the tests in xfstests that
intentionally trigger reserve pool depletion....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs