On Tue, Jul 04, 2017 at 01:54:41PM -0400, Theodore Ts'o wrote:
> I'm going to be making some changes to two recent feature patches (the
> largedir feature patch and the xattr duplication patch) to fix up some
> potential bugs caused by under-estimating the number of credits needed,
> which can cause ext4 to get quite unhappy:
>
> EXT4-fs: ext4_xattr_block_set:2023: aborting transaction: error 28 in __ext4_handle_dirty_metadata
> EXT4-fs error (device dm-2): ext4_xattr_block_set:2023: inode #131078: block 589866: comm fsstress: journal_dirty_metadata failed: handle type 10 started at line 2411, credits 5/0, errcode -28
>
> While I was looking into this, I realized that there are some
> implementation details about how the jbd2 layer works that haven't
> been written down, and it might be useful to document them for
> everyone.  (This probably needs to go into the ext4 wiki, with some
> editing.)

Perhaps additionally in ext4-jbd2.h?

> In general, under-estimating the number of credits needed is far
> worse than over-estimating.  Under-estimating can cause the above,
> which will end up marking the file system corrupt.  We can actually do
> better; in fact, what we probably ought to do is try to see if we can
> extend the transaction, and print an ext4 warning plus a WARN_ON(1) to get

I don't know if that's a good idea -- I'd rather train the developers
that they cannot underestimate ever than let them paper over things
like this that will eventually blow up on someone else's machine.  But
I guess a noisy WARN would get people's attention.

> a stack dump.  If we can't continue, we will have to force an
> ext4-error and abort the journal, since if we run out of space we can
> end up in a deadlock: in order to free space in the journal, we need
> to do a commit, but we can't do a commit because handles are open, and
> we can't complete them because we're out of space.
>
> Also, a typical transaction has room for thousands of blocks, and a
> handle is usually only open for a very short time.
> Once a handle is stopped, any unused credits are released back to the
> journal.  For this reason, if we estimate that we need to modify 20
> blocks (credits), but we only need to modify 8 blocks, 4 of which have
> already been modified in the current transaction --- this is not a
> tragedy.  We only used up 4 new blocks in the journal by the time the
> handle is stopped, and that's OK.  Once the handle is stopped, the
> "cost" of the over-estimation is gone.
>
> What is the "cost" of the over-estimation?  Well, we need to make sure
> that there will be enough room in the journal to write the current
> transaction (plus have some slack space for the next transaction).  So
> if we vastly over-estimate the number of blocks needed when we start a
> handle, and the sum of all of the credits of currently started handles
> is greater than the free space in the journal minus some slack space,
> then instead of starting the new handle, the thread will block, and
> all attempts to start new handles will block, until all of the
> currently running handles have completed and the current transaction
> can start committing.
>
> So over-estimating by a few tens of credits is probably not noticeable
> at all.  Over-estimating by hundreds of credits can start causing
> performance impacts.  How?  By forcing transactions to close earlier
> than the normal 5 second timeout due to a perceived lack of space,
> when the journal is almost full due to a credit over-estimation.  Even
> so, the performance impact is not necessarily that great, and it
> typically only shows up under heavy load --- and we or the system
> administrator can mitigate this problem fairly easily by increasing
> the journal size.

/me wonders if it'd be useful to track how closely the estimates fit
reality in debugfs, to make it easier to spot-check how good of a job
we're all doing?
> So while there may be a desire, out of professional pride, to make
> the credit estimation as accurate as possible, getting the credit
> estimation exactly right is not really critical.  It's for this
> reason that we always add EXT4_INDEX_EXTRA_TRANS_BLOCKS to the
> credits without testing to see if the HTree feature is enabled.  An
> extra 8 (soon to be 12) blocks doesn't really matter all that much.

As general documentation, this seems fine as-is up to here.  You could
probably rework the last two paragraphs too.

--D

> The reason why we do try to conditionalize credits based on quotas
> is that when quota is enabled, some of the calculations for the
> number of credits needed were extremely high.  This was partially
> addressed by Jan Kara's change to move the call to dquot_initialize
> to a separate transaction.  (eb9cc7e16b3: "ext4: move quota
> initialization out of inode allocation transaction")
>
> The inode allocation path is still one where reducing the number of
> credits needed can be a win.  (It currently has a worst case of
> around 200 blocks, per Jan's comments in eb9cc7e16b3.)  Fortunately,
> there are some assumptions we can make about inode allocation to
> simplify things: (a) the ACL will be a duplicate of the directory's
> default ACL; (b) the SELinux label is generally not that huge;
> (c) the encryption metadata is also fairly small; and (d) there is
> no pre-existing inline data to worry about.  All of this means that
> we don't need to worry about the inline data or EA-in-inode
> features, and so we can probably rely on the EXT4_XATTR_TRANS_BLOCKS
> estimate instead of using the more complex calculations found in
> __ext4_xattr_set_credits().
>
> Cheers,
>
> 					- Ted