Re: JBD2 transaction running out of space

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 19 Aug 2019 09:17:05 -0700

On Mon, Aug 19, 2019 at 10:57:59AM +0200, Jan Kara wrote:
> Hello,
> 
> I've recently got a bug report where JBD2 assertion failed due to
> transaction commit running out of journal space. After closer inspection of
> the crash dump it seems that the problem is that there were too many
> journal descriptor blocks (more that max_transaction_size >> 5 + 32 we
> estimate in jbd2_log_space_left()) due to descriptor blocks with revoke
> records. In fact the estimate on the number of descriptor blocks looks
> pretty arbitrary and there can be much more descriptor blocks needed for
> revoke records. We need one revoke record for every metadata block freed.
> So in the worst case (1k blocksize, 64-bit journal feature enabled,
> checksumming enabled) we fit 125 revoke record in one descriptor block.  In
> common cases its about 500 revoke records per descriptor block. Now when
> we free large directories or large file with data journalling enabled, we can
> have *lots* of blocks to revoke - with extent mapped files easily millions
> in a single transaction which can mean 10k descriptor blocks - clearly more
> than the estimate of 128 descriptor blocks per transaction ;)

Can jbd2 make the jbd2_journal_revoke caller wait until it has
checkpointed the @blocknr block if it has run out of revoke record
space?

> Now users clearly don't hit this problem frequently so this is not common
> case but still it is possible and malicious user could use this to DoS the
> machine so I think we need to get even the weird corner-cases fixed. The
> question is how because as sketched above the worst case is too bad to
> account for in the common case. I have considered three options:
> 
> 1) Count number of revoke records currently in the transaction and add
> needed revoke descriptor blocks to the expected transaction size. This is
> easy enough but does not solve all the corner cases - single handle
> can add lot of revoke blocks which may overflow the space we reserve for
> descriptor blocks.
> 
> 2) Add argument to jbd2_journal_start() telling how many metadata blocks we
> are going to free and we would account necessary revoke descriptor blocks
> into reserved credits. This could work, we would generally need to pass
> inode->i_blocks / blocksize as the estimate of metadata blocks to free (for
> inodes to which this applies) as we don't have better estimate but I guess
> that's bearable. It would require some changes on ext4 side but not too
> intrusive.

What happens if iblocks / blocksize revoke records exceeds the size of
the journal?

--D

> 3) Use the fact that we need to revoke only blocks that are currently in
> the journal. Thus the number of revoke records we really may need to store
> is well bound (by the journal size). What is a bit painful is tracking of
> which blocks are journalled. We could use a variant of counting Bloom
> filters to store that information with low memory consumption (say 64k of
> memory in common case) and high-enough accuracy but still that will be some
> work to write. On the plus side it would reduce the amount revoke records
> we have to store even in common case.
> 
> Overall I'm probably leaning towards 2) but I'm happy to hear more opinions
> or ideas :)
> 
> 								Honza
> -- 
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR