On Mon, Aug 19, 2019 at 10:57:59AM +0200, Jan Kara wrote: > Hello, > > I've recently got a bug report where JBD2 assertion failed due to > transaction commit running out of journal space. After closer inspection of > the crash dump it seems that the problem is that there were too many > journal descriptor blocks (more that max_transaction_size >> 5 + 32 we > estimate in jbd2_log_space_left()) due to descriptor blocks with revoke > records. In fact the estimate on the number of descriptor blocks looks > pretty arbitrary and there can be much more descriptor blocks needed for > revoke records. We need one revoke record for every metadata block freed. > So in the worst case (1k blocksize, 64-bit journal feature enabled, > checksumming enabled) we fit 125 revoke record in one descriptor block. In > common cases its about 500 revoke records per descriptor block. Now when > we free large directories or large file with data journalling enabled, we can > have *lots* of blocks to revoke - with extent mapped files easily millions > in a single transaction which can mean 10k descriptor blocks - clearly more > than the estimate of 128 descriptor blocks per transaction ;) Can jbd2 make the jbd2_journal_revoke caller wait until it has checkpointed the @blocknr block if it has run out of revoke record space? > Now users clearly don't hit this problem frequently so this is not common > case but still it is possible and malicious user could use this to DoS the > machine so I think we need to get even the weird corner-cases fixed. The > question is how because as sketched above the worst case is too bad to > account for in the common case. I have considered three options: > > 1) Count number of revoke records currently in the transaction and add > needed revoke descriptor blocks to the expected transaction size. This is > easy enough but does not solve all the corner cases - single handle > can add lot of revoke blocks which may overflow the space we reserve for > descriptor blocks. > > 2) Add argument to jbd2_journal_start() telling how many metadata blocks we > are going to free and we would account necessary revoke descriptor blocks > into reserved credits. This could work, we would generally need to pass > inode->i_blocks / blocksize as the estimate of metadata blocks to free (for > inodes to which this applies) as we don't have better estimate but I guess > that's bearable. It would require some changes on ext4 side but not too > intrusive. What happens if iblocks / blocksize revoke records exceeds the size of the journal? --D > 3) Use the fact that we need to revoke only blocks that are currently in > the journal. Thus the number of revoke records we really may need to store > is well bound (by the journal size). What is a bit painful is tracking of > which blocks are journalled. We could use a variant of counting Bloom > filters to store that information with low memory consumption (say 64k of > memory in common case) and high-enough accuracy but still that will be some > work to write. On the plus side it would reduce the amount revoke records > we have to store even in common case. > > Overall I'm probably leaning towards 2) but I'm happy to hear more opinions > or ideas :) > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR