On Mon 19-08-19 09:17:05, Darrick J. Wong wrote:
> On Mon, Aug 19, 2019 at 10:57:59AM +0200, Jan Kara wrote:
> > Hello,
> >
> > I've recently got a bug report where JBD2 assertion failed due to
> > transaction commit running out of journal space. After closer inspection of
> > the crash dump it seems that the problem is that there were too many
> > journal descriptor blocks (more than the (max_transaction_size >> 5) + 32
> > we estimate in jbd2_log_space_left()) due to descriptor blocks with revoke
> > records. In fact the estimate on the number of descriptor blocks looks
> > pretty arbitrary and there can be many more descriptor blocks needed for
> > revoke records. We need one revoke record for every metadata block freed.
> > So in the worst case (1k blocksize, 64-bit journal feature enabled,
> > checksumming enabled) we fit 125 revoke records in one descriptor block. In
> > common cases it's about 500 revoke records per descriptor block. Now when
> > we free large directories or large files with data journalling enabled, we
> > can have *lots* of blocks to revoke - with extent mapped files easily
> > millions in a single transaction which can mean 10k descriptor blocks -
> > clearly more than the estimate of 128 descriptor blocks per transaction ;)
>
> Can jbd2 make the jbd2_journal_revoke caller wait until it has
> checkpointed the @blocknr block if it has run out of revoke record
> space?

That would be really hard to implement without introducing deadlocks
(checkpoint of a transaction may need to wait for the currently committing
transaction to finish commit in some cases). Also as you mention below, it
isn't even guaranteed that revoke descriptor blocks fit into a journal if we
don't limit them in some way.

> > Now users clearly don't hit this problem frequently so this is not a common
> > case but still it is possible and a malicious user could use this to DoS
> > the machine so I think we need to get even the weird corner-cases fixed.
> > The question is how because as sketched above the worst case is too bad to
> > account for in the common case. I have considered three options:
> >
> > 1) Count the number of revoke records currently in the transaction and add
> > needed revoke descriptor blocks to the expected transaction size. This is
> > easy enough but does not solve all the corner cases - a single handle
> > can add a lot of revoke blocks which may overflow the space we reserve for
> > descriptor blocks.
> >
> > 2) Add an argument to jbd2_journal_start() telling how many metadata blocks
> > we are going to free and we would account necessary revoke descriptor
> > blocks into reserved credits. This could work, we would generally need to
> > pass inode->i_blocks / blocksize as the estimate of metadata blocks to free
> > (for inodes to which this applies) as we don't have a better estimate but I
> > guess that's bearable. It would require some changes on ext4 side but not
> > too intrusive.
>
> What happens if iblocks / blocksize revoke records exceeds the size of
> the journal?

That's a good point. Doing some math this could happen when we have e.g. a
file with journalled data that is a couple of GB large. However looking into
the code we could use the fact that we actually truncate a file
one-extent-at-a-time, thus we in fact know exactly how many blocks we are
going to free, and the maximum number of blocks in an extent (65535)
generates ~524 revoke descriptor blocks in the worst case which still
reasonably fits within a transaction. So this seems fixable. Thanks for the
input!

								Honza

> > 3) Use the fact that we need to revoke only blocks that are currently in
> > the journal. Thus the number of revoke records we really may need to store
> > is well bound (by the journal size). What is a bit painful is tracking of
> > which blocks are journalled.
> > We could use a variant of counting Bloom filters to store that information
> > with low memory consumption (say 64k of memory in common case) and
> > high-enough accuracy but still that will be some work to write. On the
> > plus side it would reduce the number of revoke records we have to store
> > even in common case.
> >
> > Overall I'm probably leaning towards 2) but I'm happy to hear more opinions
> > or ideas :)
> >
> > 								Honza
> > --
> > Jan Kara <jack@xxxxxxxx>
> > SUSE Labs, CR
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
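
For anyone who wants to re-check the descriptor block arithmetic in this
thread, here is a minimal Python sketch. The byte sizes (16-byte revoke block
header, 4-byte checksum tail, 8-byte records with the 64-bit feature and
4-byte records without) are assumptions about the on-disk layout, picked
because they reproduce the 125 records per 1k block mentioned above. The
function names are hypothetical and are not part of jbd2.

```python
# Assumed on-disk sizes (not stated in the thread; chosen so that the
# 1k / 64-bit / checksummed case yields the 125 records quoted above).
REVOKE_HEADER_BYTES = 16
CSUM_TAIL_BYTES = 4


def revoke_records_per_block(blocksize, is_64bit=True, has_csum=True):
    """How many revoke records fit in one revoke descriptor block."""
    record_bytes = 8 if is_64bit else 4
    tail = CSUM_TAIL_BYTES if has_csum else 0
    return (blocksize - REVOKE_HEADER_BYTES - tail) // record_bytes


def revoke_descriptor_blocks(freed_blocks, blocksize, is_64bit=True, has_csum=True):
    """Descriptor blocks needed to revoke `freed_blocks` blocks (rounded up)."""
    per_block = revoke_records_per_block(blocksize, is_64bit, has_csum)
    return -(-freed_blocks // per_block)  # ceiling division


def estimated_control_blocks(max_transaction_size):
    """The estimate from the thread, read as (max_transaction_size >> 5) + 32."""
    return (max_transaction_size >> 5) + 32


if __name__ == "__main__":
    print(revoke_records_per_block(1024))           # 125 (worst case)
    print(revoke_records_per_block(4096))           # 509 (the "about 500" case)
    print(revoke_descriptor_blocks(65535, 1024))    # 525 (one maximal extent)
    print(revoke_descriptor_blocks(1000000, 1024))  # 8000 (a million freed blocks)
```

Rounding up gives 525 rather than ~524 for a single 65535-block extent; either
way it is a few hundred blocks, whereas a million freed blocks in one
transaction is in the thousands.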