On Tue, Oct 27, 2020 at 7:29 AM Jan Kara <jack@xxxxxxx> wrote: > > On Mon 26-10-20 13:55:47, harshad shirwadkar wrote: > > > > + * > > > > + * The fast commit space at the end of above operations would look like this: > > > > + * [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL] > > > > + * |<--- Fast Commit 1 --->|<--- Fast Commit 2 ---->| > > > > + * > > > > + * Replay code should thus check for all the valid tails in the FC area. > > > > > > And one design question: Why do we record unlink of B here? I was kind of > > > hoping that fastcommit due to fsync(A) would record only operations related > > > to A. Because the way you wrote it, fast commit is inherently still a > > > filesystem-global operation requiring global ordering of metadata changes > > > with all the scalability bottlenecks current journalling code has. It's > > > faster by some factor due to more efficient packing of "small" changes not > > > fundamentally faster AFAICT... > > So given that fsync() for Ext4 traditionally resulted in syncing of > > all the dirty inodes / buffers. If we fsync() only the file in > > question, I'm worried that we may break some of the existing > > applications. In the earlier version of the series, I had a > > "soft_consistency" mode which did exactly that. It broke a bunch of > > xfstests that had this assumption. Also, in my tests I didn't see a > > big performance difference between these fast commits and the fast > > commits with soft consistency. Most probably, that's because the > > benchmarks perform a fsync on all the files and current fast commits > > give it a batching effect which soft consistency mode would fail to > > provide. > > > > But I'm not fixated on this, I think it's still good to have > > soft_consistency mode. Good thing is this doesn't affect the on-disk > > format. So, this is something that can be gradually added to Ext4. > > OK, I see. Maybe add a paragraph about this to fastcommit doc? I agree that > we can leave these optimizations for later, I was just wondering whether > there isn't some fundamental reason why global flush would be required and > I'm happy to hear that there isn't. Ack > > The advantage of soft_consistency as you call it would be IMO most seen if > there's relatively heavy non-fsync IO load in parallel with frequent fsyncs > of a tiny file. And such load is not infrequent in practice. I agree that > benchmarks like dbench are unlikely to benefit from soft_consistency since > all IO the benchmark does is in fact forced by fsync. > > I also think that with soft_consistency we could benefit (e.g. on SSD > storage) from having several fast-commit areas in the journal so multiple > fastcommits can run in parallel. But that's also for some later > experimentation... Yeah makes sense. > > > > > + > > > > + mutex_lock(&ei->i_fc_lock); > > > > + if (running_txn_tid == ei->i_sync_tid) { > > > > + update = true; > > > > + } else { > > > > + ext4_fc_reset_inode(inode); > > > > + ei->i_sync_tid = running_txn_tid; > > > > + } > > > > + ret = __fc_track_fn(inode, args, update); > > > > + mutex_unlock(&ei->i_fc_lock); > > > > + > > > > + if (!enqueue) > > > > + return ret; > > > > + > > > > + spin_lock(&sbi->s_fc_lock); > > > > + if (list_empty(&EXT4_I(inode)->i_fc_list)) > > > > + list_add_tail(&EXT4_I(inode)->i_fc_list, > > > > + (sbi->s_mount_state & EXT4_FC_COMMITTING) ? > > > > + &sbi->s_fc_q[FC_Q_STAGING] : > > > > + &sbi->s_fc_q[FC_Q_MAIN]); > > > > + spin_unlock(&sbi->s_fc_lock); > > > > > > OK, so how do you prevent inode from being freed while it is still on > > > i_fc_list? I don't see anything preventing that and it could cause nasty > > > use-after-free issues. Note that for similar reasons JBD2 uses external > > > separately allocated inode for jbd2_inode so that it can have separate > > > lifetime (related to transaction commits) from struct ext4_inode_info. > > So, if you see the function ext4_fc_del() above, it's called from > > ext4_clear_inode(). What ext4_fc_del() does is that, if the inode is > > not being committed, it just removes it from the list. If that inode > > was deleted, we have a separate dentry queue which will record the > > deletion of the inode, so we don't really need the struct > > ext4_inode_info for recording that on-disk. However, if the inode is > > being committed (this is figured out by checking the per inode > > COMMITTING state), ext4_fc_del() waits until the completion. > > But I don't think this quite works. Consider the following scenario: > > inode I gets modified in transaction T > you add I to FC list > > memory pressure reclaims I from memory > you remove I from FC list > > open(I) -> inode gets loaded to memory again. Not tracked in FC list. > fsync(I) -> nothing to do, FC list is empty > <crash> > > And 'I' now doesn't contain data in should because T didn't commit yet and > FC was empty. Hmmm, I see. This needs to get fixed. However, I'm a little confused here. On memory pressure, the call chain would be like: VFS->ext4_evict_inode() -> ext4_free_inode() -> ext4_clear_inode(). In ext4_clear_inode(), we free up the jbd2_inode as well. If that's the case, how does jbd2_inode survive the memory pressure where its corresponding VFS inode is freed up? Assuming I'm missing something, one option would be to track jbd2_inode in the FC list instead of ext4_inode_info? Would that take care of the problem? Another option would be to trigger a fast_commit from ext4_evict_inode if the inode being freed is on fc list. But I'm worried that would increase the latency of unlink operation. > > > > > + > > > > + return ret; > > > > +} > > > > + > > > > +struct __track_dentry_update_args { > > > > + struct dentry *dentry; > > > > + int op; > > > > +}; > > > > + > > > > +/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */ > > > > +static void ext4_fc_submit_bh(struct super_block *sb) > > > > +{ > > > > + int write_flags = REQ_SYNC; > > > > + struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh; > > > > + > > > > + if (test_opt(sb, BARRIER)) > > > > + write_flags |= REQ_FUA | REQ_PREFLUSH; > > > > > > Submitting each fastcommit buffer with REQ_FUA | REQ_PREFLUSH is > > > unnecessarily expensive (especially if there will be unrelated writes > > > happening to the filesystem while fastcommit is running). If nothing else, > > > it's enough to have REQ_PREFLUSH only once during the whole fastcommit to > > > flush out written back data blocks (plus journal device may be different > > > from the filesystem device so you need to be flushing the filesystem device > > > for this - see how the jbd2 commit code does this). > > > > > > Also REQ_FUA on each block may be overkill for devices that don't support > > > it natively (and thus REQ_FUA is simulated with full write cache pre and > > > post flush) - for such devices it would be better to just write out > > > fastcommit normally and then issue one cache flush. With careful > > > checksumming, block ID tagging and such, it should be safe against disk > > > reordering writes. But I guess we can leave this optimization as a TODO > > > item for later (but I think it would be good to design the on-disk format of > > > fastcommit blocks so that it does not rely on FUA writes). > > I see. The on disk format doesn't rely on FUA / PREFLUSH, I added it > > based on the observation that in most cases all the fast commit info > > was written in 1 block only. I didn't see much difference in the > > performance but I get your point. I'll add this as a TODO in the code > > for now. > > OK, the performance optimization can wait for later but the flushing of > proper device needs to be fixed soon - as I wrote above REQ_PREFLUSH is not > enough (and needed at all) when the journal device is different from the > filesystem device. Ack > > > > > +/* > > > > + * Complete a fast commit by writing tail tag. > > > > + * > > > > + * Writing tail tag marks the end of a fast commit. In order to guarantee > > > > + * atomicity, after writing tail tag, even if there's space remaining > > > > + * in the block, next commit shouldn't use it. That's why tail tag > > > > + * has the length as that of the remaining space on the block. > > > > + */ > > > > +static int ext4_fc_write_tail(struct super_block *sb, u32 crc) > > > > +{ > > > > + struct ext4_sb_info *sbi = EXT4_SB(sb); > > > > + struct ext4_fc_tl tl; > > > > + struct ext4_fc_tail tail; > > > > + int off, bsize = sbi->s_journal->j_blocksize; > > > > + u8 *dst; > > > > + > > > > + /* > > > > + * ext4_fc_reserve_space takes care of allocating an extra block if > > > > + * there's no enough space on this block for accommodating this tail. > > > > + */ > > > > + dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc); > > > > + if (!dst) > > > > + return -ENOSPC; > > > > + > > > > + off = sbi->s_fc_bytes % bsize; > > > > + > > > > + tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL); > > > > + tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail)); > > > > + sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize); > > > > + > > > > + ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc); > > > > + dst += sizeof(tl); > > > > + tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid); > > > > + ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc); > > > > + dst += sizeof(tail.fc_tid); > > > > + tail.fc_crc = cpu_to_le32(crc); > > > > + ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL); > > > > + > > > > + ext4_fc_submit_bh(sb); > > > > + > > > > + return 0; > > > > +} > > > > > > Is there a reason to pass CRC all around (so you have to have special > > > functions like ext4_fc_memcpy(), ext4_fc_memzero(), ...) instead of just > > > creating the whole block and then computing CRC in one go? > > > > > > In fact, as looking through the code, it seems to me it would be slightly > > > nicer layer separation and interface if JBD2 provided functions for storage > > > of data blobs and handled the details of space & block management, > > > checksums, writeout, on recovery verification of correctness (so it would > > > just provide back a stream of blobs for FS to replay). Just an idea for > > > consideration, the current interface isn't too bad and we can change it > > > later if we decide so. > > I designed this keeping DAX mode in mind where we would benefit if we > > don't use buffer heads and blocks. There is no block level CRC, but > > CRC covers all the tags either from the start or from the last tail > > tag (whichever comes first). This kind of CRC can span across > > multipleblocks or we could have multiple CRCs in one block. Passing > > CRC around helps us to compute CRC as we write tags to storage. In DAX > > mode, this would allow fast commit commits to be smaller than block > > size. DAX mode code isn't implemented completely yet, but I wanted to > > make sure that the design of on-disk format is consistent and > > efficient for both DAX and non-DAX modes. > > OK, I understand. Thanks for explanation! > > > > > + > > > > +/* > > > > + * Adds tag, length, value and updates CRC. Returns true if tlv was added. > > > > + * Returns false if there's not enough space. > > > > + */ > > > > + */ > > > > +static int ext4_fc_write_inode(struct inode *inode, u32 *crc) > > > > +{ > > > > + struct ext4_inode_info *ei = EXT4_I(inode); > > > > + int inode_len = EXT4_GOOD_OLD_INODE_SIZE; > > > > + int ret; > > > > + struct ext4_iloc iloc; > > > > + struct ext4_fc_inode fc_inode; > > > > + struct ext4_fc_tl tl; > > > > + u8 *dst; > > > > + > > > > + ret = ext4_get_inode_loc(inode, &iloc); > > > > + if (ret) > > > > + return ret; > > > > + > > > > + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) > > > > + inode_len += ei->i_extra_isize; > > > > + > > > > + fc_inode.fc_ino = cpu_to_le32(inode->i_ino); > > > > + tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE); > > > > + tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino)); > > > > + > > > > + dst = ext4_fc_reserve_space(inode->i_sb, > > > > + sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc); > > > > + if (!dst) > > > > + return -ECANCELED; > > > > + > > > > + if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc)) > > > > + return -ECANCELED; > > > > + dst += sizeof(tl); > > > > + if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc)) > > > > + return -ECANCELED; > > > > + dst += sizeof(fc_inode); > > > > + if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc), > > > > + inode_len, crc)) > > > > + return -ECANCELED; > > > > > > Isn't this racy? What guarantees the inode state you record here is a valid > > > one for the fastcommit? I mean this gets called at the time of fastcommit > > > (i.e., fsync), so a fastcommit code must record changes to all other > > > metadata that relate to the currently recorded inode state. But this isn't > > > serialized in any way (AFAICT) with on-going inode changes so how can > > > fastcommit code guarantee that? This is a similar case as a problem I > > > describe below... > > So we have ext4_fc_start_update(inode) / ext4_fc_stop_update(inode) > > which is called by all the operations that happen on an inode. If the > > inode in question is undergoing a fast commit, ext4_fc_start_update() > > will block. So that ensures that inode won't be modified once fast > > commit starts. So, in general, before doing any fast commit related > > operation, we'll first put the inode in committing state, that's the > > state of the inode which will be committed on-disk in fast commit. > > I see. See the case below for my comments. > > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/* > > > > + * Writes updated data ranges for the inode in question. Updates CRC. > > > > + * Returns 0 on success, error otherwise. > > > > + */ > > > > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc) > > > > +{ > > > > + ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size; > > > > + struct ext4_inode_info *ei = EXT4_I(inode); > > > > + struct ext4_map_blocks map; > > > > + struct ext4_fc_add_range fc_ext; > > > > + struct ext4_fc_del_range lrange; > > > > + struct ext4_extent *ex; > > > > + int ret; > > > > + > > > > + mutex_lock(&ei->i_fc_lock); > > > > + if (ei->i_fc_lblk_len == 0) { > > > > + mutex_unlock(&ei->i_fc_lock); > > > > + return 0; > > > > + } > > > > + old_blk_size = ei->i_fc_lblk_start; > > > > + new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1; > > > > + ei->i_fc_lblk_len = 0; > > > > + mutex_unlock(&ei->i_fc_lock); > > > > + > > > > + cur_lblk_off = old_blk_size; > > > > + jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n", > > > > + __func__, cur_lblk_off, new_blk_size, inode->i_ino); > > > > + > > > > + while (cur_lblk_off <= new_blk_size) { > > > > + map.m_lblk = cur_lblk_off; > > > > + map.m_len = new_blk_size - cur_lblk_off + 1; > > > > + ret = ext4_map_blocks(NULL, inode, &map, 0); > > > > + if (ret < 0) > > > > + return -ECANCELED; > > > > > > So isn't this actually racy with a risk of stale data exposure? Consider a > > > situation like: > > > > > > Task 1: Task 2: > > > pwrite(file, buf, 8192, 0) > > > punch(file, 0, 4096) > > > fsync(file) > > > writeout range 4096-8192 > > > fastcommit for inode range 0-8192 > > > pwrite(file, buf, 4096, 0) > > > ext4_map_blocks(file) > > > - reports that block at offset 0 is mapped so that is recorded in > > > fastcommit record. But data for that is not written so after a > > > crash we'd expose stale data in that block. > > > > > > Am I missing something? > > So the way this gets handled is before entering this function, the > > inode enters COMMITTING state (in ext4_fc_submit_inode_data_all > > function). Once in COMMITTING state, all the inodes on this inode get > > paused. Also, the commit path waits until all the ongoing updates on > > that inode are completed. Once they are completed, only then its data > > buffers are flushed and this ext4_map_blocks is called. So Task-2 here > > would have either completely finished or would wait until the end of > > this inode's commit. I realize that I should probably add more > > comments to make this more clearer in the code. But is handling it > > this way sufficient or am I missing any more cases? > > I see. In principle this should work. But I don't like that we have yet > another mechanism that needs to properly wrap inode changes to make > fastcommits work. And if we get it wrong somewhere, the breakage will be > almost impossible to notice until someone looses data after a power > failure. So it seems a bit fragile to me. Ack > > Ideally I think we would reuse the current transaction machinery for this > somehow (so that changes added through one transaction handle would behave > atomically wrt to fastcommits) but the details are not clear to me yet. I > need to think more about this... Yeah, I thought about that too. All we need to do is to atomically increment an "number of ongoing updates" counter on an inode, which could be done by existing ext4_journal_start()/stop() functions. However, the problem is that current ext4_journal_start()/stop() don't take inode as an argumen. I considered changing all the ext4_journal_start/stop calls but that would have inflated the size of this patch series which is already pretty big. But we can do that as a follow up cleanup. Does that sound reasonable? > > > > > + > > > > + if (map.m_len == 0) { > > > > + cur_lblk_off++; > > > > + continue; > > > > + } > > > > + > > > > @@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb, > > > > > > > > out: > > > > inode_unlock(inode); > > > > + ext4_fc_stop_update(inode); > > > > if (likely(ret > 0)) { > > > > iocb->ki_pos += ret; > > > > ret = generic_write_sync(iocb, ret); > > > > @@ -534,7 +536,9 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) > > > > goto out; > > > > } > > > > > > > > + ext4_fc_start_update(inode); > > > > ret = ext4_orphan_add(handle, inode); > > > > + ext4_fc_stop_update(inode); > > > > > > Why is here protected only the orphan addition? What about other changes > > > happening to the inode during direct write? > > This is the only change that is protected by handle in this function. > > What I'm trying to do here (and in other places) is that anything that > > happens between ext4_journal_start() and ext4_journal_stop() happens > > atomically. The way to guarantee that is to ensure that the same block > > is also surrounded by ext4_fc_start_update(inode) and > > ext4_fc_stop_update(inode). > > > > I also realized while looking at this comment is that we probably need > > a new TLV for adding orphan inode to the list? > > > > Also I don't see anything tracking inode changes due to writes through mmap? > > > How is that supposed to work? > > Right, I have missed those. I see that mmap function > > ext4_page_mkwrite() calls ext4_jbd2_inode_add_write that tells jbd2 > > what is the range that needs to be written for the inode in question. > > I guess I can just update that function to update inode's FC range as > > well? > > Yes, you need to add tracking of the page range handled in > ext4_page_mkwrite() to FC. Ack > > I've also realized that you probably need to disable fastcommits when data > journaling is enabled for the inode (probably just disable fastcommit > feature with data=jounral mount option, make inode ineligible if it has > 'journal data' flag set). Oh yes, thanks for pointing that out. I'll fix that too Thanks, Harshad > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR