Thanks Andreas and Ted for the review. Yeah, this makes sense. On Mon, Aug 12, 2019 at 9:04 AM Theodore Y. Ts'o <tytso@xxxxxxx> wrote: > > On Thu, Aug 08, 2019 at 08:45:45PM -0700, Harshad Shirwadkar wrote: > > This patch adds new helper APIs that ext4 needs for fast > > commits. These new fast commit APIs are used by subsequent fast commit > > patches to implement fast commits. Following new APIs are added: > > > > /* > > * Returns when either a full commit or a fast commit > > * completes > > */ > > int jbd2_fc_complete_commit(journal_tc *journal, tid_t tid, > > tid_t tid, tid_t subtid) > > I think there is an opportunity to do something more efficient. > > Right now, the ext4_fsync() calls this function, and the file system > can only do a "fast commit" if all of the modifications made to the > file system to date are "fast commit eligible". Otherwise, we have to > fall back to a normal, slow commit. > > We can make this decision on a much more granular level. Suppose that > so far during the life of the current transaction, inodes A, B, and C > have been modified. The modification to inode A is not fast commit > eligible (maybe the inode is deleted, or it is involved in a directory > rename, etc.). The modification to inode B is fast commit eligible, > but an fsync was not requested for it. And the modification to inode > C *is* fast commit eligble, *and* fsync() has been requested for it. > > We only need to write the information for inode C to the fast commit > area. The fact that inode A is not fast commit eligible isn't a > problem. It will get committed when the normal transaction closes, > perhaps when the 5 second commit transaction timer expires. And inode > B, even though its changes might be fast commit eligible, might > require writing a large number of data blocks if it were included in > the fast commit. So excluding inodes A and B from the fast commit, > and only writing the logical changes corresponding to the those made > to inode C, will allow a fast commit to take place. > > In order to do that, though, the ext4's fast commit machinery needs to > know which inode we actually need to do the fast commit for. And so > for that reason, it's actually probably better not to run the changes > through the commit thread. That makes it harder to plumb the file > system specific information through, and it also requires waking up > the commit thread and waiting for it to get scheduled. I see, so you mean each fsync() call will result in exactly one inode to be committed (the inode on which fsync was called), right? I agree this doesn't need to go through JBD2 but we need a mechanism to inform JBD2 about this fast commit since JBD2 maintains sub-transaction ID. JBD2 will in turn need to make sure that a subtid was allocated for such a fast commit and it was incremented once the fast commit was successful as well. > > Instead, ext4_fsync() could just call the fast commit machinery, and > the only thing we need to expose is a way for the fast commit > machinery to attempt to grab a mutex preventing the normal commit > thread from starting a normal commit. If it loses the race, and the > normal commit takes place before we manage to do the fast commit; then > we don't need to do any thing more. Otherwise the fast commit > machinery can do its thing, writing inode changes to the journal, and > once it is done, it can release the mutex and ext4 fsync can return. > > Does that make sense? Thanks for the suggestion, I will implement this in V3. > > - Ted