One month ago we reported that Tux3 is now able to beat tmpfs in at
least one benchmark load, a claim that generated considerable lively
commentary, some of it collegial:

   https://lkml.org/lkml/2013/5/7/742

Ted raised an interesting question about whether Tux3 is forever
destined to suffer from slow fsync as an inherent design tradeoff. To be
clear about it: that was never part of our plan. From early on, we
intended to rely on full filesystem sync in place of fsync in the
interest of correctness, then implement a properly optimized fsync
later, probably after mainline merge.

Elaborating our simple, rugged full-filesystem atomic commit strategy to
accommodate efficient fsync always seemed to be a relatively easy design
problem compared to the daunting task of adding strong consistency
properties to a filesystem primarily oriented towards writing out single
inodes. In the event, our intuition proved correct.

First, let me characterize the case where sync performs measurably worse
than a purpose-built fsync. If Tux3 has just one dirty file and does a
full filesystem sync, it performs well because only a few blocks need to
be written. However, if there are dirty inodes in cache that fsync could
bypass, sync may exhibit much higher latency than a well-targeted fsync.

The arguably artificial dbench benchmark load rewards such dirty inode
bypass richly by deleting most files while they are still retained in
cache, thus avoiding a large amount of disk traffic. The question of
whether this models any real world situation accurately is beside the
point: if Tux3 wants to win at pure, unadulterated dbench, it had better
not let fsync push other dirty inodes to disk.

We set the following goals for our new, improved fsync design:

  * Meet or beat Ext4 performance
  * Keep our strong consistency semantics (similar to Ext3 data=journal)
  * No performance regression for non-fsync updates
  * Simple and provably correct

As always, the "simple" requirement is the biggest challenge.
Various complex and finicky solutions suggest themselves. For example,
we could analyze block level dependencies and write out just those
blocks involved with a given dirty file. I can imagine such an approach
succeeding, but cannot imagine it being simple.

After a few days of worrying away at the issues, I hit on a simple,
effective idea. This was really just a matter of understanding our
existing design more deeply rather than any fundamental breakthrough.

(Note: we still do not intend to implement this right now, only design
it with a view to putting to rest doubts that were expressed about the
innate ability of Tux3 to handle fsync efficiently.)

Background

Tux3 normally divides episodes of filesystem update activity into chunks
called "deltas". Each delta includes all data, namespace and attribute
changes caused by some arbitrary set of syscalls. Each delta cleans all
dirty cache. This is perfectly efficient for many common filesystem
loads, but not all. Fsync is a good example of a case where we would
like to leave most dirty data sitting in cache while committing just the
dirty blocks of a single file. In other words, dirty data for an fsynced
file needs to "jump the queue" of delta commits in the interest of
efficiency.

Ext4 gets this kind of behavior more or less for free, because it is
designed to write out whichever inode core kernel tells it to, whenever
it is told to. In contrast, Tux3 ignores core kernel's plan for which
inodes should be written (taking the view that core kernel just does not
understand the problem very well) and always writes out all of them. Our
immediate project is to relax that behavior to support writing out just
one of them.

The first big step forward came from noticing that it is easy to move
dirty file data forward from the "current" delta (in process of being
modified by syscalls) to the previous delta (scheduled for commit to
disk).
For each dirty page, Tux3 conceptually maintains two dirty bits, one for
the current and one for the previous delta. The "forking" mechanism
protects any cache page that is dirty in the previous delta from being
modified by subsequent syscalls. It is easy to walk the list of dirty
pages of a given inode and change the state of all pages dirty in the
current delta to be dirty in the previous delta instead. I call this the
"promote" operation, which effectively takes a point-in-time snapshot of
inode data. In our parlance, fsync promotes dirty data from the current
to the previous delta.

The remainder of the effort to turn this promote tool into an efficient
fsync concerns feeding such promoted pages out to disk efficiently.

Subdelta concept rejected

My first stab at the fsync writeout problem revolved around a "subdelta"
concept. A subdelta would be an intermediate commit including just some
of the dirty inodes belonging to a given delta. With this, fsync could
interrupt the backend as it marshals some big delta for writeout, commit
the fsynced inode, then resume committing the remainder of the delta.

This idea was vetoed by Hirofumi because it breaks our strong ordering
guarantee that we have grown rather fond of:

   If transaction A completes before transaction B begins, then
   transaction B will never be durably recorded unless transaction A
   is too.

I agree: stronger consistency semantics are user-friendly; we have
become accustomed to them; we will hold that sacred as far as we can. So
back to the drawing board.

Parallel log streams

The winning idea occurred practically simultaneously to both of us. We
will elaborate our log design slightly to support parallel log streams.
When the backend is in the middle of committing a big delta to disk, it
may interrupt its work to start a new log chain that includes only a
single inode with its modified attributes. When this completes, the
backend resumes committing the remainder of the big delta.
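
To make the promote operation and page forking described earlier in
this section a little more concrete, here is a toy model in Python. It
is only a sketch under loud assumptions, not Tux3 code: the names Page,
ToyInode, write and promote are hypothetical, the two dirty bits are
plain booleans, and disk writeout is not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical cache page carrying the two conceptual dirty bits."""
    index: int
    data: bytes = b""
    dirty_current: bool = False   # dirty in the delta syscalls are modifying
    dirty_previous: bool = False  # dirty in the delta scheduled for commit


class ToyInode:
    """Toy stand-in for an inode's page cache (assumption, not Tux3 code)."""

    def __init__(self):
        self.cache = {}  # page index -> Page currently visible to syscalls

    def write(self, index, data):
        page = self.cache.get(index)
        if page is None:
            page = Page(index)
            self.cache[index] = page
        elif page.dirty_previous:
            # "Fork": the page frozen for the previous delta is left
            # untouched; syscalls get a fresh copy to modify instead.
            page = Page(index, page.data)
            self.cache[index] = page
        page.data = data
        page.dirty_current = True

    def promote(self):
        """Move every page dirty in the current delta to the previous one.

        Returns the promoted pages, which together form a point-in-time
        snapshot of the inode data.
        """
        promoted = []
        for index in sorted(self.cache):
            page = self.cache[index]
            if page.dirty_current:
                page.dirty_current = False
                page.dirty_previous = True
                promoted.append(page)
        return promoted


inode = ToyInode()
inode.write(0, b"old")
snapshot = inode.promote()   # what an fsync would hand to writeout
inode.write(0, b"new")       # forks page 0; the snapshot keeps b"old"
```

The point of the sketch is only that a later write cannot disturb a
promoted page, because the write lands on a forked copy.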

For this to work properly there must not be any dependencies between the
two log streams. In the case of regular files the log only needs to
track block allocations and record new positions of any redirected
blocks. To avoid colliding with changes belonging to the in-flight delta
we will not modify the inode table immediately but rely on the fsync log
to hold the updates until the in-flight delta completes. Then the
backend will update the inode table to reflect the synced state in the
following delta.

Various details need to be addressed to make this work.

If an fsynced inode has already been marshalled and is on its way to
disk in a previous delta then we need to wait for that delta to complete
before committing the fsync. Doing otherwise would require significant
design elaboration, and such a situation should be rare compared to the
case where an fsynced inode is dirty in the previous delta but not yet
marshalled (e.g., updated just before a delta transition and fsynced
just after the transition).

We must take care that an fsynced inode is not evicted before the inode
table is updated to point to the new file data.

Log replay becomes slightly more complex. Specifically, updated data
attributes from fsync logs need to be entered into the inode table. This
is a small amount of additional work to be done only on unexpected
restart and has no efficiency impact.

If an inode is newly created we must ensure that it is linked from some
directory so that it will not leak on unexpected restart. For now we
will fall back to sync for that situation, which guarantees that
directories are consistent with the inode table.

That seems to be about it. No doubt we will discover a few unanticipated
details during implementation, but there is nothing more complex than or
dissimilar from work we have already completed successfully. When we do
get around to it, we should end up performing just fine in pure,
unadulterated dbench, for what that is worth.
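
The special cases above amount to a small decision made at fsync time.
The Python sketch below is one possible reading of them, not a design
commitment: FsyncPlan, InodeState and plan_fsync are hypothetical names,
and the order in which the two special cases are checked is my own
assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FsyncPlan(Enum):
    """Hypothetical outcomes for a single fsync request."""
    FULL_SYNC = auto()            # fall back to whole-filesystem sync
    WAIT_THEN_FSYNC_LOG = auto()  # wait for the in-flight delta, then log
    FSYNC_LOG = auto()            # promote and commit via its own log stream


@dataclass
class InodeState:
    """Hypothetical summary of what the backend knows about the inode."""
    newly_created: bool         # not yet linked from a committed directory
    marshalled_in_flight: bool  # already marshalled in the in-flight delta


def plan_fsync(state: InodeState) -> FsyncPlan:
    # A newly created inode must not leak on unexpected restart, so for
    # now it takes the full sync path, which keeps directories consistent
    # with the inode table.
    if state.newly_created:
        return FsyncPlan.FULL_SYNC
    # An inode already on its way to disk in the in-flight delta: wait for
    # that delta to complete before committing the fsync.
    if state.marshalled_in_flight:
        return FsyncPlan.WAIT_THEN_FSYNC_LOG
    # Common case: promote the dirty pages and commit them in a separate
    # log stream, deferring the inode table update to the following delta.
    return FsyncPlan.FSYNC_LOG


plan = plan_fsync(InodeState(newly_created=False, marshalled_in_flight=False))
```

Only the last branch is the fast path; the other two trade latency for
keeping the design simple, as argued above.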

Directory fsync

We will not attempt to optimize directory fsync for the time being, not
because of any inherent design limitation, but because we need to get
back to work on the remaining issues that actually impact base
functionality and currently stand in the way of Tux3 being practically
usable.

I will just mention that directory fsync involves consistency between
directory data and the inode table, a topologically more complex problem
than consistency between the inode table and regular file data. We are
content to let this interesting problem simmer quietly on the back
burner for the time being.

Shoutout to Samsung

I would like to thank Samsung warmly for providing me with a working
situation where I can concentrate fully on Tux3 development as a member
of the new Samsung Open Source Group.

Regards,

Daniel