Tux3 Report: How fast can we fail? Tux3 now has a preliminary out of space handling algorithm. This might sound like a small thing, but in fact handling out of space reliably and efficiently is really hard, especially for Tux3. We developed an original solution with unusually low overhead in the common case, and simple enough to prove correct. Reliability seems good so far. But not to keep anyone in suspense: Tux3 does not fail very fast, but it fails very reliably. We like to think that Tux3 is better at succeeding than failing. We identified the following quality metrics for this algorithm: 1) Never fails to detect out of space in the front end. 2) Always fills a volume to 100% before reporting out of space. 3) Allows rm, rmdir and truncate even when a volume is full. 4) Writing to a nearly full volume is not excessively slow. 5) Overhead is insignificant when a volume is far from full. Like every filesystem that does delayed allocation, Tux3 must guess how much media space will be needed to commit any update it accepts into cache. It must not guess low or the commit may fail and lose data. This is especially tricky for Tux3 because it does not track individual updates, but instead, partitions updates atomically into delta groups and commits each delta as an atomic unit. A single delta can be as large as writable cache, including thousands of individual updates. This delta scheme ensures perfect namespace, metadata and data consistency without complex tracking of relationships between thousands of cache objects, and also does delayed allocation about as well as it can be done. Given these benefits, it is not too hard to accept some extra pain in out of space accounting. Speaking of accounting, we borrow some of that terminology to talk about the problem. Each delta has a "budget" and computes a "balance" that declines each time a transaction "cost" is "charged" against it. The budget is all of free space, plus some space that belongs to the current disk image that we know will be released soon, and less a reserve for taking care of certain backend duties. When the balance goes negative, the transaction backs out its cost, triggers a delta transition, and tries again. This has the effect of shrinking the delta size as a volume approaches full. When the delta budget finally shrinks to less than the transaction cost, the update fails with ENOSPC. This is where the "how fast can we fail" question comes up. If our guess at cost is way higher than actual blocks consumed, deltas take a long time to shrink. Overestimating transaction cost by a factor of ten can trigger over a hundred deltas before failing. Fortunately, deltas are pretty fast, so we only keep the user waiting for a second or so before delivering the bad news. We also slow down measurably, but not horribly, when getting close to full. Ext4 by contrast flies along at full speed right until it fills the volume, and stops on a dime at exactly at 100% full. I don't think that Tux3 will ever be as good at failing as that, but we will try to get close. Before I get into how Tux3's out of space behavior stacks up against other filesystems, there are some interesting details to touch on about how we go about things. Tux3's front/back arrangement is lockless, which is great for performance but turns into a problem when front and back need to cooperate about something like free space accounting. If we were willing to add a spinlock between front and back this would be easy, but don't want to do that. Not only are we jealously protective of our lockless design, but if our fast path suddenly became slower because of adding essential functionality we might need to revise some posted benchmark results. Better that we should do it right and get our accounting almost for free. The world of lockless algorithms is an arcane one indeed, just ask Paul McKenney about that. The solution we came up with needs just two atomic adds per transaction, and we will eventually turn one of those into a per-cpu counter. As mentioned above, a frontend transaction backs out its cost when the delta balance goes negative, so from the backend's point of view, the balance is going up and down unpredictably all the time. Delta transition can happen at any time, and somehow, the backend must assign the new front delta its budget exactly at transition. Meanwhile, the front delta balance is still going up and down unpredictably. See the problem? The issue is, delta transition is truly asynchronous. We can't change that short of adding locks with the contention and stalls that go along with them. Fortunately, one consequence of delta transition is that the total cost charged to the delta instantly becomes stable when the front delta becomes the back delta. Volume free space is also stable because only the backend accesses it. The backend can easily measure the actual space consumed by the back delta: it is the difference between free space before and after flushing to media. Updating the front delta budget is easy because only the backend changes it, but updating the front delta balance is much harder because the front delta is busy changing it. If we get this wrong, the resulting slight discrepancies between budget, balance and charged costs would mean that somebody, somewhere will hit out of space in the middle of a commit and end up sticking pins into a voodoo doll that looks like us. A solution was found that only took a few lines of code and some pencil pushing. The backend knows what the front delta balance must have been exactly at transition, because it knows the amount charged to the back delta, and it knows the original budget. It can therefore deduce how much the front balance should have increased exactly at transition (it must always increase) so it adds that difference atomically to the front delta budget. This has exactly the same effect as setting the balance atomically at transition time, if that were possible, which it is not. This technique is airtight, and the whole algorithm ends up costing less than a hundred nanoseconds per transaction.[1] This is a good thing because each page of a Tux3 write is a separate transaction, so any significant overhead would stick out like a sore thumb. Accounting cost estimates properly and stopping when actually out of space is just the core of the algorithm. We must feed that core with good, provable worst case cost estimates. To get an initial idea of whether the algorithm works, we just plugged in some magic numbers, and lo and behold, suddenly we where not running out of space in the backend any more. But to do the job properly we need to consider things like the file index btree depth, because just plugging in a number large enough to handle the deepest possible btree would slow down our failure path way too much. The best way to account for btree depth is to make it disappear entirely by removing the btree updates from the delta commit path. We already do that for bitmaps, which is a good thing because our bitmaps are just blocks that live in a normal file. Multiplying our worst case by the maximum number of bitmaps that could possibly be affected, and then multiplying that by the worst case change to the bitmap metadata, including its data btree, its inode, and the inode table btree, would be a real horror. Instead, we log all changes that affect the bitmap and only update the bitmaps periodically at unify cycles. A Tux3 filesystem is consistent whether or not we unify, so if space becomes critically tight the backend can just disable the unify. The only bad effect is that the log chain can grow and make replay take longer, but that growth is limited by the fact that there is not much space left for more log blocks. If we did not have this nice way of making bitmap overhead disappear, we would not be anywhere close to a respectable implementation today. Actually, we weren't even thinking about out of space accounting when we developed this design element, we were actually trying to get rid of the overhead of updating bitmaps per delta. Which worked well and is a significant part of the reason why we can outrun Ext4 while having a very similar structure. The benefit for space accounting dropped out just by dumb luck. The same technique we use for hiding bitmap update cost works just as well for btree metadata. Eventually, we will move btree leaf redirecting from the delta flush to the unify flush. That will speed it up by coalescing some index block writes and also make it vanish from the transaction cost estimate, saving frontend CPU and speeding up the failure path. What's not to like? It is on the list of things to do. Today, I refactored the budgeting algorithm to skip the cost estimate if a page is already dirty, which tightened up the estimate by a factor of four or so and made things run smoother. There will be more incremental improvements as time goes by. For example, we currently overestimate the cost of a rewrite because we would need to go poking around in btrees to do that more accurately. Fixing that will be quite a bit of work, but somebody will probably do it, sometime. Now the fun part: performance and bugs. Being anxious to know where Tux3 stands with respect to the usual suspects, I ran some tests and found that Ext4 is amazingly good at this, while XFS and Btrfs have some serious issues. Details below. Tux3 slows down when approaching a full state, hopefully not too much. To quantify that, here is what happens with a 200 MB dd to a loopback mounted file on tmpfs: Volume Size Run Time No check at all: 1500 MB 0.306s Far from full: 1500 MB 0.318s Getting full: 30 MB 0.386s Just over full: 20 MB 0.624s The slowdown used to be a lot more before I improved the cost estimate for write earlier today. Here is how we compare to other filesystems: Far from full Just over full tux3: 0.303s 0.468s ext4: 0.399s 0.400s xfs: 0.293s 0.326s btrfs: 0.499s 0.531s (20 mb dd to ramdisk) XFS eeks out a narrow win on straight up dd to the ramdisk, good job. The gap widens when hitting the failure path, but again, not as much as it did earlier today. I do most of these no space tests on a ramdisk (actually, a loopback mount on tmpfs) because it is easy to fill up. To show that the ramdisk results are not wildly different from a real disk, here we see that the pattern is largely unchanged: 20 MB dd to a real disk tux3: 1.568s ext4: 1.523s xfs: 1.466s btrfs: 2.347s XFS holds its dd lead on a real hard disk. We definitely need to learn its trick. Next we look at something with a bit more meat: unzipping the Git source to multiple directories. Timings on ramdisk are the interesting ones, because the volume approaches full on the longer test. 10x to ram 40x to ram 10x to hdd 100x to hdd tux3: 2.251s 8.344s 2.671s 21.686s ext4: 2.343s 7.923s 3.165s 32.080s xfs: 2.682s 10.562s 11.870s 123.435s btrfs: 3.744s 15.825s 3.749s 72.405s Tux3 is the fastest when not close to full, but Ext4 takes a slight lead when close to full. Yesterday, that lead was much wider, and one day we would be pleased to tie Ext4, which is really exemplary at this. The hard disk times are there because they happened to be easy to get, and it is interesting to see how much XFS and BTRFS are suffering on traditional rust, XFS being the worst by far at 5.7 times slower than Tux3. The next one is a crash test: repeatedly untar a tarball until it smashes into the wall, and see how long it takes to quit with an error. Tar is nice for this because its failure handling is so awful: instead of exiting when on the first ENOSPC, it keeps banging at the full disk until it has failed on each and every file in its archive. First I dd the volume until just before full, then throw a tarball at it. Time to fail when tar hits disk full: tux3: 0.563s ext4: 0.084s xfs: 0.116s btrfs: 0.460s We respectfully concede that Ext4 is the king of fail and Tux3 is worst. However, we only need to be good enough on this one, with less than a second being a very crude definition of good enough. The next one is something I ran into when I was testing out of space detection with rewrites. This uses the "blurt" program at the end of this post to do 40K writes from 1000 tasks in parallel, 1K at a time, using the bash one liner: for ((j=1;j<10;j++)); do \ for ((i=1;i<10;i++)); do \ echo step $j:$i 1>&2 && mkdir -p fs/$i && \ ~/blurt fs/$i/f 40 1000 || break 2; \ done; \ done Tux3: 4.136s (28% full when done) Ext4: 5.780s (31% full when done) XFS: 79.063s (!) Btrfs: 15.489s (fails with out of space when 30% full) Blurt is a minor revision of my fsync load generator without the fsync, and with an error exit on disk full. The intent of the outer loop is to do rewrites with a thousand tasks in parallel, and see if out of space accounting is accurate. XFS and Btrfs both embarrassed themselves horribly. XFS falls off a performance cliff that makes it 19 times slower than Tux3, and Btrfs hits ENOSPC when only 30% full according to df, or 47% full if you prefer to believe its own df command: Data, single: total=728.00MiB, used=342.25MiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=65.00MiB, used=5.97MiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B It seems that Btrfs still has not put its epic ENOSPC nightmare behind it. I fervently hope that such a fate does not await Tux3, which hope would appear to be well on its way to being born out. XFS should not do such bizarre things after 23 years of development, while being billed as a mature, enterprise grade filesystem. It simply is not there yet. Ext4 is exemplary in terms of reliability, and Tux3 has been been really good through this round of torture tests, though I will not claim that it is properly hardened just yet. I know it isn't. We don't have any open bugs, but that is probably because we only have two users. But Tux3 is remarkably solid for the number of man years that have gone into it. Maybe Tux3 really will be ready for the enterprise before XFS is. In all of these tests, Tux3, Ext4 and XFS managed to fill up their volumes to exactly 100%. Tux3 actually has a 100 block emergency reserve that it never fills, and wastes a few more blocks if the last transaction does not exactly use up its budget, but apparently that still falls within the df utility's definition of 100%. Btrfs never gets this right: full for it tends to range from 96% to 98%, and sometimes is much lower, like 28%. It has its own definition of disk full in its own utility, but that does not seem to be very accurate either. This part of Btrfs needs major work. Even at this early stage, Tux3 is much better than that. One thing we can all rejoice over: nobody ever hit out of space while trying to commit. At least, nobody ever admitted it. And nobody oopsed, or asserted, though XFS did exhibit some denial of service issues where the filesystem was unusable for tens of seconds. Once again, in the full disclosure department: there are some known holes remaining in Tux3's out of space handling. The unify suspend algorithm is not yet implemented, without which we cannot guarantee that out of space will never happen in commit. With the simple expedient of a 100 block emergency reserve, it has never yet happened, but no doubt some as yet untested load can make it happen. ENOSPC handling for mmap is not yet implemented. Cost estimates for namespace operations are too crude and ignore btree depth. Cost estimates could be tighter than they are, to give better performance and report disk full more promptly. The emergency reserve should be set each delta according to delta budget. Big truncates need to be split over multiple commits so they always free more blocks than they consume before commit. That is about it. On the whole, I am really happy with the way this has worked out. Well, that is that for today. Tux3 now has decent out of space handling that appears to work well and has a good strong theoretical basis. It needs more work, but is no longer a reason to block Tux3 from merging, if it ever really was. Regards, Daniel [1] Overhead of an uncontended bus locked add is about 6 nanoseconds on my i5, and about ten times higher when contended. /* * Blurt v0.0 * * A trivial multitasking filesystem load generator * * Daniel Phillips, June 2015 * * to build: c99 -Wall blurt.c -oblurt * to run: blurt <basename> <steps> <tasks> */ #include <unistd.h> #include <stdlib.h> #include <stdio.h> #include <fcntl.h> #include <sys/wait.h> #include <errno.h> #include <sys/types.h> #include <sys/stat.h> enum { chunk = 1024, sync = 0 }; char text[chunk] = { "hello world!\n" }; int main(int argc, const char *argv[]) { const char *basename = argc < 1 ? "foo" : argv[1]; char name[100]; int steps = argc < 3 ? 1 : atoi(argv[2]); int tasks = argc < 4 ? 1 : atoi(argv[3]); int fd, status, errors = 0; for (int t = 0; t < tasks; t++) { snprintf(name, sizeof name, "%s%i", basename, t); if (!fork()) goto child; } for (int t = 0; t < tasks; t++) { wait(&status); if (WIFEXITED(status) && WEXITSTATUS(status)) errors++; } return !!errors; child: fd = creat(name, S_IRWXU); if (fd == -1) goto fail1; for (int i = 0; i < steps; i++) { int ret = write(fd, text, sizeof text); if (ret == -1) goto fail2; if (sync) fsync(fd); } return 0; fail1: perror("create failed"); return 1; fail2: perror("write failed"); return 1; } -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html