On Fri, Aug 12, 2011 at 02:05:30PM +1000, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> It only does that if the pattern of writes are such that keeping the
> preallocation around for longer periods of time will reduce
> potential fragmentation.

That can only be false. Here is an example that I saw *just now*:

I have a process that takes a directory with jpg files (in this case, all
around 64kb in size) and losslessly recompresses them. It works by reading
a file, writing it under another name (with a single write() call) and
using rename to replace the original file *iff* it got smaller. The typical
reduction is 5%. No allocsize option is used, and the kernel is 2.6.39.
(The per-file I/O pattern is sketched in the code further down in this
mail.)

This workload would obviously benefit most from having no preallocation
anywhere, i.e. from having all files tightly packed.

Here is a "du" on a big directory where this process is running, taken
every few minutes:

   6439892 .
   6439888 .
   6620168 .
   6633156 .
   6697588 .
   6729092 .
   6755808 .
   6852192 .
   6816632 .
   6250824 .

Instead of decreasing, the size kept increasing until just before the last
du. That is where I did echo 3 >drop_caches, which presumably cleared all
those inodes that had not been used for an hour and would never have been
written again.

Since XFS obviously keeps quite a bit of preallocation here (or some other
magic, but what?), and this workload definitely does not benefit from any
preallocation (because XFS has perfect knowledge of the file size at every
point in time), what you say is simply not true: the files will not be
touched anymore, neither read nor written, so preallocation is just bad.

Also, bickering about extra fragmentation caused by xfs_fsr when running it
daily instead of weekly is weird - the amount of external fragmentation
caused by preallocation must be overwhelming with large amounts of ram.

> Indeed, it's not a NFS specific optimisation, but it is one that
> directly benefits NFS server IO patterns.

I'd say it's a grotesque deoptimisation, and it definitely doesn't work the
way you describe it. In fact, it can't work the way you describe it,
because XFS would have to be clairvoyant to make it work. How else would it
know that keeping the preallocation indefinitely will be useful?

In any case, XFS treats the typical "open file, write file, close file,
never touch it again" pattern as something that somehow needs
preallocation. I can see how that helps NFS, but in all other cases, this
is simply a bug.

> about). Given that inodes for log files will almost always remain in
> memory as they are regularly referenced, it seems like the right
> solution to that problem, too...

Given that, with enough ram, everything stays in ram, most of which is not
log files, this behaviour is simply broken.

> FWIW, you make it sound like "benchmark-improved" is a bad thing.

If it costs regular performance or eats diskspace like mad, it clearly is a
bad thing, yes. Benchmark performance is irrelevant; what counts is actual
performance. If the two coincide, that's great. This is clearly not the
case here, of course.

> However, I don't hear you complaining about the delayed logging
> optimisations at all.

I wouldn't be surprised if the new xfs_fsr crashes were caused by these
changes, actually. But yes, otherwise they are great - I do keep external
journals for most of my filesystems, and the write load on these has
decreased by a factor of 10-100 in some metadata-heavy cases (such as lots
of renames). Of course, XFS is still way behind other filesystems in
managing journal devices.
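To make the recompressor workload from the top of this mail concrete, here
is roughly the per-file pattern it generates. This is only a minimal
sketch, not the actual tool - the function name, the ".new" temporary
naming and the error handling are made up for illustration:

    /* Sketch of the recompressor's per-file I/O pattern (illustration
     * only). The caller has already read and recompressed the file into
     * buf/newlen; oldlen is the size of the original file.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    replace_if_smaller (const char *path, const void *buf, size_t newlen, size_t oldlen)
    {
      if (newlen >= oldlen)
        return 0; /* not smaller - keep the original untouched */

      char tmp[4096];
      snprintf (tmp, sizeof tmp, "%s.new", path);

      int fd = open (tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0)
        return -1;

      /* one single write () for the whole file, then close - the final
       * size is known to the filesystem at this point */
      ssize_t written = write (fd, buf, newlen);
      int res = close (fd);

      if (written != (ssize_t)newlen || res != 0)
        {
          unlink (tmp);
          return -1;
        }

      /* replace the original *iff* the recompressed file is smaller */
      return rename (tmp, path);
    }

The point is that the full, final file contents are handed over in one
write() immediately followed by close() and rename() - there is nothing for
the filesystem to guess about future appends.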
> I'll let you in on a dirty little secret: I tested delayed logging on
> nothing but benchmarks - it is -entirely- a "benchmark-improved" class
> optimisation.

As a good engineer, one would expect you to actually think about whether
this optimisation is useful outside of some benchmark setup, too. I am sure
you did that - how else would you have come up with the idea in the first
place?

> But despite how delayed logging was developed and optimised, it

The difference to the new preallocation is that delayed logging is not
obviously a bad algorithm. The preallocation strategy of wasting some
diskspace for every file that has been opened in the last 24 hours or so
(depending on ram), however, is *obviously* wrong, regardless of what your
microbenchmarks say. What it does is basically introduce big cluster
allocation, just like good old FAT, except that people with more RAM get
punished more.

> different workloads. That's because the benchmarks I use accurately
> model the workloads that cause the problem that needs to be solved.

That means you will optimise a single problem at the expense of every other
workload. This indeed seems to be the case here. Good engineering would
make sure that typical use cases that were not the "problem" before don't
get unduly affected.

Apart from potentially helping with NFS in your benchmarks, I cannot see
any positive aspect of this change. However, I keep hitting its bad
aspects. It seems that with this change, XFS will degrade much faster due
to the insane amounts of useless preallocation tied to files that have been
closed and will never be written again, which is by far *most* files. In
the example above, roughly 32kb (+-50%) of overallocation is associated
with each file (a small program to measure this per file is sketched
further down). FAT, here we come :(

Don't get me wrong, it is great that XFS is now optimised for slow log
writing over NFS, and this surely is important for some people, but it
comes at an enormous cost to every other workload. A benchmark that
measures the additional fragmentation introduced by all those 32kb blocks
over some months would be nice.

> Similarly, the "NFS optimisation" in a significant and measurable
> reduction in fragmentation on NFS-exported XFS filesystems across a

It's the dirtiest hack I have seen in a filesystem: an optimisation that
only helps with the extremely bad access patterns of NFS (and only
sometimes), forced on even for non-NFS filesystems, where it only causes
negative effects. It's a typical case of "a is broken, so apply some hack
to b", while good engineering dictates "a is broken, let's fix a".

Again: your rationale is that NFS doesn't give you enough information about
whether a file is in use, because it doesn't keep it open. This leads you
to consider all files whose inode is cached in memory as being "in use" for
unlimited amounts of time. Sure, those idiot applications such as cp or mv
cannot be trusted. Surely, when mv'ing a file, this means the file will be
appended to later - because if not, XFS wouldn't keep the preallocation.

> Yes, there have been regressions caused by both changes (though

The whole thing is a regression - slow appender processes that close the
file after each write basically don't exist. close is an extremely good
hint that a file has been finalised, and because NFS doesn't have the
notion of a close (NFSv4 has it, to some extent), that hint is suddenly
ignored for all applications. This is simply a completely, utterly, totally
broken algorithm.
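As an aside, the per-file overallocation I keep referring to can be
measured directly instead of being inferred from du. A minimal sketch (not
tied to XFS in any way): compare the apparent size with the allocated size;
st_blocks is defined in 512-byte units, and for sparse files the difference
goes negative:

    /* Print apparent vs. allocated size for each file given on the
     * command line, plus the difference (space allocated beyond the
     * apparent size, e.g. preallocation past EOF).
     */
    #include <stdio.h>
    #include <sys/stat.h>

    int
    main (int argc, char *argv[])
    {
      for (int i = 1; i < argc; i++)
        {
          struct stat st;

          if (stat (argv[i], &st))
            {
              perror (argv[i]);
              continue;
            }

          long long apparent  = (long long)st.st_size;
          long long allocated = (long long)st.st_blocks * 512;

          printf ("%s: %lld apparent, %lld allocated, %lld extra\n",
                  argv[i], apparent, allocated, allocated - apparent);
        }

      return 0;
    }

Run over the directory from the du example above, this shows per file how
much space is currently allocated beyond what the file actually contains.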
> regressions does not take anything away from the significant
> real-world improvements that are the result of the changes.

I gave plenty of real-world examples where these changes are nothing but
bad. I have yet to see a *single* real-world example where this isn't the
case. All you achieved is that now every workload behaves as badly as NFS,
lots and lots of disk space is wasted, and an enormous amount of external
fragmentation is introduced. And that's just with an 8GB box - I can only
imagine for how many months files will be considered "in use" just because
a box has enough ram to cache their inodes.

> http://code.google.com/p/ioapps/wiki/ioreplay

Since "cp" and "mv" already cause problems in current versions of XFS, I
guess we are far from needing those. It seems XFS has been so fundamentally
deoptimised w.r.t. preallocation now that there are much bigger fish to
catch than freenet. Basically anything that creates files, even when it's
just a single open/write/close, is now affected.

--
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@xxxxxxxxxx
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs