On Thu, Jul 31, 2014 at 09:03:32AM -0400, Theodore Ts'o wrote:
> On Wed, Jul 30, 2014 at 10:49:28AM -0400, Benjamin LaHaise wrote:
> > This seems like a pretty serious regression relative to ext3.  Why can't
> > ext4's mballoc pick better block groups to attempt allocating from based
> > on the free block counts in the block group summaries?
>
> Allocation algorithms are *always* tradeoffs.  So I don't think
> regression is necessarily the best way to think about things.
> Unfortunately, your use case really doesn't work well with how we have
> set things up with ext4 now.  Sure, if your specific use case is
> one where you are mostly allocating 8MB files, then we can add a
> special case where if you are allocating 32768 blocks, we should
> search for block groups that have 32768 blocks free.  And if that's
> what you are asking for, we can certainly do that.

The workload targets allocating 8MB files, mostly because that is a size large enough to perform fairly decently, but small enough not to incur too much latency on each write.  Depending on other dynamics in the system, it's possible to end up with files as small as 8KB or as large as 30MB.  The target file size can certainly be tuned up or down if that makes life easier for the filesystem.

> The problem is that free block counts don't work well in general.  If
> I see that the free block count is 2048 blocks, that doesn't tell me
> whether the free blocks are a single contiguous chunk of 2048 blocks, or
> 2048 single-block fragments.  (We do actually pay attention to free
> blocks, by the way, but in a nuanced way.)
>
> If the only goal you have is fast block allocation after fail over,
> you can always use the VFAT block allocation --- i.e., use the first
> free block in the file system.  Unfortunately, it will result in a
> very badly fragmented file system, as Microsoft and its users
> discovered.

Fragmentation is not a huge concern here, and is certainly more acceptable than an increase in the time it takes to perform an allocation.
Time to perform a write is hugely important, as the system will have more and more data coming in as time progresses.  At present, under load, the system has to be able to sustain 550MB/s of writes to disk for an extended period of time.  With 8MB writes, that means we can't tolerate very many multi-second writes.  I am of the opinion that expecting the filesystem to sustain 550MB/s is reasonable, given that the underlying disk array can perform sequential reads/writes at more than 1GB/s and the RAID controller has a reasonably large amount (512MB) of write-back cache.

The use-case is essentially using the filesystem as an elastic buffer for queues of messages.  Under normal conditions all of the data is received and then sent out within a fairly short period of time, but sometimes there are receivers that are slow or offline, which means that the in-memory buffers fill up and need to be spilled out to disk.  Many users of the system cycle through this behaviour over the course of a single day: they receive a lot of data during business hours, then process and drain it over the course of the evening.  Since everything is cyclic, and reads are slow anyway, long-term fragmentation of the filesystem isn't a significant concern.

> I'm sure there are things we could do that would make things better for
> your workload (if you want to tell us in great detail exactly what the
> file/block allocation patterns are for your workload), and perhaps
> even better in general, but the challenge is making sure we don't
> regress for other workloads --- and this includes long-term
> fragmentation resistance.  This is a hard problem.  Kvetching about
> how it's so horrible just for you isn't really helpful for solving it.

I'm kvetching mostly because the mballoc code is hugely complicated and easy to break (and oh, have I broken it).  If you can point me in the right direction for changes that you think might improve mballoc, I'll certainly give them a try.
Hopefully the above descriptions of the workload make it a bit easier to understand what's going on in the big picture.  I also don't think this problem is limited to my particular use-case: any ext4 filesystem that is 7TB or larger and gets up into the 80-90% utilization range will probably start exhibiting this problem.

I do wonder if it is at all possible to fix this issue without replacing the bitmaps used to track free space with something better suited to the task on such large filesystems.  Pulling in hundreds of megabytes of bitmap blocks is always going to hurt.  Fixing that would mean either compressing the bitmaps into something that can be read more quickly, or wholesale replacement of the bitmaps with something else.

> (BTW, one of the problems is that ext4_mb_normalize_request caps large
> allocations so that we use the same goal length for multiple passes as
> we search for good block groups.  We might want to use the original
> goal length --- so long as it is less than 32768 blocks --- for the
> first scan, or at least for goal lengths which are powers of two.  So
> if your application is regularly allocating files which are exactly
> 8MB, there are probably some optimizations that we could apply.  But
> if they aren't exactly 8MB, life gets a bit trickier.)

And sadly, they're not always 8MB.  If there's anything I can do on the application side to make the filesystem's life easier, I would happily do so, but we're already doing fallocate() and making the writes in a single write() operation.  There's not much more low-hanging fruit that I can think of.

Cheers,

-ben

> Regards,
>
> 						- Ted

--
"Thought is the essence of where you are now."
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html