On 10/21/2016 03:50 PM, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
>> To me, most of things you're talking about is highly dependent on access
>> pattern generated by userspace:
>>
>> - we may want to allocate huge pages from byte 1 if we know that file
>>   will grow;
>
> delayed allocation takes care of that. We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page should be allocated. This code
> is aware of sparse files, sparse writes, etc.

OK, so somebody does a write() of 1 byte.  We can delay the underlying
block allocation for a long time, but we can *not* delay the memory
allocation.  We've got to decide before the write() returns.

How does delayed allocation help with that decision?

I guess we could (always?) allocate small pages up front, and then only
bother promoting them once the FS delayed-allocation code kicks in and
is *also* giving us underlying large allocations.  That punts the logic
to the filesystem, which is a bit counterintuitive, but it seems
relatively sane.

>>> As such, there is no way we should be considering different
>>> interfaces and methods for configuring the /same functionality/ just
>>> because DAX is enabled or not. It's the /same decision/ that needs
>>> to be made, and the filesystem knows an awful lot more about whether
>>> huge pages can be used efficiently at the time of access than just
>>> about any other actor you can name....
>>
>> I'm not convinced that filesystem is in better position to see access
>> patterns than mm for page cache. It's not all about on-disk layout.
>
> Spoken like a true mm developer. IO performance is all about IO
> patterns, and the primary contributor to bad IO patterns is bad
> filesystem allocation patterns.... :P

For writes, I think you have a good point.
Managing a horribly fragmented file with larger pages and eating the
associated write magnification that comes along with it seems like a
recipe for disaster.

But isn't some level of disconnection between the page cache and the
underlying IO patterns a *good* thing?  Once we've gone to the trouble
of bringing some (potentially very fragmented) data into the page cache,
why _not_ manage it in a lower-overhead way if we can?  For read-only
data it seems like a no-brainer that we'd want things in as large a
management unit as we can get.

IOW, why let the underlying block allocation layout hamstring how the
memory is managed?
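
To make the "small pages first, promote later" idea above a bit more
concrete, here is a rough userspace sketch of the decision, not kernel
code.  Everything in it is an assumption made up for illustration: the
struct, the function names, the 4 KiB / 2 MiB sizes, and the rule that
read-only ranges are promoted regardless of the on-disk layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed sizes for illustration: 4 KiB base pages, 2 MiB huge pages. */
#define SMALL_PAGE_SIZE ((size_t)4096)
#define HUGE_PAGE_SIZE  ((size_t)2 * 1024 * 1024)

/*
 * Hypothetical summary of one huge-page-aligned range of a file.
 * This is not a real kernel structure.
 */
struct range_hint {
	size_t fs_contig_bytes;	/* contiguous bytes the FS has allocated
				 * (or reserved via delalloc) for the range */
	bool read_only;		/* range is only ever read, never written */
};

/*
 * write() path: the memory has to exist before write() returns, so
 * start with a small page and defer the huge-page decision.
 */
size_t initial_page_size(void)
{
	return SMALL_PAGE_SIZE;
}

/*
 * Later promotion check (hypothetical policy):
 *  - read-only data is promoted regardless of block layout;
 *  - written data is promoted only once the filesystem is also
 *    backing the whole range with one large allocation.
 */
bool should_promote(const struct range_hint *hint)
{
	assert(hint != NULL);

	if (hint->read_only)
		return true;

	return hint->fs_contig_bytes >= HUGE_PAGE_SIZE;
}
```

With this policy a 1-byte write() always gets a small page, a written
range backed by only 4 KiB of contiguous blocks stays in small pages,
and the same range becomes a promotion candidate once the filesystem
reports a full 2 MiB contiguous allocation behind it.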