On Mon, Oct 24, 2016 at 01:34:53PM -0700, Dave Hansen wrote:
> On 10/21/2016 03:50 PM, Dave Chinner wrote:
> > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> >> To me, most of things you're talking about is highly dependent on access
> >> pattern generated by userspace:
> >>
> >> - we may want to allocate huge pages from byte 1 if we know that file
> >>   will grow;
> >
> > delayed allocation takes care of that. We use a growing speculative
> > delalloc size that kicks in at specific sizes and can be used
> > directly to determine if a large page should be allocated. This code
> > is aware of sparse files, sparse writes, etc.
>
> OK, so somebody does a write() of 1 byte. We can delay the underlying
> block allocation for a long time, but we can *not* delay the memory
> allocation. We've got to decide before the write() returns.
>
> How does delayed allocation help with that decision?

You (and Kirill) have likely misunderstood what I'm saying, based
on the fact you are thinking I'm talking about delayed allocation of
page cache pages.

I'm not. The current code does this for a sequential write:

write(off, len)
  for each PAGE_SIZE chunk
    grab page cache page
      alloc + insert if not found
    map block to page
      get_block(off, PAGE_SIZE);
        filesystem does allocation        <<<<<<<<<<<<
      update bufferhead attached to page
    write data into page

Essentially, delayed block allocation occurs inside the get_block()
call, completely hidden from the page cache and IO layers.

In XFS, we do special stuff based on the offset being written to,
the size of the existing extent we are adding to, etc. to
speculatively /reserve/ more blocks than the write actually needs
and keep them in a delalloc extent that extends beyond EOF.

The next page is grabbed, get_block is called again, and we find
we've already got a delalloc reservation for that file offset, so we
return immediately.
And we repeat that until the delalloc extent runs out. When it runs
out, we allocate a bigger delalloc extent beyond EOF so that as the
file grows we do fewer and fewer, but larger, delayed allocation
reservations. These grow out to /gigabytes/ if there is that much
data to write.

i.e. the filesystem clearly knows when using large pages would be
appropriate, but because block allocation happens after the page
cache page has already been allocated, it can't influence that
allocation at all.

Here's what the new fs/iomap.c code does for the same sequential
write:

write(off, len)
  iomap_begin(off, len)
    filesystem does delayed allocation of at least len bytes <<<<<<
    returns an iomap with single mapping
  iomap_apply()
    for each PAGE_SIZE chunk
      grab page cache page
        alloc + insert if not found
      map iomap to page
      write data into page
  iomap_end()

Hence if the write was for 1 byte into an empty file, we'd get a
single block extent back, which would match the single PAGE_SIZE
page cache allocation required. If the app is doing sequential 1
byte IO, and we're at offset 2MB and the filesystem returns a 2MB
delalloc extent (i.e. it extends 2MB beyond EOF), we know we're
getting sequential write IO and we could use a 2MB page in the page
cache for this.

Similarly, if the app is doing large IO - say 16MB at a time, we'll
get at least a 16MB delalloc extent returned from the filesystem,
and we know we could map that quickly and easily to 8 x 2MB huge
pages in the page cache.

But if we get random 4k writes into a sparse file, the filesystem
will be allocating single blocks, so the iomaps being returned would
be for a single block, and we know that PAGE_SIZE pages would be
best to allocate.

Or we could have a 2MB extent size hint set, so every iomap
returned from the filesystem is going to be 2MB aligned and sized,
in which case we could always map the returned iomap to a huge
page rather than worry about IO sizes and incoming IO patterns.

Now do you see the difference?
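
To make the cases above concrete, here is a toy Python model of the
decision (not kernel code). Iomap, pages_for_extent() and the 4k/2MB
sizes are illustrative assumptions, not the fs/iomap.c interfaces. It
only sketches how an extent that is known before the page cache lookup
could be carved into huge pages and PAGE_SIZE pages.

```python
# Toy model only: names and sizes below are illustrative assumptions,
# not the fs/iomap.c interfaces.
from dataclasses import dataclass

PAGE_SIZE = 4096                    # assumed base page size
HUGE_PAGE_SIZE = 2 * 1024 * 1024    # assumed 2MB huge page size


@dataclass
class Iomap:
    """Hypothetical stand-in for the extent a filesystem hands back."""
    offset: int   # file offset where the extent starts, in bytes
    length: int   # extent length, in bytes


def pages_for_extent(iomap):
    """Return a list of (file_offset, page_size) pairs covering the extent.

    A huge page is used wherever a whole, 2MB-aligned range lies inside
    the extent; everything else falls back to PAGE_SIZE pages.
    """
    pages = []
    # Round the start down to a base page boundary.
    pos = iomap.offset - (iomap.offset % PAGE_SIZE)
    end = iomap.offset + iomap.length
    while pos < end:
        fits_huge = (
            pos % HUGE_PAGE_SIZE == 0
            and pos >= iomap.offset
            and pos + HUGE_PAGE_SIZE <= end
        )
        size = HUGE_PAGE_SIZE if fits_huge else PAGE_SIZE
        pages.append((pos, size))
        pos += size
    return pages
```

In this model a single-block extent at offset 0 yields one PAGE_SIZE
page, a 2MB extent at offset 2MB yields one 2MB page, and a 16MB extent
at offset 0 yields 8 x 2MB pages.
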
The filesystem now has the ability to optimise allocation based on
user application IO patterns rather than trying to guess from single
page size block mapping requests.

And because this happens before we look at the page cache, we can
use that information to influence what we do with the page cache.

> >> I'm not convinced that filesystem is in better position to see access
> >> patterns than mm for page cache. It's not all about on-disk layout.
> >
> > Spoken like a true mm developer. IO performance is all about IO
> > patterns, and the primary contributor to bad IO patterns is bad
> > filesystem allocation patterns.... :P
>
> For writes, I think you have a good point. Managing a horribly
> fragmented file with larger pages and eating the associated write
> magnification that comes along with it seems like a recipe for disaster.
>
> But, Isn't some level of disconnection between the page cache and the
> underlying IO patterns a *good* thing?

Up to a point. Buffered IO only hides underlying IO patterns whilst
readahead and writebehind are effective. They rely on the filesystem
laying out files sequentially for good read rates, and on writes
hitting the disk sequentially for good write rates. The moment
either of these two things breaks down, the latency induced by the
underlying IO patterns will immediately dominate buffered IO
performance characteristics.

The filesystem allocation and mapping routines will do a far better
job if we can feed them extents rather than 4k pages at a time.

Kirill is struggling to work out how to get huge pages into the page
cache readahead model, and this is something that we'll solve by
adding a readahead implementation to the iomap code. i.e. iomap the
extent, decide on how much to read ahead and the page size to use,
populate the page cache, build the bios, off we go....

The problem with the existing code is that the page cache breaks IO
up into discrete PAGE_SIZE mapping requests.
For all the fs knows it could be 1 byte IOs or it could be 1GB IOs
being done. We can do (and already have done) an awful lot of
optimisation if we know exactly what IO is being thrown at the fs.

Effectively we are moving the IO path mapping methods from a
single-page-and-block iteration model to a
single-extent-multiple-page iteration model. It's a different
mechanism, but it's a lot more efficient for moving data around than
what we have now...

> Once we've gone to the trouble
> of bringing some (potentially very fragmented) data into the page cache,
> why _not_ manage it in a lower-overhead way if we can? For read-only
> data it seems like a no-brainer that we'd want things in as large of a
> management unit as we can get.
>
> IOW, why let the underlying block allocation layout hamstring how the
> memory is managed?

We're not changing how memory is managed at all. All we are doing is
changing how we populate the page cache....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx