[adding linux-nvme and linux-block for opinions on the critical-page-first
idea in the second and third paragraphs below]

On Wed, Feb 20, 2019 at 07:07:29AM -0700, William Kucharski wrote:
>
> > On Feb 20, 2019, at 6:44 AM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > That interface would need to have some hint from the VFS as to what
> > range of file offsets it's looking for, and which page is the critical
> > one.  Maybe that's as simple as passing in pgoff and order, where pgoff
> > is not necessarily aligned to 1<<order.  Or maybe we want to explicitly
> > pass in start, end, critical.
>
> The order is especially important, as I think it's vital that the FS can
> tell the difference between a caller wanting 2M in PAGESIZE pages
> (something that could be satisfied by taking multiple trips through the
> existing readahead) or needing to transfer ALL the content for a 2M page
> as the fault can't be satisfied until the operation is complete.

There's an open question here (at least in my mind) whether it's worth
transferring the critical page first and creating a temporary PTE mapping
for just that one page, then filling in the other 511 pages around it and
replacing it with a PMD-sized mapping.  We've had similar discussions
around this with zeroing freshly-allocated PMD pages, but I'm not aware
of anyone showing any numbers.  The only reason this might be a win is
that we wouldn't have to flush remote CPUs when replacing the PTE mapping
with a PMD mapping, because they would both map to the same page.

It might be a complete loss because IO systems are generally set up to
work well with large contiguous IOs rather than returning a page here,
12 pages there and then 499 pages there.  To a certain extent we fixed
that in NVMe; where SCSI required transferring bytes in order across the
wire, an NVMe device is provided with a list of pages and can transfer
bytes in whatever way makes most sense for it.  What NVMe doesn't have
is a way for the host to tell the controller "Here's a 2MB sized I/O;
bytes 40960 to 45056 are most important to me; please give me a
completion event once those bytes are valid and then another completion
event once the entire I/O is finished".

I have no idea if hardware designers would be interested in adding that
kind of complexity, but this is why we also have I/O people at the same
meeting, so we can get these kinds of whole-stack discussions going.

> It also won't be long before reading 1G at a time to map PUD-sized
> pages becomes more important, plus the need to support various sizes
> in-between for architectures like ARM that support them (see the
> non-standard size THP discussion for more on that.)

The critical-page-first notion becomes even more interesting at these
larger sizes.  If a memory system is capable of, say, 40GB/s, it can
only handle 40 1GB page faults per second, and each individual page
fault takes 25ms.  That's rotating rust latencies ;-)

> I'm also hoping the conference would have enough "mixer" time that MM
> folks can have a nice discussion with the FS folks to get their input -
> or at the very least these mail threads will get that ball rolling.

Yes, there are both joint sessions (sometimes plenary with all three
streams, sometimes two streams) and plenty of time allocated to
inter-session discussions.  There are usually substantial on-site meal
and coffee breaks during which many important unscheduled discussions
take place.
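
To make the interface question quoted at the top a little more concrete,
here's roughly how I picture the two variants relating to each other.
None of these names exist anywhere in the tree; it's just a userspace toy
showing that "pgoff + order" and "start, end, critical" carry the same
information:

/*
 * Purely illustrative -- not a proposal for actual structure or
 * function names, just the shape of the hint.
 */
#include <stdio.h>

typedef unsigned long pgoff_t;

/* Hypothetical hint the VFS could hand to the filesystem. */
struct ra_hint {
	pgoff_t start;		/* first page offset in the range */
	pgoff_t end;		/* one past the last page offset */
	pgoff_t critical;	/* the page the fault actually needs */
};

/*
 * Derive the same information from the "pgoff + order" variant,
 * where pgoff is not necessarily aligned to 1 << order.
 */
static struct ra_hint hint_from_pgoff_order(pgoff_t pgoff, unsigned int order)
{
	struct ra_hint h;

	h.start    = pgoff & ~((1UL << order) - 1);	/* round down */
	h.end      = h.start + (1UL << order);
	h.critical = pgoff;
	return h;
}

int main(void)
{
	/* e.g. a fault on page offset 0x20a within a PMD-sized (order-9) range */
	struct ra_hint h = hint_from_pgoff_order(0x20a, 9);

	printf("start %lu end %lu critical %lu\n", h.start, h.end, h.critical);
	return 0;
}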
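
And for the critical-page-first ordering itself, the sequence I have in
mind is something like the following.  Again, every function here is an
invented stand-in, not kernel code; the page counts just match the
"a page here, 12 pages there and then 499 pages there" split above:

#include <stdio.h>

#define PAGES_PER_PMD	512UL

/* stand-in for transferring a contiguous run of pages */
static void read_pages(unsigned long from, unsigned long to)
{
	printf("read pages %lu..%lu (%lu pages)\n", from, to, to - from + 1);
}

/* stand-in for the temporary single-page PTE mapping */
static void map_single_pte(unsigned long idx)
{
	printf("PTE-map page %lu; fault can return now\n", idx);
}

/* stand-in for replacing the PTE with a PMD-sized mapping */
static void upgrade_to_pmd(void)
{
	printf("replace PTE with PMD mapping (no remote TLB flush needed;\n"
	       "both mappings point at the same page)\n");
}

int main(void)
{
	unsigned long critical = 12;	/* faulting page within the 512 */

	read_pages(critical, critical);		/* the critical page first */
	map_single_pte(critical);
	if (critical > 0)
		read_pages(0, critical - 1);	/* the 12 pages before it */
	read_pages(critical + 1, PAGES_PER_PMD - 1);	/* the other 499 */
	upgrade_to_pmd();
	return 0;
}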
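
Finally, the back-of-envelope arithmetic behind the 25ms number, with a
2MB PMD fault thrown in for comparison.  This assumes 40GB/s of bandwidth
and that the fault can't complete until the whole page has transferred:

#include <stdio.h>

int main(void)
{
	double bw_gbps = 40.0;				/* assumed bandwidth, GB/s */
	double sizes_mb[] = { 2.0, 1024.0 };		/* PMD (2MB) and PUD (1GB) */
	const char *names[] = { "2MB (PMD)", "1GB (PUD)" };

	for (int i = 0; i < 2; i++) {
		double latency_ms = sizes_mb[i] / (bw_gbps * 1024.0) * 1000.0;

		printf("%-10s %8.3f ms per fault, %.0f faults/s max\n",
		       names[i], latency_ms, 1000.0 / latency_ms);
	}
	return 0;	/* prints ~0.049 ms / 20480 faults/s and 25 ms / 40 faults/s */
}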