On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > > We currently only have one ->map_pages() callback, and it's
> > > > filemap_map_pages(). It only needs to sleep in one place -- to allocate
> > > > a PTE table. I think that can be allocated ahead of time if needed.
> > >
> > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > > function is under rcu_read_lock(). It uses pre-allocated page table.
> > > See do_fault_around().
> >
> > Oh, thank you! That makes the ->map_pages() optimisation already workable
> > with no changes.
>
> I've been thinking about this some more, and we have a bit of a tough time
> allocating page table entries while holding the RCU read lock. There's
> no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.
>
> Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
> with prealloc_pte). Allocate them before taking the RCU lock to walk
> the VMA tree. This will be a bit of reordering as we currently take
> the mmap_sem, walk the VMA tree, then walk the page tables once we know
> we have a good VMA. I don't see a problem with doing that, but others
> may differ.

I expect that preallocating all these page tables just in case would have
a measurable performance impact. The current code only preallocates a PTE
page table if it sees pmd_none().

We could first check whether this branch of the tree is present, but I'm
not sure how efficient that can be. And we would still need to protect
these page tables from being freed from under us.

> Option 2: Add a memalloc_nowait_save/restore API to go along
> with nofs and noio. That way, we can take the RCU read lock, call
> memalloc_nowait_save(), and walk the VMA tree and the page tables in
> the current order. There's an increased chance of memory allocation of
> page tables failing, so we'll have to risk that and do a retry with the
> reference count held on the VMA if we need to sleep to allocate memory.
>
> Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> functions.

I think this is the most reasonable way. If we are low on memory, latency
is not the top priority.

> Option 4: Variant of 2 where we make taking the RCU read lock magically
> set the nowait bit, or we have the page allocator check the RCU preempt
> depth. I don't particularly like this one, particularly since the
> preempt depth is not knowable in most kernel configurations.
>
> Other thoughts on this?

--
Kirill A. Shutemov
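
For readers following the thread, the scoped save/restore idea in Option 2
and its retry fallback can be modelled in plain userspace C. This is only a
rough sketch under stated assumptions, not kernel code: memalloc_nowait_save()
and memalloc_nowait_restore() are the names proposed above and are assumed
here to behave like the existing nofs/noio scope helpers, while TASK_NOWAIT,
task_flags, memory_tight, table_alloc() and fault_alloc_table() are
hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-task bit, standing in for a PF_MEMALLOC_*-style flag. */
#define TASK_NOWAIT 0x1u

/* Userspace stand-in for current->flags (single-threaded model). */
static unsigned int task_flags;

/* Model knob: when true, only an allocation that may sleep can succeed. */
static bool memory_tight;

/* Counts slow-path retries, for illustration only. */
static int slow_path_taken;

/* Assumed semantics: enter a no-sleep scope, return the previous state. */
static unsigned int memalloc_nowait_save(void)
{
	unsigned int old = task_flags & TASK_NOWAIT;

	task_flags |= TASK_NOWAIT;
	return old;
}

/* Assumed semantics: leave the scope, restoring the caller's state. */
static void memalloc_nowait_restore(unsigned int old)
{
	assert(old == 0 || old == TASK_NOWAIT);
	task_flags = (task_flags & ~TASK_NOWAIT) | old;
}

/* Stand-in for a p??_alloc() helper that takes no GFP flags. */
static void *table_alloc(void)
{
	/* Inside a nowait scope we must fail rather than sleep. */
	if (memory_tight && (task_flags & TASK_NOWAIT))
		return NULL;
	return calloc(1, 64);
}

/* Try without sleeping first; retry where sleeping is allowed. */
static void *fault_alloc_table(void)
{
	unsigned int old;
	void *table;

	/* Fast path: the kernel would hold rcu_read_lock() around this. */
	old = memalloc_nowait_save();
	table = table_alloc();
	memalloc_nowait_restore(old);

	if (table)
		return table;

	/* Slow path: the kernel would hold a VMA reference here instead. */
	slow_path_taken++;
	return table_alloc();
}
```

The save call returns the previous state so that scopes nest: an inner
restore leaves the bit set when an outer scope is still active. The cost of
this approach is visible in the slow path, which is only taken when the
no-sleep attempt fails.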