On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > We currently only have one ->map_pages() callback, and it's
> > > filemap_map_pages(). It only needs to sleep in one place -- to allocate
> > > a PTE table. I think that can be allocated ahead of time if needed.
> >
> > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > function is under rcu_read_lock(). It uses pre-allocated page table.
> > See do_fault_around().
>
> Oh, thank you! That makes the ->map_pages() optimisation already workable
> with no changes.

I've been thinking about this some more, and we have a bit of a tough
time allocating page tables while holding the RCU read lock. The
p??_alloc() functions take no GFP flags, so we can't specify GFP_NOWAIT.

Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
with prealloc_pte). Allocate them before taking the RCU lock to walk
the VMA tree. This will be a bit of reordering as we currently take
the mmap_sem, walk the VMA tree, then walk the page tables once we know
we have a good VMA. I don't see a problem with doing that, but others
may differ.

Option 2: Add a memalloc_nowait_save/restore API to go along with nofs
and noio. That way, we can take the RCU read lock, call
memalloc_nowait_save(), and walk the VMA tree and the page tables in
the current order. Page table allocations are more likely to fail this
way, so we'll have to accept that risk and, if we need to sleep to
allocate memory, retry with a reference count held on the VMA.

Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
functions.

Option 4: Variant of 2 where we make taking the RCU read lock magically
set the nowait bit, or we have the page allocator check the RCU preempt
depth. I don't particularly like this one, especially since the preempt
depth is not knowable in most kernel configurations.

Other thoughts on this?
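
To make Option 2 a bit more concrete, here is a rough userspace model of
the scoped-flag idea, patterned on how memalloc_nofs_save() and
memalloc_nofs_restore() work today. It is only a sketch, not kernel code:
memalloc_nowait_save()/restore() do not exist, and PF_MEMALLOC_NOWAIT,
GFP_CAN_SLEEP, current_task, memory_tight and the toy_* helpers are all
names made up for illustration. The RCU lock and the VMA refcount appear
only as comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative values only; these are not the kernel's definitions. */
#define GFP_CAN_SLEEP      0x1u  /* stands in for __GFP_DIRECT_RECLAIM */
#define PF_MEMALLOC_NOWAIT 0x1u  /* hypothetical per-task flag */

struct task { unsigned int flags; };
static struct task current_task;  /* stands in for 'current' */

/* Enter a no-sleep allocation scope; return the previous state of the bit. */
static unsigned int memalloc_nowait_save(void)
{
	unsigned int old = current_task.flags & PF_MEMALLOC_NOWAIT;

	current_task.flags |= PF_MEMALLOC_NOWAIT;
	return old;
}

/* Leave the scope, putting the bit back to what save() returned. */
static void memalloc_nowait_restore(unsigned int old)
{
	assert((old & ~PF_MEMALLOC_NOWAIT) == 0);
	current_task.flags = (current_task.flags & ~PF_MEMALLOC_NOWAIT) | old;
}

/* Strip the "may sleep" bit while inside a nowait scope. */
static unsigned int effective_gfp(unsigned int gfp)
{
	if (current_task.flags & PF_MEMALLOC_NOWAIT)
		gfp &= ~GFP_CAN_SLEEP;
	return gfp;
}

/* When true, an allocation only succeeds if it is allowed to sleep. */
static bool memory_tight;

/* Toy page table allocator: the caller passes no GFP flags it controls. */
static void *toy_pte_alloc(void)
{
	unsigned int gfp = effective_gfp(GFP_CAN_SLEEP);

	if (memory_tight && !(gfp & GFP_CAN_SLEEP))
		return NULL;  /* would have to sleep, so fail instead */
	return malloc(64);
}

/* Returns 0 on success, -1 on failure; *retried reports the slow path. */
static int toy_handle_fault(bool *retried)
{
	unsigned int saved;
	void *pte;

	*retried = false;

	/* rcu_read_lock() would be taken here. */
	saved = memalloc_nowait_save();
	pte = toy_pte_alloc();
	memalloc_nowait_restore(saved);
	/* rcu_read_unlock() would be here. */

	if (!pte) {
		/* Slow path: take a VMA reference, then allow sleeping. */
		*retried = true;
		pte = toy_pte_alloc();
	}
	if (!pte)
		return -1;

	free(pte);
	return 0;
}
```

The point of the save/restore pair is that p??_alloc() keeps its current
signature: the nowait behaviour comes from task state rather than from a
new argument, which is the difference between Option 2 and Option 3.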