On Thu, 25 Feb 2021 20:58:20 +0000 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> In order to walk the page tables without the mmap semaphore, it must
> be possible to prevent them from being freed and reused (eg if munmap()
> races with viewing /proc/$pid/smaps).
>
> There is various commentary within the mm on how to prevent this. One way
> is to disable interrupts, relying on that to block rcu_sched or IPIs.
> I don't think the RT people are terribly happy about reading a proc file
> disabling interrupts, and it doesn't work for architectures that free
> page tables directly instead of batching them into an rcu_sched (because
> the IPI may not be sent to this CPU if the task has never run on it).
>
> See "Fast GUP" in mm/gup.c
>
> Ideally, I'd like rcu_read_lock() to delay page table reuse. This is
> close to trivial for architectures which use entire pages or multiple
> pages for levels of their page tables, as we can use the rcu_head embedded
> in struct page to queue the page for RCU.
>
> s390 and powerpc are the only two architectures I know of that have
> levels of their page table that are smaller than their PAGE_SIZE.
> I'd like to discuss options. There may be a complicated scheme that
> allows partial pages to be freed via RCU, but I have something simpler
> in mind. For powerpc in particular, it can have a PAGE_SIZE of 64kB
> and then the MMU wants to see 4kB entries in the PMD. I suggest that
> instead of allocating each 4kB entry individually, we allocate a 64kB
> page and fill in 16 consecutive PMDs. This could cost a bit more memory
> (although if you've asked for a CONFIG_PAGE_SIZE of 64kB, you presumably
> don't care too much about it), but it'll make future page faults cheaper
> (as the PMDs will already be present, assuming you have good locality
> of reference).
>
> I'd like to hear better ideas than this.
Some background on the situation for s390: the architecture defines an 8-bit pagetable index, so we have 256 entries in a 2 KB pagetable, but PAGE_SIZE is 4 KB. pte_alloc(_one) will use alloc_page() to allocate a full 4 KB page, and then do some housekeeping to maintain a per-mm list of such 4 KB pages, each of which contains one or two 2 KB pagetable fragments. This is also the reason why pgtable_t on s390 does not point to the struct page of the (4 KB) page containing a 2 KB pagetable fragment, but rather to the 2 KB pagetable itself.

I see at least two issues with using the rcu_head embedded in the struct page (of a 4 KB page):

1) There might be two 2 KB pagetables present in that 4 KB page, and the rcu_head would affect both. Not sure if this would really be a problem, because we already have a similar situation with the split ptlock embedded in struct page, which also might lock two 2 KB pagetables, i.e. more than necessary. It is still far less "over-locking" than using mm->page_table_lock, and e.g. the move_pte() code takes care to avoid a deadlock if the src and dst ptlocks happen to be on the same page. So a similar "over-locking" might also be possible and acceptable for the rcu_head approach, but I do not really understand whether that could have deadlocks or other unwanted side-effects.

2) The "housekeeping" of our 2 KB pagetable fragments uses page->lru to maintain the per-mm list. It also (mis)uses page->_refcount to mark which 2 KB half is used/free, but that should not be an issue, I guess. Using page->lru will be an issue though. IIUC, page->rcu_head overlays page->lru, so using page->rcu_head for pagetable pages on s390 would conflict with our page->lru usage for such pagetable pages. I do not really see how that could be fixed; maybe we could find and re-use other struct page members for our 2 KB fragment list. Also, for kvm, there seem to be even more users of page->lru for pagetable pages, in arch/s390/mm/gmap.c.
Not sure though if those would actually also affect "regular" pagetable walks, or if they are somehow independent. But if we found some new list home for the 2 KB fragments, that could probably also be used for the gmap stuff.