On Thu, 2018-03-08 at 14:34 -0600, Gratian Crisan wrote: > Hi all, > > We are seeing kernel page faults happening on module loads with certain > drivers like the i915 video driver[1]. This was initially discovered on > a 4.9 PREEMPT_RT kernel. It takes 5 days on average to reproduce using a > simple reboot loop test. Looking at the code paths involved I believe > the issue is still present in the latest vanilla kernel. > > Some relevant points are: > > * x86_64 CPU: Intel Atom E3940 > > * CONFIG_HUGETLBFS is not set (which also gates CONFIG_HUGETLB_PAGE) > > Based on function traces I was able to gather the sequence of events is: > > 1. Driver starts a ioremap operation for a region that is PMD_SIZE in > size (or PUD_SIZE). > > 2. The ioremap() operation is preempted while it's in the middle of > setting up the page mappings: > ioremap_page_range->...->ioremap_pmd_range->pmd_set_huge <<preempted>> > > 3. Unrelated tasks run. Traces also include some cross core scheduling > IPI calls. > > 4. Driver resumes execution finishes the ioremap operation and tries to > access the newly mapped IO region. This triggers a vmalloc fault. > > 5. The vmalloc_fault() function hits a kernel page fault when trying to > dereference a non-existent *pte_ref. > > The reason this happens is the code paths called from ioremap_page_range() > make different assumptions about when a large page (pud/pmd) mapping can be > used versus the code paths in vmalloc_fault(). > > Using the PMD sized ioremap case as an example (the PUD case is similar): > ioremap_pmd_range() calls ioremap_pmd_enabled() which is gated by > CONFIG_HAVE_ARCH_HUGE_VMAP. On x86_64 this will return true unless the > "nohugeiomap" kernel boot parameter is passed in. > > On the other hand, in the rare case when a page fault happens in the > ioremap'ed region, vmalloc_fault() calls the pmd_huge() function to check > if a PMD page is marked huge or if it should go on and get a reference to > the PTE. However pmd_huge() is conditionally compiled based on the user > configured CONFIG_HUGETLB_PAGE selected by CONFIG_HUGETLBFS. If the > CONFIG_HUGETLBFS option is not enabled pmd_huge() is always defined to be > 0. > > The end result is an OOPS in vmalloc_fault() when the non-existent pte_ref > is dereferenced because the test for pmd_huge() failed. > > Commit f4eafd8bcd52 ("x86/mm: Fix vmalloc_fault() to handle large pages > properly") attempted to fix the mismatch between ioremap() and > vmalloc_fault() with regards to huge page handling but it missed this use > case. > > I am working on a simpler reproducing case however so far I've been > unsuccessful in re-creating the conditions that trigger the vmalloc fault > in the first place. Adding explicit scheduling points in > ioremap_pmd_range/pmd_set_huge doesn't seem to be sufficient. Ideas > appreciated. > > Any thoughts on what a correct fix would look like? Should the ioremap > code paths respect the HUGETLBFS config or would it be better for the > vmalloc fault code paths to match the tests used in ioremap and not rely > on the HUGETLBFS option being enabled? Thanks for the report and analysis! I believe pud_large() and pmd_large() should have been used here. I will try to reproduce the issue and verify the fix. -Toshi