Re: write fault on dax mapping and usage of set_pte_at.

"Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> · Thu, 21 Feb 2019 19:11:14 +0530

On 2/21/19 5:42 PM, Jan Kara wrote:
Hi Aneesh,

On Thu 21-02-19 12:52:39, Aneesh Kumar K.V wrote:
We found this while testing dax with XFS, but i guess this is true for
other file systems too. The stack trace looks as

  [c00000000007610c] set_pte_at+0x3c/0x190
LR [c000000000378628] insert_pfn+0x208/0x280
Call Trace:
[c0000002125df980] [8000000000000104] 0x8000000000000104 (unreliable)
[c0000002125df9c0] [c000000000378488] insert_pfn+0x68/0x280
[c0000002125dfa30] [c0000000004a5494] dax_iomap_pte_fault.isra.7+0x734/0xa40
[c0000002125dfb50] [c000000000627250] __xfs_filemap_fault+0x280/0x2d0
[c0000002125dfbb0] [c000000000373abc] do_wp_page+0x48c/0xa40
[c0000002125dfc00] [c000000000379170] __handle_mm_fault+0x8d0/0x1fd0
[c0000002125dfd00] [c00000000037a9b0] handle_mm_fault+0x140/0x250
[c0000002125dfd40] [c000000000074bb0] __do_page_fault+0x300/0xd60
[c0000002125dfe20] [c00000000000acf4] handle_page_fault+0x18

Now that is WARN_ON in set_pte_at which is

	VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));

Multiple architecture optimize set_pte_at based on the assumption that
we will never use set_pte_at to update a valid pte entry. This helps in
avoid flushing tlb etc. We should be using ptep_set_access_flags for
this.

Hum, I didn't know about this assumption and neither did lot of other
people reviewing DAX patches. Is this documented somewhere?

I guess iomap code doesn't handle this correctly? Or am I missing
some other ways we can end up flushing tlb?

So for RW->RO transition we use ptep_clear_flush() in dax_entry_mkclean()
so that one is certainly safe. Similarly for unmapping. The RO->RW
transition does not seem to have any TLB flush so there TLB could still
carry stale information but it's the same as with normal page faults on
invalid PTEs or with protection faults for normal pages (see e.g.
finish_mkwrite_fault()). 

I am not sure i understood that. RO -> RW transition can have stale TLB 
entries with RO mapping in them. So architecture do flush TLB during the 
fault, some may not. We do have a NestMMU issue with that transition 
which requires us to do mark the pte invalid and flush TLB for that 
transition. (see commit bd5050e38aec3055ff4257ade987d808ac93b582 )

For invalid PTEs we should have a TLB entry at all.

finish_mkwrite_fault do use wp_page_reuse() which does the right thing.

The only thing that's remaining is a situation
when we replace a PTE with zero page with a PTE pointing to a real storage
(block allocation on protection fault). However in this case we do
unmap_mapping_pages() in dax_insert_entry() so the PTE actually gets
cleared before we install a new correct block mapping. So this case is safe
as well. Am I missing something?

Do pfn_mkwrite callback need to insert the pfn details for a RO->RW 
fault type? Can't we skip that pfn insert and let finish_mkwrite_fault 
handle that pte update?

-aneesh