Re: can we finally kill off CONFIG_FS_DAX_LIMITED

Joao Martins <joao.m.martins@xxxxxxxxxx> · Tue, 19 Oct 2021 16:20:16 +0100

On 10/19/21 15:20, Jason Gunthorpe wrote:
> On Mon, Oct 18, 2021 at 09:26:24PM -0700, Dan Williams wrote:
>> On Mon, Oct 18, 2021 at 4:31 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>> On Fri, Oct 15, 2021 at 01:22:41AM +0100, Joao Martins wrote:
>>> I'm not sure the comment is correct anyhow:
>>>
>>>                 /*
>>>                  * Unmap the largest mapping to avoid breaking up
>>>                  * device-dax mappings which are constant size. The
>>>                  * actual size of the mapping being torn down is
>>>                  * communicated in siginfo, see kill_proc()
>>>                  */
>>>                 unmap_mapping_range(page->mapping, start, size, 0);
>>>
>>> Beacuse for non PageAnon unmap_mapping_range() does either
>>> zap_huge_pud(), __split_huge_pmd(), or zap_huge_pmd().
>>>
>>> Despite it's name __split_huge_pmd() does not actually split, it will
>>> call __split_huge_pmd_locked:
>>>
>>>         } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
>>>                 goto out;
>>>         __split_huge_pmd_locked(vma, pmd, range.start, freeze);
>>>
>>> Which does
>>>         if (!vma_is_anonymous(vma)) {
>>>                 old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>>>
>>> Which is a zap, not split.
>>>
>>> So I wonder if there is a reason to use anything other than 4k here
>>> for DAX?
>>>
>>>>       tk->size_shift = page_shift(compound_head(p));
>>>>
>>>> ... as page_shift() would just return PAGE_SHIFT (as compound_order() is 0).
>>>
>>> And what would be so wrong with memory failure doing this as a 4k
>>> page?
>>
>> device-dax does not support misaligned mappings. It makes hard
>> guarantees for applications that can not afford the page table
>> allocation overhead of sub-1GB mappings.
> 
> memory-failure is the wrong layer to enforce this anyhow - if someday
> unmap_mapping_range() did learn to break up the 1GB pages then we'd
> want to put the condition to preserve device-dax mappings there, not
> way up in memory-failure.
> 
> So we can just delete the detection of the page size and rely on the
> zap code to wipe out the entire level, not split it. Which is what we
> have today already.

On a quick note, wrt to @size_shift: memory-failure reflects it back to
userspace as contextual information (::addr_lsb) of the signal, when delivering
the intended SIGBUS(code=BUS_MCEERR_*). So the size needs to be reported
somehow.