Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions

On Thu, Jan 4, 2018 at 3:12 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Sat 23-12-17 16:57:31, Dan Williams wrote:
>> +static struct page *dma_busy_page(void *entry)
>> +{
>> +     unsigned long pfn, end_pfn;
>> +
>> +     for_each_entry_pfn(entry, pfn, end_pfn) {
>> +             struct page *page = pfn_to_page(pfn);
>> +
>> +             if (page_ref_count(page) > 1)
>> +                     return page;
>> +     }
>> +     return NULL;
>> +}
>> +
>>  /*
>>   * Find radix tree entry at given index. If it points to an exceptional entry,
>>   * return it with the radix tree entry locked. If the radix tree doesn't
>> @@ -557,6 +570,87 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>>       return entry;
>>  }
>>
>> +int dax_flush_dma(struct address_space *mapping, wait_atomic_t_action_f action)
>
> I don't quite like the 'dma' terminology when this is really all about
> page references. How about renaming dma_busy_page() ->
> devmap_page_referenced() and dax_flush_dma() -> dax_wait_pages_unused(),
> or something like that?

Sure, but this is moot given your better proposal below.

>
>> +{
>> +     pgoff_t indices[PAGEVEC_SIZE];
>> +     struct pagevec pvec;
>> +     pgoff_t index, end;
>> +     unsigned i;
>> +
>> +     /* in the limited case get_user_pages for dax is disabled */
>> +     if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> +             return 0;
>> +
>> +     if (!dax_mapping(mapping))
>> +             return 0;
>> +
>> +     if (mapping->nrexceptional == 0)
>> +             return 0;
>> +
>> +retry:
>> +     pagevec_init(&pvec);
>> +     index = 0;
>> +     end = -1;
>> +     unmap_mapping_range(mapping, 0, 0, 1);
>
> unmap_mapping_range() would IMHO be more logical in the callers. Maybe
> a cleaner API would be to provide a function
> dax_find_referenced_page(mapping) which either returns NULL or a page
> that has an elevated refcount. The filesystem can then drop the locks
> it needs to and call wait_on_atomic_one() (possibly hidden in a DAX
> helper). When the wait finishes, the filesystem can retry. That way the
> whole lock, unlock, wait, retry logic is clearly visible in the fs
> code, and there's no need for an 'action' function or propagation of
> locking state, etc.

Yes, sounds better, I'll go this way.
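
Roughly, I'm imagining the filesystem side would then look something
like this (dax_find_referenced_page() and dax_wait_page_idle() are
placeholder names for the new helpers, sketched against the xfs mmap
lock):

/*
 * Hypothetical fs-side retry loop per your proposal; the dax_*
 * helpers below do not exist yet, names are placeholders.
 */
static int xfs_break_dax_layouts(struct inode *inode)
{
    struct xfs_inode *ip = XFS_I(inode);
    struct page *page;

    while (true) {
        /* zap mappings so new references must go through fault */
        unmap_mapping_range(inode->i_mapping, 0, 0, 1);

        page = dax_find_referenced_page(inode->i_mapping);
        if (!page)
            return 0;

        /* drop the lock, wait for the page to go idle, retry */
        xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
        dax_wait_page_idle(page);
        xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
    }
}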

>
>> +     /*
>> +      * Flush dax_dma_lock() sections to ensure all possible page
>> +      * references have been taken, or will block on the fs
>> +      * 'mmap_lock'.
>> +      */
>> +     synchronize_rcu();
>
> Frankly, I don't like synchronize_rcu() in a relatively hot path like
> this. Can't we just abuse get_dev_pagemap() to fail if truncation is in
> progress for the pfn? We could indicate that by some bit in struct page
> or something like that.

We would need a lockless way to conditionally take a reference, i.e.
only when the page is not subject to truncation.

I recall the raid5 code doing something similar, where it split a
reference count into two fields: i.e. take page->_refcount and use the
upper bits as a truncation count. Something like:

do {
    old = atomic_read(&page->_refcount);
    if (old & trunc_mask) /* upper bits of _refcount */
        return false;
    new = old + 1;
} while (atomic_cmpxchg(&page->_refcount, old, new) != old);
return true; /* we incremented the _refcount while the truncation
              * count was zero */

...the only concern is teaching the put_page() path to consider that
'trunc_mask' when determining that the page is idle.
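
For the put side I'm imagining something like this (dax_page_idle()
being a hypothetical helper, trunc_mask as above):

/*
 * Sketch: a page only counts as idle when the reference-count
 * portion of _refcount has dropped back to 1, ignoring whatever
 * is stashed in the (hypothetical) trunc_mask upper bits.
 */
static bool dax_page_idle(struct page *page)
{
    return (page_ref_count(page) & ~trunc_mask) == 1;
}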

Other ideas?

>> +     while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
>> +                             min(end - index, (pgoff_t)PAGEVEC_SIZE),
>> +                             indices)) {
>> +             int rc = 0;
>> +
>> +             for (i = 0; i < pagevec_count(&pvec); i++) {
>> +                     struct page *pvec_ent = pvec.pages[i];
>> +                     struct page *page = NULL;
>> +                     void *entry;
>> +
>> +                     index = indices[i];
>> +                     if (index >= end)
>> +                             break;
>> +
>> +                     if (!radix_tree_exceptional_entry(pvec_ent))
>> +                             continue;
>
> This would be a bug, so I'm not sure we need to handle that.

Sure, I can kill that check.


