On Fri 08-06-18 16:51:14, Dan Williams wrote:
> In preparation for implementing support for memory poison (media error)
> handling via dax mappings, implement a lock_page() equivalent. Poison
> error handling requires rmap and needs guarantees that the page->mapping
> association is maintained / valid (inode not freed) for the duration of
> the lookup.
>
> In the device-dax case it is sufficient to simply hold a dev_pagemap
> reference. In the filesystem-dax case we need to use the entry lock.
>
> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
> protect against the inode being freed, and revalidates the page->mapping
> association under xa_lock().
>
> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

Some comments below...

> diff --git a/fs/dax.c b/fs/dax.c
> index cccf6cad1a7a..b7e71b108fcf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  	}
>  }
>  
> +struct page *dax_lock_page(unsigned long pfn)
> +{

Why do you return struct page here? Is there any reason behind that? The
struct page exists and can be accessed through pfn_to_page() regardless of
the result of this function, so it looks a bit confusing. Also the name
dax_lock_page() seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?

> +	pgoff_t index;
> +	struct inode *inode;
> +	wait_queue_head_t *wq;
> +	void *entry = NULL, **slot;
> +	struct address_space *mapping;
> +	struct wait_exceptional_entry_queue ewait;
> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		mapping = READ_ONCE(page->mapping);
> +
> +		if (!mapping || !IS_DAX(mapping->host))
> +			break;
> +
> +		/*
> +		 * In the device-dax case there's no need to lock, a
> +		 * struct dev_pagemap pin is sufficient to keep the
> +		 * inode alive.
> +		 */
> +		inode = mapping->host;
> +		if (S_ISCHR(inode->i_mode)) {
> +			ret = page;
> +			break;
> +		}
> +
> +		xa_lock_irq(&mapping->i_pages);
> +		if (mapping != page->mapping) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			continue;
> +		}
> +		index = page->index;
> +
> +		init_wait(&ewait.wait);
> +		ewait.wait.func = wake_exceptional_entry_func;

This initialization could be done before the loop.

> +
> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
> +				&slot);
> +		if (!entry ||
> +		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			break;
> +		} else if (!slot_locked(mapping, slot)) {
> +			lock_slot(mapping, slot);
> +			ret = page;
> +			xa_unlock_irq(&mapping->i_pages);
> +			break;
> +		}
> +
> +		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
> +		prepare_to_wait_exclusive(wq, &ewait.wait,
> +				TASK_UNINTERRUPTIBLE);
> +		xa_unlock_irq(&mapping->i_pages);
> +		rcu_read_unlock();
> +		schedule();
> +		finish_wait(wq, &ewait.wait);
> +		rcu_read_lock();
> +	}
> +	rcu_read_unlock();

I don't like how this duplicates a lot of get_unlocked_mapping_entry().
Can we possibly factor this out similarly as it is done for wait_event()?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR