On Tue, Jun 12, 2018 at 11:15 AM, Ross Zwisler
<ross.zwisler@xxxxxxxxxxxxxxx> wrote:
> On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>> ---
>>  fs/dax.c            |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/dax.h |   15 ++++++++++
>>  2 files changed, 91 insertions(+)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>>  	}
>>  }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>> +	pgoff_t index;
>> +	struct inode *inode;
>> +	wait_queue_head_t *wq;
>> +	void *entry = NULL, **slot;
>> +	struct address_space *mapping;
>> +	struct wait_exceptional_entry_queue ewait;
>> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
>> +
>> +	rcu_read_lock();
>> +	for (;;) {
>> +		mapping = READ_ONCE(page->mapping);
>
> Why the READ_ONCE()?

We're potentially racing inode teardown, so the READ_ONCE() prevents
the compiler from trying to de-reference page->mapping twice and
getting inconsistent answers.

>
>> +
>> +		if (!mapping || !IS_DAX(mapping->host))
>
> Might read better using the dax_mapping() helper.

Sure.
>
> Also, forgive my ignorance, but this implies that dev dax has page->mapping
> set up and that that inode will have IS_DAX set, right? This will let us get
> past this point for device DAX, and we'll bail out at the S_ISCHR() check?

Yes.

>
>> +			break;
>> +
>> +		/*
>> +		 * In the device-dax case there's no need to lock, a
>> +		 * struct dev_pagemap pin is sufficient to keep the
>> +		 * inode alive.
>> +		 */
>> +		inode = mapping->host;
>> +		if (S_ISCHR(inode->i_mode)) {
>> +			ret = page;
>
> 'ret' isn't actually used for anything in this function, we just
> unconditionally return 'page'.
>

Yes, bug.

>> +			break;
>> +		}
>> +
>> +		xa_lock_irq(&mapping->i_pages);
>> +		if (mapping != page->mapping) {
>> +			xa_unlock_irq(&mapping->i_pages);
>> +			continue;
>> +		}
>> +		index = page->index;
>> +
>> +		init_wait(&ewait.wait);
>> +		ewait.wait.func = wake_exceptional_entry_func;
>> +
>> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> +				&slot);
>> +		if (!entry ||
>
> So if we do a lookup and there is no entry in the tree, we won't add an empty
> entry and lock it, we'll just return with no entry in the tree and nothing
> locked.
>
> Then, when we call dax_unlock_page(), we'll eventually hit a WARN_ON_ONCE() in
> dax_unlock_mapping_entry() when we see entry is 0. And, in that gap we've got
> nothing locked so page faults could have happened, etc... (which would mean
> that instead of WARN_ON_ONCE() for an empty entry, we'd hit it instead for an
> unlocked entry).
>
> Is that okay? Or do we need to insert a locked empty entry here?

No, the intent was to return NULL and fail the lock, but I messed up
and unconditionally returned the page.
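
To make the intended return semantics concrete, here is a minimal
userspace sketch of the control flow, not the actual fix. The types
mapping_model and page_model and the function lock_page_model() are
hypothetical stand-ins for the kernel structures. The retry loop, RCU,
xa_lock, and waiting on an already-locked entry are deliberately left
out, so this only illustrates "return the page on success, NULL
otherwise".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct mapping_model {
	bool is_dax;       /* models IS_DAX(mapping->host) */
	bool is_chardev;   /* models S_ISCHR(inode->i_mode), i.e. device-dax */
	bool has_entry;    /* models a present entry at page->index */
	bool entry_locked; /* set when the model "takes" the entry lock */
};

struct page_model {
	struct mapping_model *mapping;
};

/*
 * Returns the page when it is safe to use (device-dax, or filesystem-dax
 * with the entry locked) and NULL when the lock attempt fails.
 */
static struct page_model *lock_page_model(struct page_model *page)
{
	struct mapping_model *mapping;

	assert(page != NULL);

	/* Snapshot the pointer once and only use the local copy below. */
	mapping = page->mapping;
	if (!mapping || !mapping->is_dax)
		return NULL;

	/* Device-dax: no entry lock needed, the caller's pin is enough. */
	if (mapping->is_chardev)
		return page;

	/* Filesystem-dax: with no entry there is nothing to lock, so fail. */
	if (!mapping->has_entry)
		return NULL;

	mapping->entry_locked = true;
	return page;
}
```

With this shape a caller only needs to unlock when the return value is
non-NULL, so the unlock path never sees an empty or unlocked entry.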