On Mon, Jun 11, 2018 at 8:41 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>
> Some comments below...
>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>>  	}
>>  }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>
> Why do you return struct page here? Any reason behind that? Because struct
> page exists and can be accessed through pfn_to_page() regardless of result
> of this function so it looks a bit confusing. Also dax_lock_page() name
> seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?
>
>> +	pgoff_t index;
>> +	struct inode *inode;
>> +	wait_queue_head_t *wq;
>> +	void *entry = NULL, **slot;
>> +	struct address_space *mapping;
>> +	struct wait_exceptional_entry_queue ewait;
>> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
>> +
>> +	rcu_read_lock();
>> +	for (;;) {
>> +		mapping = READ_ONCE(page->mapping);
>> +
>> +		if (!mapping || !IS_DAX(mapping->host))
>> +			break;
>> +
>> +		/*
>> +		 * In the device-dax case there's no need to lock, a
>> +		 * struct dev_pagemap pin is sufficient to keep the
>> +		 * inode alive.
>> +		 */
>> +		inode = mapping->host;
>> +		if (S_ISCHR(inode->i_mode)) {
>> +			ret = page;
>> +			break;
>> +		}
>> +
>> +		xa_lock_irq(&mapping->i_pages);
>> +		if (mapping != page->mapping) {
>> +			xa_unlock_irq(&mapping->i_pages);
>> +			continue;
>> +		}
>> +		index = page->index;
>> +
>> +		init_wait(&ewait.wait);
>> +		ewait.wait.func = wake_exceptional_entry_func;
>
> This initialization could be before the loop.
>
>> +
>> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> +				&slot);
>> +		if (!entry ||
>> +		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
>> +			xa_unlock_irq(&mapping->i_pages);
>> +			break;
>> +		} else if (!slot_locked(mapping, slot)) {
>> +			lock_slot(mapping, slot);
>> +			ret = page;
>> +			xa_unlock_irq(&mapping->i_pages);
>> +			break;
>> +		}
>> +
>> +		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
>> +		prepare_to_wait_exclusive(wq, &ewait.wait,
>> +				TASK_UNINTERRUPTIBLE);
>> +		xa_unlock_irq(&mapping->i_pages);
>> +		rcu_read_unlock();
>> +		schedule();
>> +		finish_wait(wq, &ewait.wait);
>> +		rcu_read_lock();
>> +	}
>> +	rcu_read_unlock();
>
> I don't like how this duplicates a lot of get_unlocked_mapping_entry().
> Can we possibly factor this out similarly as done for wait_event()?

Ok, I'll give that a shot.
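
For the archives, one shape the factoring could take: pull the sleep-on-locked-entry sequence (waitqueue lookup, prepare_to_wait_exclusive(), unlock, schedule(), finish_wait()) into a small shared helper that both get_unlocked_mapping_entry() and the new lookup loop call. This is only a pseudocode-level sketch against the radix-tree-era API, not a compiled or tested patch, and the helper name dax_wait_entry() is hypothetical:

```c
/*
 * Hypothetical helper: sleep until the locked entry at @index is woken.
 * Caller holds xa_lock_irq(&mapping->i_pages) on entry; the lock is
 * dropped before sleeping and is NOT reacquired on return, so the
 * caller must revalidate the mapping and re-do the lookup afterwards.
 */
static void dax_wait_entry(struct address_space *mapping, pgoff_t index,
		void *entry, struct wait_exceptional_entry_queue *ewait)
{
	wait_queue_head_t *wq;

	wq = dax_entry_waitqueue(mapping, index, entry, &ewait->key);
	prepare_to_wait_exclusive(wq, &ewait->wait, TASK_UNINTERRUPTIBLE);
	xa_unlock_irq(&mapping->i_pages);
	schedule();
	finish_wait(wq, &ewait->wait);
}
```

get_unlocked_mapping_entry() would then loop on lookup + dax_wait_entry(), while the pfn-based lock function would do the same but additionally drop rcu_read_lock() before the call and retake it after, since it relies on RCU rather than a mapping reference to keep the inode alive across the sleep.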