dax_lock_mapping_entry was never safe

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Mon, 26 Nov 2018 08:12:40 -0800

I noticed this path while I was doing the 4.19 backport of
dax: Avoid losing wakeup in dax_lock_mapping_entry

                xa_unlock_irq(&mapping->i_pages);
                revalidate = wait_fn();
                finish_wait(wq, &ewait.wait);
                xa_lock_irq(&mapping->i_pages);

It's not safe to call xa_lock_irq() if mapping can have been freed while
we slept.  We'll probably get away with it; most filesystems use a unique
slab for their inodes, so you'll likely get either a freed inode or an
inode which is now the wrong inode.  But if that page has been freed back
to the page allocator, that pointer could now be pointing at anything.

Fixing this in the current codebase is no easier than fixing it in the
4.19 codebase.  This is the best I've come up with.  Could we do better
by not using the _exclusive form of prepare_to_wait()?  I'm not familiar
with all the things that need to be considered when using this family
of interfaces.

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..154b592b18eb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -232,6 +232,24 @@ static void *get_unlocked_entry(struct xa_state *xas)
 	}
 }
 
+static void wait_unlocked_entry(struct xa_state *xas, void *entry)
+{
+	struct wait_exceptional_entry_queue ewait;
+	wait_queue_head_t *wq;
+
+	init_wait(&ewait.wait);
+	ewait.wait.func = wake_exceptional_entry_func;
+
+	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+	prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
+	xas_unlock_irq(xas);
+	/* We can no longer look at xas */
+	schedule();
+	finish_wait(wq, &ewait.wait);
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, 1, &ewait.key);
+}
+
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
@@ -389,9 +407,7 @@ bool dax_lock_mapping_entry(struct page *page)
 		entry = xas_load(&xas);
 		if (dax_is_locked(entry)) {
 			rcu_read_unlock();
-			entry = get_unlocked_entry(&xas);
-			xas_unlock_irq(&xas);
-			put_unlocked_entry(&xas, entry);
+			wait_unlocked_entry(&xas, entry);
 			rcu_read_lock();
 			continue;
 		}