The IOCTL path which switches the journaling mode for an inode is currently unsafe because it doesn't properly do a writeback and invalidation on the inode. In XFS, for example, safe transitions of S_DAX are handled by xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O, does a writeback via filemap_write_and_wait() and an invalidation via invalidate_inode_pages2(). Without this in place we can see the following kernel warning when we try and insert a DAX exceptional entry but find that a dirty page cache page is still in the mapping->radix_tree: WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550 Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014 task: ffff88020ccd0000 task.stack: ffffc900021d4000 RIP: 0010:__delete_from_page_cache+0x375/0x550 RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002 RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68 RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5 RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70 R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70 R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68 FS: 00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0 Call Trace: dax_insert_mapping_entry+0x158/0x2c0 dax_iomap_fault+0x1020/0x1bb0 ext4_dax_huge_fault+0xc8/0x160 ext4_dax_fault+0x10/0x20 __do_fault+0x20/0x110 __handle_mm_fault+0x97d/0x1120 handle_mm_fault+0x188/0x2f0 __do_page_fault+0x28f/0x590 trace_do_page_fault+0x58/0x2c0 do_async_page_fault+0x2c/0x90 async_page_fault+0x28/0x30 I'm pretty sure we could make a test that shows userspace visible data corruption as well in this scenario. Make it safe to change the journaling mode and turn on or off S_DAX by adding locking to properly lock out page faults (i_mmap_sem) and then doing the writeback and invalidate. I/O is already held off because all callers of ext4_ioctl_setflags() hold the inode lock. The locking for this new code is complex because of the following: 1) filemap_write_and_wait() eventually calls ext4_writepages(), which acquires the sbi->s_journal_flag_rwsem. This lock ranks above the jbdw_handle which is eventually taken by ext4_journal_start(). This essentially means that the writeback has to happen outside of the context of an active journal handle (outside of ext4_journal_start() to ext4_journal_stop().) 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and this lock again ranks above the jbd2_handle taken by ext4_journal_start(). So, as with the writeback code in 1) above we have to take ei->i_mmap_sem outside of the context of an active journal handle. Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> CC: stable@xxxxxxxxxxxxxxx --- Please let me know if my "context of an active journal handle" terminology makes sense, or if some other phrasing is more common. I'll work on an xfstest which shows this data corruption. --- fs/ext4/inode.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d218991..facb5ae 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5949,13 +5949,15 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) * fallocate or buffered writes in dioread_nolock mode covered by * dirty data which can be converted only after flushing the dirty * data (and journalled aops don't know how to handle these cases). + * + * Because the changes to our journaling mode may also modify S_DAX we + * need to hold off I/O and page faults so we can write back any dirty + * data and invalidate all mappings. */ - if (val) { - down_write(&EXT4_I(inode)->i_mmap_sem); - err = filemap_write_and_wait(inode->i_mapping); - if (err < 0) - goto out_unlock; - } + down_write(&EXT4_I(inode)->i_mmap_sem); + err = filemap_write_and_wait(inode->i_mapping); + if (err < 0) + goto out_unlock; percpu_down_write(&sbi->s_journal_flag_rwsem); jbd2_journal_lock_updates(journal); @@ -5976,6 +5978,11 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) goto out_journal_unlock; ext4_clear_inode_flag(inode, EXT4_INODE_JOURNAL_DATA); } + + err = invalidate_inode_pages2(inode->i_mapping); + if (err < 0) + goto out_journal_unlock; + ext4_set_aops(inode); /* * Update inode->i_flags after EXT4_INODE_JOURNAL_DATA was updated. @@ -5986,8 +5993,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) jbd2_journal_unlock_updates(journal); percpu_up_write(&sbi->s_journal_flag_rwsem); - if (val) - up_write(&EXT4_I(inode)->i_mmap_sem); + up_write(&EXT4_I(inode)->i_mmap_sem); ext4_inode_resume_unlocked_dio(inode); /* Finally we can mark the inode as dirty. */ @@ -6007,8 +6013,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) jbd2_journal_unlock_updates(journal); percpu_up_write(&sbi->s_journal_flag_rwsem); out_unlock: - if (val) - up_write(&EXT4_I(inode)->i_mmap_sem); + up_write(&EXT4_I(inode)->i_mmap_sem); ext4_inode_resume_unlocked_dio(inode); return err; } -- 2.9.5