Re: [PATCH] ext4: fix race condition between buffer write and page_mkwrite

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2023/5/30 15:42, Jan Kara wrote:
On Tue 30-05-23 10:00:44, Baokun Li wrote:
On 2023/5/29 22:44, Jan Kara wrote:
On Mon 29-05-23 16:01:48, Baokun Li wrote:
Syzbot reported a BUG_ON:
==================================================================
EXT4-fs (loop0): mounted filesystem without journal. Quota mode: none.
EXT4-fs error (device loop0): ext4_mb_generate_buddy:1098: group 0, block
       bitmap and bg descriptor inconsistent: 25 vs 150994969 free clusters
------------[ cut here ]------------
kernel BUG at fs/ext4/ext4_jbd2.c:53!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 494 Comm: syz-executor.0 6.1.0-rc7-syzkaller-ga4412fdd49dc #0
RIP: 0010:__ext4_journal_stop+0x1b3/0x1c0
   [...]
Call Trace:
   ext4_write_inline_data_end+0xa39/0xdf0
   ext4_da_write_end+0x1e2/0x950
   generic_perform_write+0x401/0x5f0
   ext4_buffered_write_iter+0x35f/0x640
   ext4_file_write_iter+0x198/0x1cd0
   vfs_write+0x8b5/0xef0
   [...]
==================================================================

The above BUG_ON is triggered by the following race:

             cpu1                    cpu2
________________________|________________________
ksys_write
   vfs_write
    new_sync_write
     ext4_file_write_iter
      ext4_buffered_write_iter
       generic_perform_write
        ext4_da_write_begin
                            do_fault
                             do_page_mkwrite
                              ext4_page_mkwrite
                               ext4_convert_inline_data
                                ext4_convert_inline_data_nolock
                                 ext4_destroy_inline_data_nolock
                                  //clear EXT4_STATE_MAY_INLINE_DATA
                                 ext4_map_blocks --> return error
         ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
         ext4_block_write_begin
                                 ext4_restore_inline_data
                                  // set EXT4_STATE_MAY_INLINE_DATA
        ext4_da_write_end
         ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
         ext4_write_inline_data_end
          handle=NULL
          ext4_journal_stop(handle)
           __ext4_journal_stop
            ext4_put_nojournal(handle)
             ref_cnt = (unsigned long)handle
             BUG_ON(ref_cnt == 0)  ---> BUG_ON

The root cause of this problem is that the ext4_convert_inline_data() in
ext4_page_mkwrite() does not grab i_rwsem, so it may race with
ext4_buffered_write_iter() and cause the write_begin() and write_end()
functions to be inconsistent and trigger BUG_ON.

To solve the above issue, we cannot add inode_lock directly to
ext4_page_mkwrite(), because this function is a hot path and frequent calls
to inode_lock will cause performance degradation for multi-threaded reads
and writes. Hence, we move ext4_convert_inline_data() to ext4_file_mmap(),
and only when inline_data is enabled and mmap a file in shared write mode,
we hold the lock to convert, which can reduce the impact on performance.

Reported-by: Jun Nie <jun.nie@xxxxxxxxxx>
Closes: https://lore.kernel.org/lkml/63903521.5040307@xxxxxxxxxx/t/
Reported-by: syzbot+a158d886ca08a3fecca4@xxxxxxxxxxxxxxxxxxxxxxxxx
Closes: https://syzkaller.appspot.com/bug?id=899b37f20ce4072bcdfecfe1647b39602e956e36
Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map")
CC: stable@xxxxxxxxxxxxxxx # 4.12+
Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
Thanks for the patch! The problem with i_rwsem in ext4_page_mkwrite() is
not so much about performance as about lock ordering. In
ext4_page_mkwrite() we are called with mmap_sem held and so we cannot
acquire i_rwsem because it ranks about it.
Thank you for your review!

I'm sorry I didn't make myself clear here.

Yes, we can't get i_rwsem after holding mmap_sem at any time, otherwise
ABBA deadlock may occur. The "add inode_lock directly" in my patch
description actually looks like this:
```
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b98d2d58b900..c9318dc2a613 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6025,12 +6025,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
         sb_start_pagefault(inode->i_sb);
         file_update_time(vma->vm_file);

-       filemap_invalidate_lock_shared(mapping);
-
+       inode_lock(inode);
         err = ext4_convert_inline_data(inode);
+       inode_unlock(inode);
         if (err)
                 goto out_ret;

+       filemap_invalidate_lock_shared(mapping);
+
         /*
          * On data journalling we skip straight to the transaction handle:
          * there's no delalloc; page truncated will be checked later; the
```
Yes, but even this could deadlock. The ABBA deadlock I'm speaking about
would not be created with mapping->invalidate_lock but rather with
task->mm->mmap_lock which is acquired at the beginning of page fault in the
arch code.

Thanks for the explanation!

I thought the mmap_sem said was "&EXT4_I(inode)->i_mmap_sem".

But here are some more questions:

1) In the arch code page fault is handled by mmap_read_lock(mm) to get the
   shared lock, why would this lead to ABBA deadlock?

2) Why would page fault be triggered in the write process?

Could you explain it in more detail?

And when you do write(2), you hold inode_lock() and then you
copy data from the use provided buffer to the pagecache pages and that can
cause a page fault on the user provided buffer which will try to grab
task->mm->mmap_lock.

This lock inversion is the main reason why inode lock cannot be used
anywhere in the page fault path.

								Honza

Looking forward to hearing from you! 🤔

--
With Best Regards,
Baokun Li
.



[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux