Re: [PATCH] ext4: fix race condition between buffer write and page_mkwrite

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2023/5/30 18:28, Jan Kara wrote:
On Tue 30-05-23 17:39:16, Baokun Li wrote:
On 2023/5/30 15:42, Jan Kara wrote:
On Tue 30-05-23 10:00:44, Baokun Li wrote:
On 2023/5/29 22:44, Jan Kara wrote:
On Mon 29-05-23 16:01:48, Baokun Li wrote:
Syzbot reported a BUG_ON:
==================================================================
EXT4-fs (loop0): mounted filesystem without journal. Quota mode: none.
EXT4-fs error (device loop0): ext4_mb_generate_buddy:1098: group 0, block
        bitmap and bg descriptor inconsistent: 25 vs 150994969 free clusters
------------[ cut here ]------------
kernel BUG at fs/ext4/ext4_jbd2.c:53!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 494 Comm: syz-executor.0 6.1.0-rc7-syzkaller-ga4412fdd49dc #0
RIP: 0010:__ext4_journal_stop+0x1b3/0x1c0
    [...]
Call Trace:
    ext4_write_inline_data_end+0xa39/0xdf0
    ext4_da_write_end+0x1e2/0x950
    generic_perform_write+0x401/0x5f0
    ext4_buffered_write_iter+0x35f/0x640
    ext4_file_write_iter+0x198/0x1cd0
    vfs_write+0x8b5/0xef0
    [...]
==================================================================

The above BUG_ON is triggered by the following race:

              cpu1                    cpu2
________________________|________________________
ksys_write
    vfs_write
     new_sync_write
      ext4_file_write_iter
       ext4_buffered_write_iter
        generic_perform_write
         ext4_da_write_begin
                             do_fault
                              do_page_mkwrite
                               ext4_page_mkwrite
                                ext4_convert_inline_data
                                 ext4_convert_inline_data_nolock
                                  ext4_destroy_inline_data_nolock
                                   //clear EXT4_STATE_MAY_INLINE_DATA
                                  ext4_map_blocks --> return error
          ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
          ext4_block_write_begin
                                  ext4_restore_inline_data
                                   // set EXT4_STATE_MAY_INLINE_DATA
         ext4_da_write_end
          ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
          ext4_write_inline_data_end
           handle=NULL
           ext4_journal_stop(handle)
            __ext4_journal_stop
             ext4_put_nojournal(handle)
              ref_cnt = (unsigned long)handle
              BUG_ON(ref_cnt == 0)  ---> BUG_ON

The root cause of this problem is that the ext4_convert_inline_data() in
ext4_page_mkwrite() does not grab i_rwsem, so it may race with
ext4_buffered_write_iter() and cause the write_begin() and write_end()
functions to be inconsistent and trigger BUG_ON.

To solve the above issue, we cannot add inode_lock directly to
ext4_page_mkwrite(), because this function is a hot path and frequent calls
to inode_lock will cause performance degradation for multi-threaded reads
and writes. Hence, we move ext4_convert_inline_data() to ext4_file_mmap(),
and only when inline_data is enabled and mmap a file in shared write mode,
we hold the lock to convert, which can reduce the impact on performance.

Reported-by: Jun Nie <jun.nie@xxxxxxxxxx>
Closes: https://lore.kernel.org/lkml/63903521.5040307@xxxxxxxxxx/t/
Reported-by: syzbot+a158d886ca08a3fecca4@xxxxxxxxxxxxxxxxxxxxxxxxx
Closes: https://syzkaller.appspot.com/bug?id=899b37f20ce4072bcdfecfe1647b39602e956e36
Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map")
CC: stable@xxxxxxxxxxxxxxx # 4.12+
Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
Thanks for the patch! The problem with i_rwsem in ext4_page_mkwrite() is
not so much about performance as about lock ordering. In
ext4_page_mkwrite() we are called with mmap_sem held and so we cannot
acquire i_rwsem because it ranks about it.
Thank you for your review!

I'm sorry I didn't make myself clear here.

Yes, we can't get i_rwsem after holding mmap_sem at any time, otherwise
ABBA deadlock may occur. The "add inode_lock directly" in my patch
description actually looks like this:
```
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b98d2d58b900..c9318dc2a613 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6025,12 +6025,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
          sb_start_pagefault(inode->i_sb);
          file_update_time(vma->vm_file);

-       filemap_invalidate_lock_shared(mapping);
-
+       inode_lock(inode);
          err = ext4_convert_inline_data(inode);
+       inode_unlock(inode);
          if (err)
                  goto out_ret;

+       filemap_invalidate_lock_shared(mapping);
+
          /*
           * On data journalling we skip straight to the transaction handle:
           * there's no delalloc; page truncated will be checked later; the
```
Yes, but even this could deadlock. The ABBA deadlock I'm speaking about
would not be created with mapping->invalidate_lock but rather with
task->mm->mmap_lock which is acquired at the beginning of page fault in the
arch code.
Thanks for the explanation!

I thought the mmap_sem said was "&EXT4_I(inode)->i_mmap_sem".
We don't have that one anymore :). But yes, there are too many mmap locks so
I should have been more precise from the start. I'm sorry for the
confusion.

But here are some more questions:

1) In the arch code page fault is handled by mmap_read_lock(mm) to get the
    shared lock, why would this lead to ABBA deadlock?

2) Why would page fault be triggered in the write process?

Could you explain it in more detail?
OK, let me draw here the exact deadlock graph:

CPU0				CPU1			CPU2
write(2)			page fault		mmap(2)
				  do_user_addr_fault()
				    mmap_read_lock()      ...
				    ...                   vm_mmap_pgoff()
							    mmap_write_lock_killable(mm)
							      - blocks on CPU1
								holding the lock
   ext4_buffered_write_iter()
     inode_lock(inode);
     ...
     generic_perform_write(iocb, from);
       fault_in_iov_iter_readable(i, bytes)
         - hits pagefault
	do_user_addr_fault()
	  mmap_read_lock()
	    - blocks on CPU2 as writer queued first
				    ext4_page_mkwrite()
				      - would deadlock here if we
				        try to take inode lock

								Honza

Got it! This graph is very clear!  👍

Thank you very much for your patient and detailed explanation!

--
With Best Regards,
Baokun Li
.



[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux