Re: [PATCH v2] Btrfs: fix race waiting for ordered extents at transaction commit

Filipe David Manana <fdmanana@xxxxxxxxx> · Mon, 9 Feb 2015 17:18:44 +0000

On Mon, Feb 9, 2015 at 12:21 PM, Filipe Manana <fdmanana@xxxxxxxx> wrote:
> There's a short time window where a race can happen between two or more
> tasks that hold a transaction handle for the same transaction and where
> one starts the transaction commit before the other tasks attempt to
> split their pending ordered extents list into the transaction's pending
> ordered extents lists. This results in the transaction commit not waiting
> for those ordered extents to complete, in memory leaks of ordered extent
> structures and therefore inode leaks too, since an iput for the ordered
> extent's inode is done only when the ordered extent's refcount drops to
> zero. This race is described by the following sequence diagram:
>
>          CPU 1                                           CPU 2
>
> btrfs_start_transaction()
>    started transaction N with
>    trans->transaction->num_writers == 1
>    and trans->transaction->state ==
>      TRANS_STATE_RUNNING
>
>                                                      btrfs_sync_file()
>
>                                                        btrfs_start_transaction()
>                                                          --> returns transaction
>                                                              handle pointing to
>                                                              transaction N
>                                                          --> Now transaction N's
>                                                              num_writers == 2
>
>                                                        btrfs_sync_log()
>
> btrfs_commit_transaction()
>    btrfs_wait_pending_ordered()
>       --> transaction N's ->pending_ordered
>           processed and is now an empty list
>    set transaction state to TRANS_STATE_COMMIT_DOING
>    wait for trans->transaction->num_writers == 1
>
>                                                          btrfs_wait_logged_extents()
>                                                             --> adds ordered extents
>                                                                 to trans->ordered list
>
>                                                        btrfs_end_transaction()
>                                                          --> trans->ordered list is spliced
>                                                              into transaction N's list
>                                                              pending_ordered
>                                                          --> transaction N's num_writers
>                                                              becomes 1 now
>
>   wait finished, num_writers == 1
>   transaction is committed and it doesn't wait
>   for the ordered extents from CPU 2's task to
>   complete, nor does it decrement their last
>   reference, resulting in memory leaks and
>   inode leaks (the iput on the ordered extent's
>   inode is done only when the ordered extent's
>   refcount drops to zero)
>
> So fix this by processing the transaction's pending_ordered list again
> after the number of writers decreases to 1.
>
> I ran into this issue while running xfstests/generic/113 in a loop, which
> failed about 1 out of 10 runs with the following warning in dmesg:
>
> [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
> [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse thermal_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio floppy e1000 scsi_mod [last unloaded: btrfs]
> [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
> [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [ 2612.457709]  0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
> [ 2612.459829]  0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
> [ 2612.461564]  ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
> [ 2612.463163] Call Trace:
> [ 2612.463719]  [<ffffffff8142425e>] dump_stack+0x4c/0x65
> [ 2612.464789]  [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb
> [ 2612.466026]  [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs]
> [ 2612.467247]  [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c
> [ 2612.468416]  [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs]
> [ 2612.469625]  [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
> [ 2612.471251]  [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
> [ 2612.472536]  [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26
> [ 2612.473742]  [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs]
> [ 2612.475477]  [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba
> [ 2612.476695]  [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs]
> [ 2612.477911]  [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef
> [ 2612.479106]  [<ffffffff811540e2>] kill_anon_super+0x13/0x1e
> [ 2612.480226]  [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
> [ 2612.481471]  [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50
> [ 2612.482686]  [<ffffffff811547a7>] deactivate_super+0x3f/0x43
> [ 2612.483791]  [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78
> [ 2612.484842]  [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14
> [ 2612.485900]  [<ffffffff8105d019>] task_work_run+0x8f/0xbc
> [ 2612.486960]  [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b
> [ 2612.488083]  [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 2612.489333]  [<ffffffff8142a17f>] int_signal+0x12/0x17
> [ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
> [ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.  Have a nice day...
>
> This happened because evict_inodes (VFS) skipped eviction of one inode
> because that inode had an i_count > 0. This prevented btrfs_destroy_inode
> from being called against that inode, and the non-zero i_count was because
> one ordered extent didn't get its refcount dropped to 0, which is when a
> delayed iput is scheduled for the inode. Kmemleak also complained about
> this, namely the leak of the ordered extent structure and btrfs inode
> specific structures, like the delayed inode:
>
> $ cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff880154290db0 (size 576):
>   comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
>   hex dump (first 32 bytes):
>     01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff  .@.........N....
>     00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff  ..........)T....
>   backtrace:
>     [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
>     [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
>     [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
>     [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
>     [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
>     [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
>     [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
>     [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
>     [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
>     [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
>     [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
>     [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
>     [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
>     [<ffffffffffffffff>] 0xffffffffffffffff
> unreferenced object 0xffff88014ef11db0 (size 576):
>   comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
>   hex dump (first 32 bytes):
>     02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff  ...........N....
>   backtrace:
>     [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
>     [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
>     [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
>     [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
>     [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
>     [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
>     [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
>     [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
>     [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
>     [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
>     [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
>     [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
>     [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
>     [<ffffffffffffffff>] 0xffffffffffffffff
> unreferenced object 0xffff8800336feda8 (size 584):
>   comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
>   hex dump (first 32 bytes):
>     00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00  .@>........B....
>     00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
>   backtrace:
>     [<ffffffff8114eb34>] create_object+0x172/0x29a
>     [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41
>     [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18
>     [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198
>     [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
>     [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
>     [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
>     [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47
>     [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36
>     [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
>     [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d
>     [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs]
>     [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e
>     [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff
>     [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12
>     [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
>
> This race was introduced with the following change (added in 3.19 and
> backported to stable 3.18 and 3.17):
>
>   Btrfs: make sure logged extents complete in the current transaction V3
>   commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f
>
> CC: <stable@xxxxxxxxxxxxxxx> # 3.19, 3.18 and 3.17
> Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
> ---

Please ignore this patch. It accomplishes nothing actually, and the
problem is entirely solved by my other patch titled:

"Btrfs: fix fsync race leading to ordered extent memory leaks"
(https://patchwork.kernel.org/patch/5802281/)

>
> V2: Changed approach to leave the current wait for pending ordered
>     extents before waiting for the transaction's number of writers to
>     decrease to 1 and added a second wait after to process any ordered
>     extents that might have been added by other tasks while waiting
>     for the number of writers to decrease to 1. This helps reducing
>     the amount of time the transaction is in the state
>     TRANS_STATE_COMMIT_DOING.
>
>  fs/btrfs/transaction.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index d4e81a5..87004c0 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1877,6 +1877,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>                 ret = cur_trans->aborted;
>                 goto scrub_continue;
>         }
> +
> +       /*
> +        * Wait for any new pending ordered extents that might have been added
> +        * by other tasks that were attached to this transaction while we were
> +        * waiting for cur_trans->num_writers to decrease to 1.
> +        */
> +       btrfs_wait_pending_ordered(cur_trans, root->fs_info);
> +
>         /*
>          * the reloc mutex makes sure that we stop
>          * the balancing code from coming in and moving
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html