On Mon, Feb 9, 2015 at 12:21 PM, Filipe Manana <fdmanana@xxxxxxxx> wrote: > There's a short time window where a race can happen between two or more > tasks that hold a transaction handle for the same transaction and where > one starts the transaction commit before the other tasks attempt to > split their pending ordered extents list into the transaction's pending > ordered extents lists. This results in the transaction commit not waiting > for those ordered extents to complete, in memory leaks of ordered extent > structures and therefore inode leaks too, since an iput for the ordered > extent's inode is done only when the ordered extent's refcount drops to > zero. This race is described by the following sequence diagram: > > CPU 1 CPU 2 > > btrfs_start_transaction() > started transaction N with > trans->transaction->num_writers == 1 > and trans->transaction->state == > TRANS_STATE_RUNNING > > btrfs_sync_file() > > btrfs_start_transaction() > --> returns transaction > handle pointing to > transaction N > --> Now transaction N's > num_writers == 2 > > btrfs_sync_log() > > btrfs_commit_transaction() > btrfs_wait_pending_ordered() > --> transaction N's ->pending_ordered > processed and is now an empty list > set transaction state to TRANS_STATE_COMMIT_DOING > wait for trans->transaction->num_writers == 1 > > btrfs_wait_logged_extents() > --> adds ordered extents > to trans->ordered list > > btrfs_end_transaction() > --> trans->ordered list is spliced > into transaction N's list > pending_ordered > --> transaction N's num_writers > becomes 1 now > > wait finished, num_writers == 1 > transaction is committed and it doesn't wait > for the ordered extents from CPU 2's task to > complete, nor does it decrement their last > reference, resulting in memory leaks and > inode leaks (the iput on the ordered extent's > inode is done only when the ordered extent's > refcount drops to zero) > > So fix this by processing the transaction's pending_ordered list again > after the number of writers decreases to 1. > > I ran into this issue while running xfstests/generic/113 in a loop, which > failed about 1 out of 10 runs with the following warning in dmesg: > > [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]() > [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse thermal_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio floppy e1000 scsi_mod [last unloaded: btrfs] > [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 > [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 > [ 2612.457709] 0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8 > [ 2612.459829] 0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000 > [ 2612.461564] ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068 > [ 2612.463163] Call Trace: > [ 2612.463719] [<ffffffff8142425e>] dump_stack+0x4c/0x65 > [ 2612.464789] [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb > [ 2612.466026] [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs] > [ 2612.467247] [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c > [ 2612.468416] [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs] > [ 2612.469625] [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs] > [ 2612.471251] [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs] > [ 2612.472536] [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26 > [ 2612.473742] [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs] > [ 2612.475477] [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba > [ 2612.476695] [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs] > [ 2612.477911] [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef > [ 2612.479106] [<ffffffff811540e2>] kill_anon_super+0x13/0x1e > [ 2612.480226] [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs] > [ 2612.481471] [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50 > [ 2612.482686] [<ffffffff811547a7>] deactivate_super+0x3f/0x43 > [ 2612.483791] [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78 > [ 2612.484842] [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14 > [ 2612.485900] [<ffffffff8105d019>] task_work_run+0x8f/0xbc > [ 2612.486960] [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b > [ 2612.488083] [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f > [ 2612.489333] [<ffffffff8142a17f>] int_signal+0x12/0x17 > [ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]--- > [ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds. Have a nice day... > > This happened because evict_inodes (VFS) skipped eviction of one inode > because that inode had an i_count > 0. This prevented btrfs_destroy_inode > from being called against that inode, and the non-zero i_count was because > one ordered extent didn't get its refcount dropped to 0, which is when a > delayed iput is scheduled for the inode. Kmemleak also complained about > this, namely the leak of the ordered extent structure and btrfs inode > specific structures, like the delayed inode: > > $ cat /sys/kernel/debug/kmemleak > unreferenced object 0xffff880154290db0 (size 576): > comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s) > hex dump (first 32 bytes): > 01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff .@.........N.... > 00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff ..........)T.... > backtrace: > [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a > [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83 > [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190 > [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac > [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs] > [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs] > [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs] > [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs] > [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs] > [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed > [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa > [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b > [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 > [<ffffffffffffffff>] 0xffffffffffffffff > unreferenced object 0xffff88014ef11db0 (size 576): > comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s) > hex dump (first 32 bytes): > 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff ...........N.... > backtrace: > [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a > [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83 > [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190 > [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac > [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs] > [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs] > [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs] > [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs] > [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs] > [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed > [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa > [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b > [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 > [<ffffffffffffffff>] 0xffffffffffffffff > unreferenced object 0xffff8800336feda8 (size 584): > comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s) > hex dump (first 32 bytes): > 00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00 .@>........B.... > 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 ................ > backtrace: > [<ffffffff8114eb34>] create_object+0x172/0x29a > [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41 > [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18 > [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198 > [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs] > [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs] > [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs] > [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47 > [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36 > [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs] > [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d > [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs] > [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e > [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff > [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12 > [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 > > This race was introduced with the following change (added in 3.19 and > backported to stable 3.18 and 3.17): > > Btrfs: make sure logged extents complete in the current transaction V3 > commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f > > CC: <stable@xxxxxxxxxxxxxxx> # 3.19, 3.18 and 3.17 > Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx> > --- Please ignore this patch. It accomplishes nothing actually, and the problem is entirely solved by my other patch titled: "Btrfs: fix fsync race leading to ordered extent memory leaks" (https://patchwork.kernel.org/patch/5802281/) > > V2: Changed approach to leave the current wait for pending ordered > extents before waiting for the transaction's number of writers to > decrease to 1 and added a second wait after to process any ordered > extents that might have been added by other tasks while waiting > for the number of writers to decrease to 1. This helps reducing > the amount of time the transaction is in the state > TRANS_STATE_COMMIT_DOING. > > fs/btrfs/transaction.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index d4e81a5..87004c0 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -1877,6 +1877,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, > ret = cur_trans->aborted; > goto scrub_continue; > } > + > + /* > + * Wait for any new pending ordered extents that might have been added > + * by other tasks that were attached to this transaction while we were > + * waiting for cur_trans->num_writers to decrease to 1. > + */ > + btrfs_wait_pending_ordered(cur_trans, root->fs_info); > + > /* > * the reloc mutex makes sure that we stop > * the balancing code from coming in and moving > -- > 2.1.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Filipe David Manana, "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men." -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html