On Sat, Dec 06, 2014 at 02:41:24PM +0000, Filipe Manana wrote: > When we abort a transaction we iterate over all the ranges marked as dirty > in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them > from those trees, add them back (unpin) to the free space caches and, if > the fs was mounted with "-o discard", perform a discard on those regions. > Also, after adding the regions to the free space caches, a fitrim ioctl call > can see those ranges in a block group's free space cache and perform a discard > on the ranges, so the same issue can happen without "-o discard" as well. > > This causes corruption, affecting one or multiple btree nodes (in the worst > case leaving the fs unmountable) because some of those ranges (the ones in > the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are > referred by the last committed super block - breaking the rule that anything > that was committed by a transaction is untouched until the next transaction > commits successfully. > > I ran into this while running in a loop (for several hours) the fstest that > I recently submitted: > > [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim > > The corruption always happened when a transaction aborted and then fsck complained > like this: > > _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent > *** fsck.btrfs output *** > Check tree block failed, want=94945280, have=0 > Check tree block failed, want=94945280, have=0 > Check tree block failed, want=94945280, have=0 > Check tree block failed, want=94945280, have=0 > Check tree block failed, want=94945280, have=0 > read block failed check_tree_block > Couldn't open file system Ooh, this looks like the mystery bug we've seen a few times over the last 9 months. The have=0, combined with SSDs, is exactly the symptoms. I think you may have just relieved Josef of his current number-one btrfs bug. > In this case 94945280 corresponded to the root of a tree. > Using frace what I observed was the following sequence of steps happened: > > 1) transaction N started, fs_info->pinned_extents pointed to > fs_info->freed_extents[0]; > > 2) node/eb 94945280 is created; > > 3) eb is persisted to disk; > > 4) transaction N commit starts, fs_info->pinned_extents now points to > fs_info->freed_extents[1], and transaction N completes; > > 5) transaction N + 1 starts; > > 6) eb is COWed, and btrfs_free_tree_block() called for this eb; > > 7) eb range (94945280 to 94945280 + 16Kb) is added to > fs_info->pinned_extents (fs_info->freed_extents[1]); > > 8) Something goes wrong in transaction N + 1, like hitting ENOSPC > for example, and the transaction is aborted, turning the fs into > readonly mode. The stack trace I got for example: > > [112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66 > [112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98 > [112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs] > [112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50 > [112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs] > [112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs] > [112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs] > [112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs] > [112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs] > (...) > [112065.264493] ---[ end trace dd7903a975a31a08 ]--- > [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left > [112065.264997] BTRFS info (device sdc): forced readonly > > 9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in > fs_info->fs_state and calls btrfs_cleanup_transaction(), which in > turn calls btrfs_destroy_pinned_extent(); > > 10) Then btrfs_destroy_pinned_extent() iterates over all the ranges > marked as dirty in fs_info->freed_extents[], and for each one > it calls discard, if the fs was mounted with "-o discard", and > adds the range to the free space cache of the respective block > group; > > 11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path, > sees the free space entries and performs a discard; > > 12) After an umount and mount (or fsck), our eb's location on disk was full > of zeroes, and it should have been untouched, because it was marked as > dirty in the fs_info->pinned_extents tree, and therefore used by the > trees that the last committed superblock points to. Wow. Nice one, Filipe. Hugo. -- Hugo Mills | Startle, startle, little twink. How I wonder what hugo@... carfax.org.uk | you think. http://carfax.org.uk/ | PGP: 65E74AC0 |
Attachment:
signature.asc
Description: Digital signature