Re: [PATCH] Btrfs: fix fs corruption on transaction abort if device supports discard

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Dec 06, 2014 at 02:41:24PM +0000, Filipe Manana wrote:
> When we abort a transaction we iterate over all the ranges marked as dirty
> in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
> from those trees, add them back (unpin) to the free space caches and, if
> the fs was mounted with "-o discard", perform a discard on those regions.
> Also, after adding the regions to the free space caches, a fitrim ioctl call
> can see those ranges in a block group's free space cache and perform a discard
> on the ranges, so the same issue can happen without "-o discard" as well.
> 
> This causes corruption, affecting one or multiple btree nodes (in the worst
> case leaving the fs unmountable) because some of those ranges (the ones in
> the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
> referred by the last committed super block - breaking the rule that anything
> that was committed by a transaction is untouched until the next transaction
> commits successfully.
> 
> I ran into this while running in a loop (for several hours) the fstest that
> I recently submitted:
> 
>   [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
> 
> The corruption always happened when a transaction aborted and then fsck complained
> like this:
> 
>    _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>    *** fsck.btrfs output ***
>    Check tree block failed, want=94945280, have=0
>    Check tree block failed, want=94945280, have=0
>    Check tree block failed, want=94945280, have=0
>    Check tree block failed, want=94945280, have=0
>    Check tree block failed, want=94945280, have=0
>    read block failed check_tree_block
>    Couldn't open file system

   Ooh, this looks like the mystery bug we've seen a few times over
the last 9 months. The have=0, combined with SSDs, is exactly the
symptoms. I think you may have just relieved Josef of his current
number-one btrfs bug.

> In this case 94945280 corresponded to the root of a tree.
> Using frace what I observed was the following sequence of steps happened:
> 
>    1) transaction N started, fs_info->pinned_extents pointed to
>       fs_info->freed_extents[0];
> 
>    2) node/eb 94945280 is created;
> 
>    3) eb is persisted to disk;
> 
>    4) transaction N commit starts, fs_info->pinned_extents now points to
>       fs_info->freed_extents[1], and transaction N completes;
> 
>    5) transaction N + 1 starts;
> 
>    6) eb is COWed, and btrfs_free_tree_block() called for this eb;
> 
>    7) eb range (94945280 to 94945280 + 16Kb) is added to
>       fs_info->pinned_extents (fs_info->freed_extents[1]);
> 
>    8) Something goes wrong in transaction N + 1, like hitting ENOSPC
>       for example, and the transaction is aborted, turning the fs into
>       readonly mode. The stack trace I got for example:
> 
>       [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
>       [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
>       [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
>       [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
>       [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
>       [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
>       [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
>       [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
>       [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
>       (...)
>       [112065.264493] ---[ end trace dd7903a975a31a08 ]---
>       [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
>       [112065.264997] BTRFS info (device sdc): forced readonly
> 
>    9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
>       fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
>       turn calls btrfs_destroy_pinned_extent();
> 
>    10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
>        marked as dirty in fs_info->freed_extents[], and for each one
>        it calls discard, if the fs was mounted with "-o discard", and
>        adds the range to the free space cache of the respective block
>        group;
> 
>    11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
>        sees the free space entries and performs a discard;
> 
>    12) After an umount and mount (or fsck), our eb's location on disk was full
>        of zeroes, and it should have been untouched, because it was marked as
>        dirty in the fs_info->pinned_extents tree, and therefore used by the
>        trees that the last committed superblock points to.

   Wow. Nice one, Filipe.

   Hugo.

-- 
Hugo Mills             | Startle, startle, little twink. How I wonder what
hugo@... carfax.org.uk | you think.
http://carfax.org.uk/  |
PGP: 65E74AC0          |

Attachment: signature.asc
Description: Digital signature


[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]