Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Doesn't a larger journal_tag_bytes() size mean more overhead for
the journal?

Cheers, Andreas

> On Aug 12, 2014, at 19:08, "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> wrote:
> 
>> On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
>>> On 8/11/14, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
>>> Hi all,
>>> 
>>> Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if the
>>> the
>>> "journal csum v2" feature flag is turned on, block tag records are being
>>> written with two extra bytes of space because we don't need to execute
>>> "x += sizeof(tag.t_checksum);".  In 32-bit mode, other parts of the journal
>>> then perform incorrect size comparisons, leading to BUG() being called.  In
>>> 64-bit mode, there's enough padding that bad things won't happen.
>>> 
>>> This is a remnant of the days when I tried to enlarge journal_block_tag_t
>>> to
>>> hold the full 32-bit checksum for a data block that's stored in the
>>> journal.
>>> Back in 2011, we decided (though sadly I can't find the link; I think we
>>> might
>>> have discussed this in the concall) that it was better not to change the
>>> size
>>> of journal_block_tag_t than it was to make it bigger so that it could hold
>>> the
>>> full checksum.
>>> 
>>> A simple fix for the problem has been proposed by Mr. Reardon which fixes
>>> journal_tag_bytes() and leaves everything else unchanged.  However, that is
>>> technically a disk format change since the size of journal_block_tag_t on
>>> disk
>>> changes, albeit only for people running experimental metadata_csum
>>> filesystems.
>>> Since we've been allocating disk space for the enlarged checksum this whole
>>> time, if we apply that patch, anyone with an unclean 64bit FS will not be
>>> able
>>> to recover the journal.  (Anyone with an unclean 32-bit FS has been broken
>>> the
>>> whole time, and still will be.)
>>> 
>>> The other thing we could do is actually use the two bytes to store the high
>>> 16-bits of the checksum, fix the jbd2 helper functions to reflect that
>>> reality
>>> (so that they don't BUG()), and change the checksum verify routine to pass
>>> the
>>> block if either the full checksum matches, or if the lower 16 bits match
>>> and
>>> the upper 16 bits are zero.  With this route, anybody with an uncleanly
>>> unmounted FS could still recover the journal, since we're not changing the
>>> size
>>> of anything.
>>> 
>>> Fortunately, journal tag blocks are fairly ephemeral and nobody ought to be
>>> using metadata_csum on a production filesystem, so at this point we can
>>> probably change the disk format without too many repercussions.  If you
>>> make
>>> sure to cleanly unmount the filesystem when changing kernel/e2fsprogs
>>> versions,
>>> there will be no problems.
>>> 
>>> So, uh... comments?  How should we proceed?
>>> 
>>> --D
>> 
>> My only concern is that legacy applies to in-the-wild kernels, not
>> just journals.  Any post 3.4 kernel has this problem, which will be
>> exposed as soon as e2fsprogs 1.43 is released.  A common (enough) use
>> case might be, say, a 2TB USB drive, being unplugged and replugged
>> across machines  without proper shutdown.  Old machines will think
>> they recognize the journal but don't actually, and replay something
>> destructive.
>> 
>> In other words, this is a future-retro problem.  The problem is not so
>> much with existing fs experimenting with metadata_csum, but a future
>> properly-journaled drive being plugged into an legacy faulty machine.
>> Those incompat flags are supposed to protect against just this
>> scenario, but won't.  So perhaps you'll need make a new
>> "INCOMPAT_CSUM_V3" flag to protect?  Actually, dwelling on this a bit
>> today, I think whether or not you retain those 2 extra checksum bytes
>> in final fix, you ought use a new flag for the format.
> 
> Definitely we need a new bit to keep the old kernels out; I wanted others to
> weigh in on whether we ought to shrink the struct or put the two bytes to use.
> 
> That said... journal recovery is not so well tested, so I'm currently busy
> building out debugfs to create journal transactions so we can test recovery
> both with and without *_csum.
> 
> --D
>> 
>> +Reardon
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux