Re: [PATCH] jbd2: Set the right uuid for block tag

"Theodore Ts'o" <tytso@xxxxxxx> · Fri, 15 Jul 2022 13:43:07 -0400

On Fri, Jul 15, 2022 at 11:11:23PM +0800, Wang Jianjian wrote:
> Hi all,
> Is this a real problem need to fix ?
> 
> On 7/12/22 00:26, Wang Jianjian wrote:
> > journal->j_uuid is not initialized and let us use the uuid from
> > j_superblock. And since this is the only place where j_uuid is used
> > so that we can remove it.

There's a really long story, and the short version is, we don't really
need to do anything.

The longer version of the story is that journal->j_uuid is not
supposed to be the uuid from the journal superblock, but rather the
uuid from the file system.  Quoting from the jbd2.h header file:

	/**
	 * @j_uuid:
	 *
	 * Journal uuid: identifies the object (filesystem, LVM volume etc)
	 * backed by this journal.  This will eventually be replaced by an array
	 * of uuids, allowing us to index multiple devices within a single
	 * journal and to perform atomic updates across them.
	 */
	__u8			j_uuid[16];

The original design goal from Stephen Tweedie, who implemented the jbd
subsystem for ext3, was that the jbd/jbd2 layer could in theory be
used for more than just ext3/ext4 (and indeed, it is used by
ocfs/ocfs2), *and*, that in the case of a journal on an external
device, that a single jbd/jbd2 journal could support multiple file
systems.  So you might have a dozen HDD's in a NAS box, and they would
all use a single extrenal device as a journal.

So at the beginning of each block (or revoke) tag, there is a space
for a UUID to indicate what file system that the metadata blocks being
journaled was for.  Those bits are skipped if the JBD2_FLAG_SAME_UUID
is set, in which case the tag is assumed to belong to the same file
system as the previous tag.

The on-disk space has been reserved for this design; we have
JBD2_FLAG_SAME_UUID, and that's why we copy the j_uuid into the tag
block for the first tag, and all other tags gave the SAME_UUID flag
set.  In addition, in the on-disk superblock, we define an array of
UUID's which are the file systems which use this particular external
journal:

/* 0x0100 */
	__u8	s_users[16*48];		/* ids of all fs'es sharing the log */

However, full support for having multiple file systems sharing the
long was never realized.  Part of the reason for this is there are
complications if one of the disks is off-line at the time that the
journal is replayed.  Suppose out of the dozen file systems sharing an
external journal, one of the HDD's is temporarily off-line, or removed
from the NAS box.  Now we can't replay the journal, since there is no
place to put the journal enrties for the missing file system(s).  In
theory we could copy the journal entries for the missing file system
in an file, but then we would have to define some place on the root
file system where the saved journals for the missing HDD could be
found, and that assumes that the root file system is mounted
read/write so it's available to save the journal information.

Another potential problem is that when we *do* need to do a commit, we
have to wait for handles for *all* of the file systems using that
journal to complete, and so an fsync() triggered on one file system
would potentially hold up file system operations on a sibling device.

There might be cases where this would make sense, they would be pretty
specialized situations, and so never has ever decided to implement the
rest of the feature.

At the moment journal->j_uuid is all zeros after
journal_init_common(), and so we're wasting 16 bytes in each journal
descriptor block by filling in those bytes with all zeroes.  The
proper way to "fix" this would be in fs/ext4/super.c, in the functions
ext4_get_inode() and ext4_get_dev_journal(), after those functions
call jbd2_journal_init_{dev,inode}, they should copy
EXT4_SB(super)->s_uuid into journal->j_uuid.

That would properly "initialize" journal->j_uuid, and it would mean
that the file system uuid would be in the tag block.  One potential
gotcha if we were to make this change, and if we wanted to actually
check the uuid in the jbd2 descriptor block, is that today, tune2fs,
and the proposed EXT4_IOC_SETUUID ioctl assume that we can change the
file system uuid without having any potential problems.

If we wanted to properly support this case in the EXT4_IOC_SETUUID
patch, we would have to freeze the file system by calling
ext4_freeze() and then update journal->j_uuid, as well as update
jsb->s_users[] if we are using an external journal.

And for the existing userspace-only "tune2fs -U" code, we currently
don't have a way of triggering an update of the journal->j_uuid field
when the file system uuid is changed.  (We do update the journal
superblock s_users array, but if we did that while the file system was
mounted, it wouldn't be safe unless it called FIFREEZE/FITHAW, which
it currently doesn't do.)

So we could initialize journal->j_uuid if we wanted to, which wouldn't
do much harm, but it wouldn't really fix anything either --- and then
we'd have to make sure that journal is quicesed, and journal->j_uuid
gets updated if the file system UUID is changed while the file system
is mounted.  Once the first tag in the jbd2 descriptor block is now
set correctly, we would then have to find a profitable way to *use*
that information --- either as an additional sanity check on top of
the checksum, or to implement the full support for shared uses for the
journal.  But since the journal checksums are good enough that we
don't need the additional validation, and there hasn't been a real
demand for shared external journals, it probably won't happen until
someone wants to make a business case and fund the engineering work to
make it happen.

Cheers,

						- Ted