Re: [PATCH v2] hfsplus: add journal replay

Hi Sergei,

On Mon, 2014-02-24 at 11:05 +0100, Sergei Antonov wrote:

We have really different visions of how this functionality should be
implemented, and I have many remarks about your implementation. So, I
suggest we discuss the principal points first. We need to reach an
understanding of each other's positions, and then it makes sense to
implement a joint solution. Currently, both of our implementations have
shortcomings, from my point of view.
 
> The replay code checks the integrity of the whole journal before actually
> replaying sectors. If any error (e.g. checksum mismatch, invalid/misaligned
> offsets) is detected, the volume will remain unchanged and the journal will not
> be cleared.

I think such an approach is overkill. Moreover, it is the wrong
approach.

First of all, the journal contains a sequence of transactions. Technical
Note TN1150: "A group of related changes is called a transaction. When all
of the changes of a transaction have been written to their normal locations
on disk, that transaction has been committed, and is removed from the
journal. The journal may contain several transactions. Copying changes
from all transactions to their normal locations on disk is called
replaying the journal". So, transactions can be committed one after
another, and this leaves the HFS+ volume in a consistent state after each
commit. It doesn't make sense to check all transactions up front and only
then commit them.

Secondly, suppose, for example, there was a sudden power-off. It means that
the last transaction will be unfinished; in other words, simply broken or
invalid. In your approach you will reject all the transactions in the
journal. But the goal of the journal is to commit all valid transactions
that precede the invalid one. Otherwise, the file system remains in a
corrupted state without journal replay.

And, finally, if you check all transactions first, then you need to read
all of them into memory before actually replaying. And, as far as I can
judge, you simply keep all sectors in memory before replay. For example,
suppose your system has 512 MB of RAM and the HFS+ volume has a completely
filled journal 512 MB in size. How lucky will you be with journal replay
on such a system without swap?

So, I use another approach in my code: I check a transaction and then
replay it. And I think that transaction-based journal replay is the
reasonable and right way.
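
To make my position concrete, here is a simplified sketch of the
transaction-based scheme (pseudo-kernel code, not the actual code from my
patchset; struct hfsplus_transaction and all the helper functions used
here are hypothetical names):

static int hfsplus_replay_journal(struct super_block *sb)
{
	int err = 0;

	while (journal_is_not_empty(sb)) {
		struct hfsplus_transaction tr;

		/* read the next block list and its data */
		err = hfsplus_read_transaction(sb, &tr);
		if (err)
			break;

		/*
		 * Checksum mismatch means an unfinished/broken transaction:
		 * stop here, everything committed so far is consistent.
		 */
		if (!hfsplus_transaction_is_valid(&tr))
			break;

		/* write the blocks of this transaction to their real places */
		err = hfsplus_replay_transaction(sb, &tr);
		if (err)
			break;

		/* move the journal start past the committed transaction */
		err = hfsplus_advance_journal_start(sb, &tr);
		if (err)
			break;
	}

	return err;
}

Replay stops at the first invalid transaction, and every transaction
committed before it has already left the volume in a consistent state.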

> Tested with:
>  drives with 512 and 4K sector sizes

I've tested my code with 512-byte, 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB
and 64 KB block sizes. And I think that such testing is really
necessary. I found and fixed several bugs specifically for block sizes
greater than 4 KB. Your code is really complex, and I am afraid that the
8 - 64 KB block sizes will reveal some bugs.

>  default (8 MB) and custom (1 MB, 500 MB) journal sizes
>  big-endian (PowerPC) and little-endian (Intel) journal byte sexes

The last version of my patchset handles both little-endian and big-endian
journals. So, I can't see a difference between our patches here.

> Advantages over a previously presented
>    "[PATCH v4 00/15] hfsplus: introduce journal replay functionality"
>    by a different author:

My name is not "different author". My name is Vyacheslav Dubeyko. If you
wanted to insult me, then you have achieved the goal.

> * 12 times faster mount with replay (0.5s versus 6s on my test volume)

OK. Let's talk about it.

During implementation I thought about more optimal bio submission, but I
simply chose the hfsplus_submit_bio() method, because journal replay is
not such a frequent operation: it takes place only during the mount
operation. Usually, the journal contains several transactions after a
clean umount (or it can simply be empty). It can contain many
transactions in the case of a sudden power-off, for example. But such a
situation is not so frequent, and it is more important for an end-user to
have reliable journal replay. I suppose that every user is ready to wait
a spare second (or even a minute) to achieve a consistent state of a
volume after mount, because the goal of journal replay is to guarantee
file system consistency, not performance.

OK, I also believe that performance is important. But you have suggested
a really big and complicated implementation that, moreover, is really
hard to understand at a glance. I can suggest a much, much simpler
solution. First of all, transaction-based journal replay is my principal
position. Secondly, it is possible to submit reads of all of a
transaction's blocks before starting to process the transaction. Such
submission can be made by means of an optimized version of
hfsplus_submit_bio() or a similar analog. Thirdly, it is possible to
calculate a block's checksum in the bio->bi_end_io method for the case
of block sizes smaller than 8 KB (512 bytes, 1 KB, 2 KB, 4 KB). This
method is called by the block layer at the end of block processing.
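
For illustration, here is a rough sketch of what I have in mind (not code
from either patchset; struct hfsplus_jnl_block and hfsplus_jnl_end_read()
are hypothetical names, and the checksum routine is the one described in
TN1150):

/* Hypothetical per-block context passed via bio->bi_private. */
struct hfsplus_jnl_block {
	void *data;		/* journal block contents */
	unsigned int len;	/* block size in bytes */
	u32 checksum;		/* computed at I/O completion */
	int error;
	struct completion done;
};

/* Checksum algorithm from TN1150; cheap enough to run at I/O completion. */
static u32 hfsplus_jnl_checksum(const unsigned char *ptr, unsigned int len)
{
	unsigned int i;
	u32 cksum = 0;

	for (i = 0; i < len; i++, ptr++)
		cksum = (cksum << 8) ^ (cksum + *ptr);
	return ~cksum;
}

/* Sketch of a bi_end_io callback that computes the checksum right away. */
static void hfsplus_jnl_end_read(struct bio *bio, int error)
{
	struct hfsplus_jnl_block *jb = bio->bi_private;

	if (!error)
		jb->checksum = hfsplus_jnl_checksum(jb->data, jb->len);
	jb->error = error;
	complete(&jb->done);
}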

> * supports 4K sector devices

Could you point out where you make special efforts for this? And why do
you need to implement something in this direction?

I tried to find such an implementation in your code but failed. I
suspect that I have misunderstood something.

> * no redundant stuff like journal initialization

My patchset is the first step in implementing full-featured journaling
support in the HFS+ driver. So, implementing the journal initialization
functionality together with the journal replay functionality is a
reasonable action, because these two features are related to each other,
from my viewpoint.

Of course, it is possible not to initialize the journal during mount at
the current step of implementation. But implementing the journal
initialization functionality at the same step as the journal replay
functionality is really necessary.

Still, I think that the journal needs to be initialized during mount of
an HFS+ volume (in the case of the kJIJournalNeedInitMask flag), because
that coincides with Technical Note TN1150. And if we have bugs in this
functionality, then they need to be fixed. Anyway, if we want
full-featured journaling support in the HFS+ driver, then we need to
implement the journal initialization functionality.

> * no fanatical use of macros with parameters

I disagree that I use macros fanatically. I agree that using macros is
in really bad taste in C++, but it is a common technique in the C
language, and it is widely used in the Linux kernel. So, using macros is
not my invention.

Moreover, the macros in my patchset are simple code patterns that can be
seen as simple abbreviations. For example:

#define JIB_BLOCK(sb) \
        (be32_to_cpu(HFSPLUS_SB(sb)->s_vhdr->journal_info_block))

I am unable to see anything bad in such a macro. It shortens the code
and makes it more readable at the point of use.

But why do you fanatically use such macros as hfs_dbg(),
be32_to_cpu() and so on? Stop using macros in your patch if you are
appealing for macro abstinence.

> * shorter, easier to review

Really? :)

It is really hard for me to review your patch, because it is big,
complex and not divided into logical parts.

Your patch contains 1015 lines of code in total.

My patchset contains 1598 lines of code in total. If I exclude: (1) several
big comments - 99 lines, (2) the journal initialization functionality - ~192
lines, (3) the "norecovery" mount option functionality - ~134 lines, then I
have about 1173 lines of code in my patchset.

So, it is 1015 lines of code vs. 1173 lines; not such a big difference. And
if you adopted my comments, ~100 lines (I believe that my comments are
really reasonable), and the "norecovery" mount option, ~100 lines of code,
then the difference would vanish completely.

> @@ -166,6 +167,10 @@ struct hfsplus_sb_info {
>  	u32 total_blocks;
>  	u32 data_clump_blocks, rsrc_clump_blocks;
>  
> +	/* immutable data on the placement of the journal (if there is one) */
> +	u64 journal_offset;	/* relative to volume start */
> +	u64 journal_size;	/* relative to volume start */
> +

My patchset is the first step in the implementation of full-featured
journal support in the HFS+ driver. So, in my patchset I suggested storing
the following journal-related structure in the superblock info:

struct hfsplus_journal {
        struct mutex jnl_lock;

        /* Journal info block specific */
        void *jib_buf;
        struct hfsplus_journal_info_block *jib;

        /* Journal header specific */
        void *jh_buf;
        struct hfsplus_journal_header *jh;
        u8 endianness;

        /* TODO: Pointer of JBD */

        struct super_block *sbp;
};

I suppose that such a structure is reasonable and contains all the
necessary starting details for journaling on an HFS+ volume.

Otherwise, if we are talking about journal replay only, then I don't
see any necessity in keeping any info about the journal. In that case, it
doesn't make sense to add anything related to journal details to the
superblock info structure. Journal replay is a self-sufficient
operation.

>  
> +/* journal.c */
> +int hfsplus_jrnl_replay(struct super_block *, u32, bool *);

I really dislike "jrnl" abbreviation. Using "journal" instead of "jrnl"
is really reasonable and sensible way. I don't think that it makes sense
to economize on 3 symbols.

I tried to understand how you process the binfo array. But it is really
hard to understand the whole patch, so I will simply ask. One block list
(binfo array) can have a size from 4 KB to 16 KB. Thereby, a block list
can span several file system blocks. I have implemented and tested
special functionality for this. How is your patch prepared for the
situation when the binfo array spans several blocks? See the sketch
below for what I mean.
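
Here is a rough sketch of the assembly I have in mind, using the field
names from your struct hfs_jrnl_header for illustration (hfsplus_jnl_read()
is a hypothetical helper that reads raw bytes from the journal area, and
the header fields are assumed to be already converted to CPU byte order):

static int hfsplus_read_block_list(struct super_block *sb,
				   struct hfs_jrnl_header *jh,
				   u64 offset, void *buf, u32 list_size)
{
	u32 copied = 0;

	while (copied < list_size) {
		u64 to_wrap = jh->journal_size - offset;
		u32 len = list_size - copied;
		int err;

		if (len > to_wrap)
			len = to_wrap;

		/* hypothetical helper: read raw bytes from the journal */
		err = hfsplus_jnl_read(sb, offset, buf + copied, len);
		if (err)
			return err;

		copied += len;
		offset += len;
		/* the journal buffer is circular: wrap past the header */
		if (offset >= jh->journal_size)
			offset = jh->hdr_size;
	}
	return 0;
}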

> +
>  /* time macros */
>  #define __hfsp_mt2ut(t)		(be32_to_cpu(t) - 2082844800U)
>  #define __hfsp_ut2mt(t)		(cpu_to_be32(t + 2082844800U))
> diff --git a/fs/hfsplus/hfsplus_raw.h b/fs/hfsplus/hfsplus_raw.h
> index 8ffb3a8..73ba1b6 100644
> --- a/fs/hfsplus/hfsplus_raw.h
> +++ b/fs/hfsplus/hfsplus_raw.h
> @@ -105,7 +105,7 @@ struct hfsplus_vh {
>  	__be16 version;
>  	__be32 attributes;
>  	__be32 last_mount_vers;
> -	u32 reserved;
> +	__be32 jrnl_info_block;
>  
>  	__be32 create_date;
>  	__be32 modify_date;
> @@ -145,6 +145,69 @@ struct hfsplus_vh {
>  #define HFSPLUS_VOL_JOURNALED		(1 << 13)
>  #define HFSPLUS_VOL_SOFTLOCK		(1 << 15)
>  
> +/* Journal info */
> +struct hfs_jrnl_info {
> +	__be32 flags;
> +	u32 unused[8];
> +	__be64 journal_offset;	/* Misaligned! relative to volume start */
> +	__be64 journal_size;	/* Misaligned! relative to volume start */
> +} __packed;
> +
> +/* Journal info flags */
> +#define HFSPLUS_JRNL_IN_FS		1
> +#define HFSPLUS_JRNL_ON_OTHER_DEVICE	2
> +#define HFSPLUS_JRNL_NEED_INIT		4
> +
> +/* Journal header */
> +struct hfs_jrnl_header {
> +	u32 magic;
> +	u32 endian;
> +	u64 data_begin;		/* relative to journal start */
> +	u64 data_end;		/* relative to journal start */
> +	u64 journal_size;
> +	u32 list_size;		/* size of hfs_jrnl_list */
> +	u32 checksum;
> +	u32 hdr_size;		/* Header size. Tells us where journal buffer
> +				 * begins. Also works as journal block size. */
> +	u32 last_seq_num;		/* transaction sequence number */
> +};
> +
> +#define HFSPLUS_JRNL_HDR_MAGIC	0x4a4e4c78
> +#define HFSPLUS_JRNL_HDR_ENDIAN	0x12345678
> +
> +/* Journal data run. Describes sequential data stored in the journal. */
> +struct hfs_jrnl_run {
> +	u64 first_block;	/* Block number the run begins at. Note that it
> +				 * is journal blocks, not filesystem blocks.
> +				 * Value ~0 means the run should be skipped. */
> +	u32 byte_len;		/* Data size in bytes. Multiple of
> +				 * journal block size. */
> +	u32 checksum;		/* Data checksum. */
> +};
> +
> +/*
> + * Journal list. Describes one or more journaled data runs.
> + * It is normally a single transaction.
> + */
> +struct hfs_jrnl_list {
> +	u16 reserved;
> +	u16 count;			/* number of runs plus 1 */
> +	u32 length_with_data;		/* length of the list and data */
> +	u32 checksum;			/* checksum of the first 32 bytes */
> +	u32 flags;			/* see possible flags below */
> +	u64 reserved1;			/* unused part of 1st fake run */
> +	u32 reserved2;			/* unused part of 1st fake run */
> +	u32 tr_end_or_seq_num;		/* Before sequence numbers introduction:
> +					   zero value means end of transaction.
> +					   After sequence numbers introduction:
> +					   a non-zero sequence number. */
> +	struct hfs_jrnl_run runs[0];	/* number of elements is count-1 */
> +};
> +

I really dislike the idea of:
(1) naming structures in a really different manner than in Technical Note
TN1150;
(2) naming fields of structures in a really different manner than in
Technical Note TN1150;
(3) distorting a structure compared with Technical Note TN1150 (I mean
struct block_list_header, for example).

In such a way you simply confuse everybody, because your code becomes
hardly understandable and obscure. From another point of view, you added
the last_seq_num field to the journal header, but you don't actually use
this field for comparing its value with the value in the last transaction.
So, the declarations in my patchset are much better.

Moreover, __le or __be types are usually used for declarations of
on-disk layout types. Yes, I understand that an HFS+ journal can be
either little-endian or big-endian. And I think that the Intel
architecture is the more frequent case now. So, it makes sense to use
__le types for the on-disk layout declarations of the HFS+ journal (for
the journal header, block list header and block info).

I really dislike the idea of using u16, u32 and u64 types for on-disk
layout declarations. It is really confusing and provides no benefit.
There are four possible principal situations:
(1) LE system <-> LE journal;
(2) BE system <-> LE journal;
(3) LE system <-> BE journal;
(4) BE system <-> BE journal.
So, you need to use the xx_to_cpu() and cpu_to_xx() family of methods
anyway.
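
For example, which of the four situations applies can be detected from the
raw endian field of the journal header. A sketch (the function name is
hypothetical; HFSPLUS_JRNL_HDR_ENDIAN is the constant from your patch):

static int hfsplus_jnl_check_endian(const struct hfs_jrnl_header *jh,
				    bool *needs_swab)
{
	if (jh->endian == HFSPLUS_JRNL_HDR_ENDIAN) {
		*needs_swab = false;	/* journal matches host byte order */
		return 0;
	}
	if (jh->endian == swab32(HFSPLUS_JRNL_HDR_ENDIAN)) {
		*needs_swab = true;	/* opposite order: swab all fields */
		return 0;
	}
	return -EIO;			/* neither value: corrupted header */
}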

> diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
> index 80875aa..84b832e 100644
> --- a/fs/hfsplus/super.c
> +++ b/fs/hfsplus/super.c
> @@ -403,6 +403,7 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  		goto out_unload_nls;
>  	}
>  
> +reread_vhdr:
>  	/* Grab the volume header */
>  	if (hfsplus_read_wrapper(sb)) {

OK. I have a bug here in my patchset. But it can easily be fixed by
adopting your approach here.

>  		if (!silent)
> @@ -447,6 +448,31 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  	sb->s_op = &hfsplus_sops;
>  	sb->s_maxbytes = MAX_LFS_FILESIZE;
>  
> +	if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
> +	    !(sb->s_flags & MS_RDONLY)) {
> +		bool journal_was_empty;
> +		int replay_error;
> +
> +		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
> +		sb->s_flags |= MS_RDONLY;
> +
> +		replay_error = hfsplus_jrnl_replay(sb,
> +			be32_to_cpu(vhdr->jrnl_info_block), &journal_was_empty);
> +		if (replay_error)
> +			pr_err("journal replay error %d\n", replay_error);
> +
> +		/*
> +		 * Reread volume header after replay, no matter if it succeeded
> +		 * or not. We've just set read-only flag, so no further replay
> +		 * attempts will happen.
> +		 */
> +		kfree(sbi->s_vhdr_buf);
> +		kfree(sbi->s_backup_vhdr_buf);
> +		sbi->s_vhdr_buf = 0;
> +		sbi->s_backup_vhdr_buf = 0;
> +		goto reread_vhdr;
> +	}
> +
>  	if (!(vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_UNMNT))) {
>  		pr_warn("Filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  mounting read-only.\n");
>  		sb->s_flags |= MS_RDONLY;
> @@ -455,10 +481,6 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  	} else if (vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_SOFTLOCK)) {
>  		pr_warn("Filesystem is marked locked, mounting read-only.\n");
>  		sb->s_flags |= MS_RDONLY;
> -	} else if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
> -			!(sb->s_flags & MS_RDONLY)) {
> -		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
> -		sb->s_flags |= MS_RDONLY;

Simply removing these lines is wrong. The mount options need to be
reworked.

Adding journal replay implies the necessity of reworking the mount
options. The "force" mount option was introduced because of the absence
of journal replay functionality, so this reworking is really necessary.
I reworked the mount options in my patchset by deleting the "force"
option and adding the "norecovery" option. Otherwise, leaving the mount
options untouched simply looks incoherent.

[TO SUMMARIZE]

My principal points are:
(1) On-disk layout declarations should be consistent with Technical Note
TN1150;
(2) Journal replay should be transaction-based. I mean that transactions
need to be checked and committed one by one.
(3) The "force" mount option should be replaced with a "norecovery" mount
option.
(4) Journal initialization should be implemented.

As a result, I think that my patchset is a much better starting point
for a joint implementation with you.

I suggest:
(1) fixing the bugs and shortcomings in my patchset by adopting some
solutions from your patch (rereading the superblock, using
get_unaligned_be64() and so on);
(2) keeping journal replay transaction-based. It makes sense to optimize
transaction processing by means of read-ahead of a transaction's blocks
before processing the transaction. Such submission can be made by means
of an optimized version of hfsplus_submit_bio() or a similar analog.
Moreover, it is possible to calculate a block's checksum in the
bio->bi_end_io method for block sizes smaller than 8 KB (512 bytes, 1 KB,
2 KB, 4 KB).

Thanks,
Vyacheslav Dubeyko.

