Re: [PATCH v2] hfsplus: add journal replay

Sergei Antonov <saproj@xxxxxxxxx> · Sun, 9 Mar 2014 23:50:09 +0100

On 9 March 2014 17:36, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:
> Hi Sergei,
>
> On Mon, 2014-02-24 at 11:05 +0100, Sergei Antonov wrote:
>
> We have really different vision of implementation for this functionality
> and I have many remarks about your implementation. So, I suggest to
> discuss principal points, firstly. We need to achieve understanding of
> each other and, then, it makes sense to implement joint solution.
> Currently, both of our implementations have shortcomings, from my point
> of view.
>
>> The replay code checks the integrity of the whole journal before actually
>> replaying sectors. If any error (e.g. checksum mismatch, invalid/misaligned
>> offsets) is detected, the volume will remain unchanged and the journal will not
>> be cleared.
>
> I think that such approach is overkill. And, moreover, this is a wrong
> approach.
>
> First of all, journal contains sequence of transactions. Technical Note
> TN1150: "A group of related changes is called a transaction. When all of
> the changes of a transaction have been written to their normal locations
> on disk, that transaction has been committed, and is removed from the
> journal.

This is about journaling changes, not about replay.

> The journal may contain several transactions. Copying changes
> from all transactions to their normal locations on disk is called
> replaying the journal". So, transactions can be committed one after
> another and it leaves HFS+ volume in consistent state after each commit.

Of course! They can also be committed in groups. As long as you follow
the rule: first replay a transaction or a bunch of them then update
journal header to remove them.

> It doesn't make sense to check all transactions preliminary and only
> then to commit.

One paragraph earlier it was wrong, now it just doesn't make sense :).

With regard to speed it makes sense. It's less IO. And also a more
sequential IO (I read journal sequentially without interruptions, then
replay all collected sectors sorted by number.)

> Secondly, for example, you had sudden power-off. It means that last
> transaction will be not finished.

You have a poor understanding of HFS+ journal. After sudden power-off
the last transaction is good. The unfinished (not completely written)
transaction may be the one that is beyond header->data_end. The header
is updated only after it is completely on the disk. This is why header
is 1 sector long - header update is atomic, and the journal is
consistent any time.

If you've found a corrupted transactions it is not a sudden power-off,
rather it is your bug :) or a bug in FS driver that wrote them. Wise
handling is: cry about it using printk, leave the journal unchanged,
mount for read-only. This is what my code does.

> And, finally, if you firstly check all transactions then it means that
> you need to read all of its in memory before real replaying. And, as far
> as I can judge, you simply keep all sectors in memory before replay. For
> example, your system has 512 MB of RAM and HFS+ volume has fully filled
> journal with 512 MB in size. How lucky will you with journal replay on
> such system without a swap?

Pretty lucky :) because:
1st I store data to write to sectors, not all journal.
2nd there is a feature of my code which significantly minimizes memory
usage. Different data in journal is destined to same sectors, and I
avoid allocating separate memory buffers for it. An awesome
<linux/interval_tree_generic.h> data structure helps me do it, see
hfsplus_replay_data_add().

To defend my approach more, I can make measurements of memory
consumption for a 512 MB journal. Would you like me to?

> So, I use another approach in my code. I check transaction and then to
> replay it. And I think that transaction based journal replay is
> reasonable and right way.
>
>> Tested with:
>>  drives with 512 and 4K sector sizes
>
> I've tested my code with 512 bytes, 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32
> KB, 64 KB block sizes. And I think that such testing is really
> necessary. I've found and fix several bugs namely for block sizes are
> greater than 4 KB.

But I tested with 4K iPad Nano. My code works right, your code caused
file system errors after replay. I sent fsck_hfs output and relevant
lines from system log in your thread titled "[PATCH v4 00/15] hfsplus:
introduce journal replay functionality".

> Your code is really complex and I am afraid that 8 -
> 64 KB block sizes will reveal some bugs.

Well, check it. Do not just spread FUD.
I checked your code and made a bugreport.
My journal.c has no known bugs so far. Find one.

>>  default (8 MB) and custom (1 MB, 500 MB) journal sizes
>>  big-endian (PowerPC) and little-endian (Intel) journal byte sexes
>
> My last version of the patchset is ready for as little-endian as
> big-endian journal. So, I can't see difference in our patches here.
>
>> Advantages over a previously presented
>>    "[PATCH v4 00/15] hfsplus: introduce journal replay functionality"
>>    by a different author:
>
> My name is not "different author". My name is Vyacheslav Dubeyko. If you
> want to insult me then you had achieved the goal.

No flame, please :).

>> * 12 times faster mount with replay (0.5s versus 6s on my test volume)
>
> OK. Let's talk about it.
>
> I had thought during implementation about more optimal bio submission.
> But I simply had chosen hfsplus_submit_bio() method. Because journal
> replay is not so frequent operation that it takes place only during
> mount operation. Usually, the journal contains several transactions
> after clean umount (or it can be simply empty). It can contain many
> transaction in the case of sudden power-off, for example. But such
> situation is not so frequent and it is more important for a end-user to
> have reliable journal replay. I suppose that every user is ready to wait
> a spare second (or even a minute) for achieving consistent state of a
> volume after mount. Because journal replay has goal to guarantee file
> system consistency but not performance.
>
> OK. I believe also that performance is important. But you had suggested
> really big and complicated implementation

Mine is 5 files changed, 1009 insertions(+), 6 deletions(-).
Yours is 9 files changed, 1559 insertions(+), 39 deletions(-).

> that, moreover, it is really
> hard to understand at one glance. I can suggest much, much and much
> simpler solution. First of all, transaction based journal replay is my
> principal position. Secondly, it is possible to submit reading of all
> transaction's blocks before starting to process transaction. Such
> submission can be made by means of optimized version of
> hfsplus_submit_bio() or likewise analog. Thirdly, moreover, it is
> possible to calculate block's checksum in bio->bi_end_io method for the
> case of block sizes lesser than 8 KB (512, 1 KB, 2 KB, 4 KB). Such
> method is called at the end of block processing by block layer.
>
>> * supports 4K sector devices
>
> Could you point out where do you make special efforts for it? And why do
> you need to implement something in this direction?

There is no special place in code. The code is sector-size-agnostic
and I tested with a 4K-sector device.

> I tried to find likewise implementation in your code but I failed. I
> suspect that I misunderstand something.
>
>> * no redundant stuff like journal initialization
>
> My patchset is first step in implementation of full-featured journaling
> support functionality for HFS+ driver. So, to implement journal
> initialization functionality with journal replay functionality is
> reasonable action. Because these two features are related to each
> others, from my viewpoint.
>
> Of course, it is possible not initialize journal during mount on current
> step of implementation. But to implement the journal initialization
> functionality on the step of implementation of journal replay
> functionality is really necessary.

It is not necessary for journal replay. Quite the opposite - it has
nothing to do with it.

> But, I think that it needs to initialize journal during mount of HFS+
> volume (for the case of kJIJournalNeedInitMask flag). Because it will
> coincide with Technical Note TN1150.

I answered to this your argument before. Copy-pasting:
=======
Which does not mean initialization is necessary. It is, of course,
necessary for drivers that write to journal because they need the
journal to write transactions to. But in this driver you initialize it
and ... what? Nothing. It will not be used.
=======

> And if we have bugs in this
> functionality then it needs to fix it. Anyway, if we want to have
> full-featured journaling support functionality for HFS+ driver then we
> need to implement the journal initialization functionality.
>
>> * no fanatical use of macros with parameters
>
> I disagree that I use macros with fanaticism. I agree that it is really
> bad taste to use macros in C++. But it is common technique for C
> language. And this technique is widely used in Linux kernel. So, it is
> not my invention to use macros.
>
> Moreover, macros in my pathset are simple code patterns that it can be
> seen as simple abbreviation. For example:
>
> #define JIB_BLOCK(sb) \
>         (be32_to_cpu(HFSPLUS_SB(sb)->s_vhdr->journal_info_block))
>
> I am unable to see anything bad in such macro. It shortens and to make
> more sensible a code in the place of using.
>
> But why do you use with fanaticism such macros as hfs_dbg(),
> be32_to_cpu() and so on? Stop using macros in your patch if you appeal
> about macro abstinence.

I do not inroduce macros with params.
An inline function would do the job instead of JIB_BLOCK you mention.

>> * shorter, easier to review
>
> Really? :)
>
> It is really hard to review your patch for me. Because it is big,
> complex and it doesn't be divided on logical parts.

My patch is not quite divisible because of 1015 lines 909 are the new
file journal.c and 65 lines are new on-disk structures in .h. Other 41
lines are minimal to the point of being atomic :).
On the site I replyed above.

> Your patch contains 1015 lines of code at whole.
>
> My patch contains 1598 lines of code at whole. If I exclude: (1) several
> big comments - 99 lines, (2) journal initialization functionality - ~192
> lines, (3) "norecovery" mount option functionality - ~134 lines, then, I
> will have about 1173 lines of code in my patchset.
>
> So, 1015 lines of code vs. 1173 lines. It is not so big difference. And
> if you adopt my comments ~100 lines (I believe that my comments are
> really reasonable) and "norecovery" mount option ~100 lines of code then
> the difference will vanish completely.
>
>> @@ -166,6 +167,10 @@ struct hfsplus_sb_info {
>>       u32 total_blocks;
>>       u32 data_clump_blocks, rsrc_clump_blocks;
>>
>> +     /* immutable data on the placement of the journal (if there is one) */
>> +     u64 journal_offset;     /* relative to volume start */
>> +     u64 journal_size;       /* relative to volume start */
>> +
>
> My patchset is the first step in implementation of full featured journal
> support in HFS+ driver. So, I had suggested in my patchset to store in
> superblock info such structure is related to journal info:
>
> struct hfsplus_journal {
>         struct mutex jnl_lock;
>
>         /* Journal info block specific */
>         void *jib_buf;
>         struct hfsplus_journal_info_block *jib;
>
>         /* Journal header specific */
>         void *jh_buf;
>         struct hfsplus_journal_header *jh;
>         u8 endianness;
>
>         /* TODO: Pointer of JBD */
>
>         struct super_block *sbp;
> };
>
> I suppose that such structure is reasonable and it contains all
> necessary starting details for journaling on HFS+ volume.
>
> Otherwise, if we are talking about journal replay only, then, I don't
> see any necessity in keeping any info about journal. In such case, it
> doesn't makes sense to add in superblock info structure something is
> related to journal details. Journal replaying is all-sufficient
> operation.

It is an unimportant issue IMO. I can even get rid of all changes to
struct hfsplus_sb_info in v3 version of my patch. A big special
structure may be introduced later, when and if journaling modification
code is developed.

>> +/* journal.c */
>> +int hfsplus_jrnl_replay(struct super_block *, u32, bool *);
>
> I really dislike "jrnl" abbreviation. Using "journal" instead of "jrnl"
> is really reasonable and sensible way. I don't think that it makes sense
> to economize on 3 symbols.

Oh, that isn't worth much discussion. You'd like a longer function prefix? OK.

> I tried to understand how you process binfo array. But it is really hard
> to understand the whole patch. So, I simply ask. One block list (binfo
> array) can have size from 4KB till 16KB. Thereby, block list can be
> placed in several file system blocks. I have implemented and have tested
> special functionality for it. How your patch is ready for situation when
> binfo array will be placed in several blocks?

If these are sequential blocks on disk, they are read through bio. Bio
can read several blocks at once.
If the structure is split into two parts, it is read in two parts.
See hfsplus_jrnl_queue_read() for details. The execution gets there
from hfsplus_jrnl_read_lists() -> hfsplus_jrnl_read_buf().

>> +
>>  /* time macros */
>>  #define __hfsp_mt2ut(t)              (be32_to_cpu(t) - 2082844800U)
>>  #define __hfsp_ut2mt(t)              (cpu_to_be32(t + 2082844800U))
>> diff --git a/fs/hfsplus/hfsplus_raw.h b/fs/hfsplus/hfsplus_raw.h
>> index 8ffb3a8..73ba1b6 100644
>> --- a/fs/hfsplus/hfsplus_raw.h
>> +++ b/fs/hfsplus/hfsplus_raw.h
>> @@ -105,7 +105,7 @@ struct hfsplus_vh {
>>       __be16 version;
>>       __be32 attributes;
>>       __be32 last_mount_vers;
>> -     u32 reserved;
>> +     __be32 jrnl_info_block;
>>
>>       __be32 create_date;
>>       __be32 modify_date;
>> @@ -145,6 +145,69 @@ struct hfsplus_vh {
>>  #define HFSPLUS_VOL_JOURNALED                (1 << 13)
>>  #define HFSPLUS_VOL_SOFTLOCK         (1 << 15)
>>
>> +/* Journal info */
>> +struct hfs_jrnl_info {
>> +     __be32 flags;
>> +     u32 unused[8];
>> +     __be64 journal_offset;  /* Misaligned! relative to volume start */
>> +     __be64 journal_size;    /* Misaligned! relative to volume start */
>> +} __packed;
>> +
>> +/* Journal info flags */
>> +#define HFSPLUS_JRNL_IN_FS           1
>> +#define HFSPLUS_JRNL_ON_OTHER_DEVICE 2
>> +#define HFSPLUS_JRNL_NEED_INIT               4
>> +
>> +/* Journal header */
>> +struct hfs_jrnl_header {
>> +     u32 magic;
>> +     u32 endian;
>> +     u64 data_begin;         /* relative to journal start */
>> +     u64 data_end;           /* relative to journal start */
>> +     u64 journal_size;
>> +     u32 list_size;          /* size of hfs_jrnl_list */
>> +     u32 checksum;
>> +     u32 hdr_size;           /* Header size. Tells us where journal buffer
>> +                              * begins. Also works as journal block size. */
>> +     u32 last_seq_num;               /* transaction sequence number */
>> +};
>> +
>> +#define HFSPLUS_JRNL_HDR_MAGIC       0x4a4e4c78
>> +#define HFSPLUS_JRNL_HDR_ENDIAN      0x12345678
>> +
>> +/* Journal data run. Describes sequential data stored in the journal. */
>> +struct hfs_jrnl_run {
>> +     u64 first_block;        /* Block number the run begins at. Note that it
>> +                              * is journal blocks, not filesystem blocks.
>> +                              * Value ~0 means the run should be skipped. */
>> +     u32 byte_len;           /* Data size in bytes. Multiple of
>> +                              * journal block size. */
>> +     u32 checksum;           /* Data checksum. */
>> +};
>> +
>> +/*
>> + * Journal list. Describes one or more journaled data runs.
>> + * It is normally a single transaction.
>> + */
>> +struct hfs_jrnl_list {
>> +     u16 reserved;
>> +     u16 count;                      /* number of runs plus 1 */
>> +     u32 length_with_data;           /* length of the list and data */
>> +     u32 checksum;                   /* checksum of the first 32 bytes */
>> +     u32 flags;                      /* see possible flags below */
>> +     u64 reserved1;                  /* unused part of 1st fake run */
>> +     u32 reserved2;                  /* unused part of 1st fake run */
>> +     u32 tr_end_or_seq_num;          /* Before sequence numbers introduction:
>> +                                        zero value means end of transaction.
>> +                                        After sequence numbers introduction:
>> +                                        a non-zero sequence number. */
>> +     struct hfs_jrnl_run runs[0];    /* number of elements is count-1 */
>> +};
>> +
>
> I really dislike the idea:
> (1) to name structures in really different manner as in Technical Note
> TN1150;

See explanation for (3) below.

> (2) to name fields of structures in really different manner as in
> Technical Note TN1150;

You contradict yourself sometimes :). I remember when you yourself
suggested changing Apple's name of a structure field. It was when we
discussed another patch:
========
What about shorter name as "subfolders" for both fields in struct
hfsplus_inode_info and in struct hfsplus_cat_folder? I suppose it will
be more informative and correct.
========
Apple's name was "folderCount", my name was after it: "folder_count",
you suggested a different, but indeed better, "subfolders". In my
patch I also found several better (IMO) names.

> (3) to distort a structure with comparing with Technical Note TN1150 (I
> mean struct block_list_header, for example).

I'll explain.
block_list_header is not good because it describes sequential block
runs, not blocks. To avoid a double meaning of "block" I came up with:
  struct block_list_header (not really a block is meant) -> struct hfs_jrnl_list
  struct block_info (not really a block is meant) -> struct hfs_jrnl_run
Also changed fields:
  block_info::bnum (now comes the real block!) ->
hfs_jrnl_run::first_block (the 1st block of the run)
  block_info::bsize (size in blocks?) -> hfs_jrnl_run::byte_len (no,
it is in bytes! make it clear)

> In such way you simply confuse everybody. Because your code becomes
> hardly understandable and obscure. From another point of view, you added
> last_seq_num field in journal header. But, actually, you don't use this
> field for comparing this value with value in a last transaction.

It is used for tracing:
+       hfs_dbg(JOURNAL,
+               "last seq num in journal data 0x%x, last seq num in
header 0x%x\n",
+               rd->last_list_seq_num, rd->header->last_seq_num);

What do you use that comparison for?
I guess, comparison is only needed for extra transactions replay. Do
you know what extra transactions in HFS+ journal are? They are not in
the spec, they are in the code. Search for "extra transactions" in
Apple's vfs_journal.c. Weird feature, I've newer seen it on a volume.

> So,
> declarations in  my patchset are much better.
>
> Moreover, usually, it is used __le or __be types for declarations of
> on-disk layout's types. Yes, I understand that HFS+ journal can be as
> little-endian as big-endian. And I think that Intel architecture is more
> frequent case now. So, it makes sense to use __le types for on-disk
> layout HFS+ journal declarations (in the case journal header, block list
> header and block info).
>
> I really dislike idea to use u16, u32 and u64 types for on-disk layout
> declarations. It's really confusing and it doesn't provide any profit.
> There are four possible principal situations:
> (1) LE system <-> LE journal;
> (2) BE system <-> LE journal;
> (3) LE system <-> BE journal;
> (4) BE system <-> BE journal.
> So, you need to use xx_to_cpu() and cpu_to_xx() family of methods,
> anyway.

I need? No. For cases 2 and 3 I use swab32() and swap them in-place,
this is why u32/u16 are reasonable. In cases 1 and 4 the values are
already in cpu order, and u32/u16 are absolutely reasonable.

>> diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
>> index 80875aa..84b832e 100644
>> --- a/fs/hfsplus/super.c
>> +++ b/fs/hfsplus/super.c
>> @@ -403,6 +403,7 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>>               goto out_unload_nls;
>>       }
>>
>> +reread_vhdr:
>>       /* Grab the volume header */
>>       if (hfsplus_read_wrapper(sb)) {
>
> OK. I have the bug here in my patchset. But it can be easy to fix by
> adoption your approach here.
>
>>               if (!silent)
>> @@ -447,6 +448,31 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>>       sb->s_op = &hfsplus_sops;
>>       sb->s_maxbytes = MAX_LFS_FILESIZE;
>>
>> +     if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
>> +         !(sb->s_flags & MS_RDONLY)) {
>> +             bool journal_was_empty;
>> +             int replay_error;
>> +
>> +             pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
>> +             sb->s_flags |= MS_RDONLY;
>> +
>> +             replay_error = hfsplus_jrnl_replay(sb,
>> +                     be32_to_cpu(vhdr->jrnl_info_block), &journal_was_empty);
>> +             if (replay_error)
>> +                     pr_err("journal replay error %d\n", replay_error);
>> +
>> +             /*
>> +              * Reread volume header after replay, no matter if it succeeded
>> +              * or not. We've just set read-only flag, so no further replay
>> +              * attempts will happen.
>> +              */
>> +             kfree(sbi->s_vhdr_buf);
>> +             kfree(sbi->s_backup_vhdr_buf);
>> +             sbi->s_vhdr_buf = 0;
>> +             sbi->s_backup_vhdr_buf = 0;
>> +             goto reread_vhdr;
>> +     }
>> +
>>       if (!(vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_UNMNT))) {
>>               pr_warn("Filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  mounting read-only.\n");
>>               sb->s_flags |= MS_RDONLY;
>> @@ -455,10 +481,6 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>>       } else if (vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_SOFTLOCK)) {
>>               pr_warn("Filesystem is marked locked, mounting read-only.\n");
>>               sb->s_flags |= MS_RDONLY;
>> -     } else if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
>> -                     !(sb->s_flags & MS_RDONLY)) {
>> -             pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
>> -             sb->s_flags |= MS_RDONLY;
>
> Simple removing this lines is wrong. It needs to rework mount options.
>
> Adding of journal replay means necessity to rework mount options. The
> "force" mount option was suggested because absence of journal replay
> functionality. So, this reworking is really necessary. And I had
> reworked mount options in my patchset by means of deleting of "force"
> and adding of "norecovery" mount options. Otherwise, leaving mount
> options untouched looks like a schizophrenia.

I'd like to make our common patch and I'd totally trust you to do
changes to mount options. You are good at it.

> [TO RESUME]
>
> I have such principal points:
> (1) On-disk layout declarations should live in peace with Technical Note
> TN1150;

Well, I gave my rationale. The discussion can be continued. I am not
against changes to my cool names.

> (2) Journal replay should be transaction based. I mean that it needs to
> check transaction and to commit one by one.

I'd agree on commit at some threshold. If collected data exceeds 1
megabyte, then write transactions and free some memory.

> (3) "Force" mount option should be changed on "norecovery" mount option.

As you wish.

> (4) Journal initialization should be implemented.

Unneeded and it makes a patch heavier.

> As a result, I think that my patchset is much better as starting point
> for joint implementation with you.
>
> I suggest:
> (1) To fix in my patchset bugs and shortcomings by means of adoption of
> some solutions from your patch (reread superblock, using
> get_unaligned_be64() and so on);
> (2) Journal replay should be transaction based. So, it makes sense to
> implement optimization of transaction processing by means of read-ahead
> of transaction's blocks before transaction processing. Such submission
> can be made by means of optimized version of hfsplus_submit_bio() or
> likewise analog. Moreover, it is possible to calculate block's checksum
> in bio->bi_end_io method for the case of block sizes lesser than 8 KB
> (512, 1 KB, 2 KB, 4 KB).

It is possible to calculate checksums, I already do it in my code. But
why bio->bi_end_io? I do not get the idea.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html