Re: [PATCH v2] hfsplus: add journal replay

On 13 March 2014 11:20, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:

> > I checked your code and made a bug report.
>
> Your report is related to the issue that my code doesn't re-read the
> volume header after journal replay. This bug can be fixed easily. I've
> already written to you about it.

But you do not re-read the VH on any device, and the error showed up only
on a device with 4K sectors. OK, thanks for telling me anyway.

> So, in such a situation we have some freedom in field naming, because
> the specification doesn't contain any description.

So the spec is not to be questioned, while the reference implementation
leaves some freedom. OK, you've made your point :).

Will you at least not object to renaming block_info::next to
block_info::checksum (i.e. "checksum" instead of "next")? I just want to
know the limits of your esteem for this (dated) spec.

> The "folderCount" name contradicts to kernel coding style. I suppose
> that "subfolders" is much sensible and shorter name as "folder_count".
> Moreover, you've accepted my suggestion.
>
>> > (3) to distort a structure with comparing with Technical Note TN1150 (I
>> > mean struct block_list_header, for example).
>>
>> I'll explain.
>> block_list_header is not a good name because it describes sequential block
>> runs, not individual blocks. To avoid a double meaning of "block" I came up with:
>>   struct block_list_header (no actual block is meant) -> struct hfs_jrnl_list
>>   struct block_info (no actual block is meant) -> struct hfs_jrnl_run
>> I also renamed the fields:
>>   block_info::bnum (here the real block finally appears!) ->
>> hfs_jrnl_run::first_block (the 1st block of the run)
>>   block_info::bsize (size in blocks?) -> hfs_jrnl_run::byte_len (no,
>> it is in bytes! make that clear)
>
> I don't think that you've suggested better names. They are completely
> obscure and not informative for me, personally.

My reasoning did not have an effect. Oh, well.

By the way, did you notice how elegantly I fixed the
block_list_header::binfo problem, where binfo[0] means something
completely different from the other elements? I lifted the first
element's fields into the structure itself:

struct hfs_jrnl_list {
        u16 reserved;
        u16 count;                      /* number of runs plus 1 */
        u32 length_with_data;           /* length of the list and data */
        u32 checksum;                   /* checksum of the first 32 bytes */
        u32 flags;                      /* see possible flags below */
        u64 reserved1;                  /* unused part of 1st fake run */
        u32 reserved2;                  /* unused part of 1st fake run */
        u32 tr_end_or_seq_num;          /* Before sequence numbers introduction:
                                           zero value means end of transaction.
                                           After sequence numbers introduction:
                                           a non-zero sequence number. */
        struct hfs_jrnl_run runs[0];    /* number of elements is count-1 */
};
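
For reference: the checksum above is, if I read TN1150 correctly, computed
over the first 32 bytes of this header with the checksum field itself set to
zero, using the spec's simple shift-xor-add routine. Roughly, and purely as a
sketch (the hfs_jrnl_checksum name here is only for illustration):

static u32 hfs_jrnl_checksum(const u8 *ptr, int len)
{
        u32 chksum = 0;
        int i;

        /* TN1150's calc_checksum(): fold every byte into the running sum */
        for (i = 0; i < len; i++, ptr++)
                chksum = (chksum << 8) ^ (chksum + *ptr);

        return ~chksum;
}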

Isn't that clearer than the original struct? IMO it is the original
structure that really deserves to be called "obscure and not
informative".

> And I will always NACK your patch as long as structures and fields are
> not named as described in the HFS+ specification (Technical Note
> TN1150), because the specification is the common point for everybody
> who tries to understand HFS+ internals and functionality. If your code
> doesn't comply with the specification, then everybody will have trouble
> understanding the code. Simply write your own file system if you always
> want to use your own naming. We have a specification for HFS+.
>
> I prefer to leave mainline untouched rather than to have incorrect or
> obscure code for HFS+.
>
> Of course, it is possible to discuss every name, but my common remark
> is: (1) your names are longer; (2) your names are obscure.
>
>> > First of all, the journal contains a sequence of transactions. Technical Note
>> > TN1150: "A group of related changes is called a transaction. When all of
>> > the changes of a transaction have been written to their normal locations
>> > on disk, that transaction has been committed, and is removed from the
>> > journal."
>>
>> This is about journaling changes, not about replay.
>>
>
> This is related to journal replay as much as to journaling.
>
>> > Secondly, for example, you had a sudden power-off. It means that the
>> > last transaction will not be finished.
>>
>> You have a poor understanding of the HFS+ journal. After a sudden power-off
>> the last transaction is good. The unfinished (not completely written)
>> transaction may be the one that is beyond header->data_end. The header
>> is updated only after the transaction is completely on the disk. This is
>> why the header is 1 sector long - the header update is atomic, so the
>> journal is consistent at any time.
>>
>> If you've found a corrupted transaction, it is not a sudden power-off;
>> rather it is your bug :) or a bug in the FS driver that wrote it. Wise
>> handling is: cry about it using printk, leave the journal unchanged,
>> mount read-only. This is what my code does.
>
> I insist on transaction-based journal replay. I don't think that I have
> a poor understanding of journaling as a whole.
>
> There are many possible reasons why a transaction in the journal can be
> broken:
> (1) You can't guarantee that an initiated flush will succeed. As a result,
> the file system driver can think that the written data is OK while the
> storage actually holds bad data.
> (2) Even if the data was written successfully, it doesn't mean that it
> will be read back successfully, because of various storage troubles. We
> live in a non-ideal world.

All these rare corruption cases fall under "cry about it using printk,
leave the journal unchanged".

OK, what do you think of an implementation that works as follows?

if (a broken transaction is read) {
  1. stop reading journal /* obvious */
  2. replay all good transactions /* they are good and deserve it */
  3. do not update the journal header /* we do not want to deal with it */
  4. mount read-only /* safe behavior for any volume with a journal,
but in this case it is an absolute must */
}

It differs from my current logic only in the added step 2.
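
In C, that flow might look roughly like the sketch below. The helper names
(hfsplus_read_next_transaction, hfsplus_write_collected_data) are made up
here for illustration and are not necessarily what the patch uses:

/*
 * Sketch only: read and verify transactions one by one, replay whatever
 * was verified as good, and stop at the first broken transaction without
 * touching the journal header.
 */
static int hfsplus_jrnl_replay_sketch(struct super_block *sb)
{
        bool broken = false;
        int err;

        for (;;) {
                err = hfsplus_read_next_transaction(sb);        /* hypothetical */
                if (err == -ENODATA)
                        break;          /* reached header->data_end: journal fully read */
                if (err) {
                        broken = true;  /* bad checksum or size: stop reading (step 1) */
                        break;
                }
        }

        /* step 2: replay everything that was read and verified as good */
        err = hfsplus_write_collected_data(sb);                 /* hypothetical */
        if (err)
                return err;

        if (broken) {
                pr_err("hfsplus: broken transaction in journal\n");
                /* step 3: leave the journal header untouched */
                sb->s_flags |= MS_RDONLY;       /* step 4: force read-only */
        }
        return 0;
}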

> So, potentially, any transaction can be broken. Moreover, you can't
> guarantee anything about the last transaction after a sudden power-off.
> As a result, it makes sense to replay as many valid transactions as
> possible before discovering an invalid one.
>
>> > And, finally, if you first check all transactions, it means that
>> > you need to read all of them into memory before the real replay. And, as far
>> > as I can judge, you simply keep all sectors in memory before replay. For
>> > example, suppose your system has 512 MB of RAM and the HFS+ volume has a
>> > fully filled 512 MB journal. How lucky will you be with journal replay
>> > on such a system without swap?
>>
>> Pretty lucky :) because:
>> 1st, I store only the data to be written to sectors, not the whole journal.
>> 2nd, there is a feature of my code which significantly reduces memory
>> usage. Different data in the journal is destined for the same sectors,
>> and I avoid allocating separate memory buffers for it. The awesome
>> <linux/interval_tree_generic.h> data structure helps me do it, see
>> hfsplus_replay_data_add().
>>
>> To defend my approach further, I can measure memory consumption for a
>> 512 MB journal. Would you like me to?
>
> There are many possible situations that you don't take into account:
> (1) The journal can be 1 GB in size, for example, while the system has
> only 512 MB of RAM. And the journal can potentially be even larger.
> (2) You can't predict in what environment the system will work and under
> what memory pressure an HFS+ volume will be mounted. So trying to read
> all transactions into memory can result in a memory allocation failure
> with significant probability. Such an approach is a very dangerous way
> to go, from my point of view.

A threshold, as I already suggested, seems to be good protection.
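
For example, something along these lines called from the loop that collects
replay data (the limit value, the helper name and the replay_bytes_collected
counter are invented here purely for illustration):

/* Example cap on memory spent on collected replay data; value is arbitrary. */
#define HFSPLUS_REPLAY_MEM_LIMIT        (64 * 1024 * 1024)

/* Hypothetical check performed while collecting transaction data for replay. */
static int hfsplus_check_replay_mem(size_t replay_bytes_collected)
{
        if (replay_bytes_collected > HFSPLUS_REPLAY_MEM_LIMIT) {
                pr_warn("hfsplus: journal too big to replay in memory\n");
                return -ENOMEM;         /* caller gives up on replay, mounts read-only */
        }
        return 0;
}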