Hi Sergei,

On Mon, 2014-02-24 at 11:05 +0100, Sergei Antonov wrote:

We have really different visions of how to implement this functionality, and I have many remarks about your implementation. So I suggest we discuss the principal points first. We need to reach a mutual understanding; only then does it make sense to implement a joint solution. From my point of view, both of our implementations currently have shortcomings.

> The replay code checks the integrity of the whole journal before actually
> replaying sectors. If any error (e.g. checksum mismatch, invalid/misaligned
> offsets) is detected, the volume will remain unchanged and the journal will not
> be cleared.

I think such an approach is overkill and, moreover, a wrong approach. First of all, the journal contains a sequence of transactions. Technical Note TN1150 says: "A group of related changes is called a transaction. When all of the changes of a transaction have been written to their normal locations on disk, that transaction has been committed, and is removed from the journal. The journal may contain several transactions. Copying changes from all transactions to their normal locations on disk is called replaying the journal." So transactions can be committed one after another, and each commit leaves the HFS+ volume in a consistent state. It doesn't make sense to check all transactions up front and only then commit them.

Secondly, suppose there was a sudden power-off. The last transaction will then be unfinished; in other words, simply broken or invalid. With your approach you will reject all transactions in the journal. But the goal of the journal is to commit all the valid transactions that precede the invalid one; otherwise the file system remains in a corrupted state despite the journal replay.

And finally, if you check all transactions first, then you need to read all of them into memory before the real replay. As far as I can judge, you simply keep all sectors in memory before replaying.
For example, suppose your system has 512 MB of RAM and the HFS+ volume has a completely full journal 512 MB in size. How lucky will you be with journal replay on such a system without swap?

So I use another approach in my code: check a transaction, then replay it. I think that transaction-based journal replay is the reasonable and right way.

> Tested with:
> drives with 512 and 4K sector sizes

I've tested my code with 512 byte, 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB and 64 KB block sizes, and I think such testing is really necessary: I found and fixed several bugs specifically for block sizes greater than 4 KB. Your code is really complex, and I am afraid that 8 - 64 KB block sizes will reveal some bugs.

> default (8 MB) and custom (1 MB, 500 MB) journal sizes
> big-endian (PowerPC) and little-endian (Intel) journal byte sexes

The last version of my patchset handles both little-endian and big-endian journals, so I can't see any difference between our patches here.

> Advantages over a previously presented
> "[PATCH v4 00/15] hfsplus: introduce journal replay functionality"
> by a different author:

My name is not "different author". My name is Vyacheslav Dubeyko. If you wanted to insult me, you have achieved the goal.

> * 12 times faster mount with replay (0.5s versus 6s on my test volume)

OK, let's talk about it. During implementation I thought about more optimal bio submission, but I simply chose the hfsplus_submit_bio() method, because journal replay is not a frequent operation: it takes place only during mount. Usually the journal contains only a few transactions after a clean umount (or it can simply be empty). It can contain many transactions after a sudden power-off, for example, but such a situation is not frequent, and reliable journal replay is more important to an end user. I suppose every user is ready to wait a spare second (or even a minute) to get a consistent volume after mount.
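To make the transaction-at-a-time scheme concrete, here is a minimal userspace sketch. The checksum is the rolling byte checksum that TN1150 documents for the journal; everything else (the mock transaction structure and the helper names) is hypothetical and only stands in for the real on-disk parsing:

```c
#include <stdint.h>
#include <stddef.h>

/* Journal checksum as documented in TN1150 (calc_checksum): a rolling
 * byte checksum, bitwise-inverted at the end. */
static uint32_t hfs_calc_checksum(const unsigned char *ptr, size_t len)
{
	uint32_t cksum = 0;
	size_t i;

	for (i = 0; i < len; i++, ptr++)
		cksum = (cksum << 8) ^ (cksum + *ptr);
	return ~cksum;
}

/* Hypothetical in-memory model of one transaction, for illustration only. */
struct mock_transaction {
	const unsigned char *data;
	size_t len;
	uint32_t stored_checksum;	/* as recorded in the block list */
};

/* Transaction-at-a-time replay: verify and commit transactions one by
 * one, stopping at the first invalid one.  Every committed transaction
 * leaves the volume consistent, so valid transactions that precede a
 * broken tail (e.g. after power-off) are not thrown away.  Returns the
 * number of transactions replayed. */
static size_t replay_transactions(const struct mock_transaction *tr, size_t count)
{
	size_t done = 0, i;

	for (i = 0; i < count; i++) {
		if (hfs_calc_checksum(tr[i].data, tr[i].len) != tr[i].stored_checksum)
			break;	/* broken tail: stop here */
		/* ... write the transaction's runs to their real locations,
		 * then advance the journal's start pointer past it ... */
		done++;
	}
	return done;
}
```

The point of the sketch is the early break: a power-off that corrupts the tail costs only the unfinished transaction, not the valid ones before it, and memory use stays bounded by one transaction rather than the whole journal.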
The goal of journal replay is to guarantee file system consistency, not performance. OK, I agree that performance is important too. But you have suggested a really big and complicated implementation that, moreover, is really hard to understand at first glance. I can suggest a much, much simpler solution. First of all, transaction-based journal replay is my principal position. Secondly, it is possible to submit reads of all of a transaction's blocks before starting to process that transaction; such submission can be made by means of an optimized version of hfsplus_submit_bio() or a similar analog. Thirdly, it is even possible to calculate a block's checksum in the bio->bi_end_io method for block sizes smaller than 8 KB (512 bytes, 1 KB, 2 KB, 4 KB); this method is called by the block layer when it finishes processing a block.

> * supports 4K sector devices

Could you point out where you make special efforts for this? And why do you need to implement anything in this direction? I tried to find such an implementation in your code, but I failed. I suspect that I am misunderstanding something.

> * no redundant stuff like journal initialization

My patchset is the first step in implementing full-featured journaling support for the HFS+ driver, so implementing journal initialization together with journal replay is a reasonable action: these two features are related to each other, from my viewpoint. Of course, at the current step of the implementation it is possible not to initialize the journal during mount. But I think the journal does need to be initialized during mount of an HFS+ volume (for the case of the kJIJournalNeedInitMask flag), because that coincides with Technical Note TN1150. And if we have bugs in this functionality, then they need to be fixed.
Anyway, if we want full-featured journaling support in the HFS+ driver, then we need to implement journal initialization.

> * no fanatical use of macros with parameters

I disagree that I use macros with fanaticism. I agree that macros are really bad taste in C++, but they are a common technique in C, widely used in the Linux kernel; using macros is not my invention. Moreover, the macros in my patchset are simple code patterns that can be seen as simple abbreviations. For example:

#define JIB_BLOCK(sb) \
	(be32_to_cpu(HFSPLUS_SB(sb)->s_vhdr->journal_info_block))

I am unable to see anything bad in such a macro. It shortens the code at the place of use and makes it more readable. But why do you use, with fanaticism, such macros as hfs_dbg(), be32_to_cpu() and so on? Stop using macros in your patch if you are calling for macro abstinence.

> * shorter, easier to review

Really? :) Your patch is really hard for me to review, because it is big, complex, and not divided into logical parts. Your patch contains 1015 lines of code in total; my patchset contains 1598. If I exclude: (1) several big comments - 99 lines; (2) the journal initialization functionality - ~192 lines; (3) the "norecovery" mount option functionality - ~134 lines; then I have about 1173 lines of code in my patchset. So, 1015 lines of code vs. 1173 is not such a big difference. And if you adopt my comments (~100 lines; I believe they are really reasonable) and the "norecovery" mount option (~100 lines of code), then the difference will vanish completely.
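Returning to the macro point above, the abbreviation effect is easy to see in a standalone userspace model (all names here are mocks I invented, not the real kernel types; the mock be32 decode reads bytes explicitly so it works on any host):

```c
#include <stdint.h>
#include <string.h>

/* Mock of be32_to_cpu(): decode a big-endian 32-bit value byte by byte. */
static uint32_t mock_be32_to_cpu(uint32_t be)
{
	const unsigned char *p = (const unsigned char *)&be;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Mock stand-ins for the real HFS+ superblock types. */
struct mock_vhdr { uint32_t journal_info_block; /* big-endian on disk */ };
struct mock_sb_info { struct mock_vhdr *s_vhdr; };

/* The macro pattern under discussion: one short name instead of a long
 * access-and-convert chain repeated at every call site. */
#define MOCK_JIB_BLOCK(sbi) \
	(mock_be32_to_cpu((sbi)->s_vhdr->journal_info_block))
```

A call site then reads `block = MOCK_JIB_BLOCK(sbi);` instead of spelling out the full chain each time.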
> @@ -166,6 +167,10 @@ struct hfsplus_sb_info {
>  	u32 total_blocks;
>  	u32 data_clump_blocks, rsrc_clump_blocks;
>
> +	/* immutable data on the placement of the journal (if there is one) */
> +	u64 journal_offset;	/* relative to volume start */
> +	u64 journal_size;	/* relative to volume start */
> +

My patchset is the first step in implementing full-featured journal support in the HFS+ driver. So in my patchset I suggested storing the following journal-related structure in the superblock info:

struct hfsplus_journal {
	struct mutex jnl_lock;

	/* Journal info block specific */
	void *jib_buf;
	struct hfsplus_journal_info_block *jib;

	/* Journal header specific */
	void *jh_buf;
	struct hfsplus_journal_header *jh;

	u8 endianness;

	/* TODO: Pointer of JBD */

	struct super_block *sbp;
};

I suppose such a structure is reasonable, and it contains all the necessary starting details for journaling on an HFS+ volume. Otherwise, if we are talking about journal replay only, then I don't see any necessity in keeping info about the journal at all. In that case it doesn't make sense to add anything related to journal details to the superblock info structure: journal replay is a self-sufficient operation.

> +/* journal.c */
> +int hfsplus_jrnl_replay(struct super_block *, u32, bool *);

I really dislike the "jrnl" abbreviation. Using "journal" instead of "jrnl" is really the reasonable and sensible way; I don't think it makes sense to economize on 3 symbols.

I tried to understand how you process the binfo array, but it is really hard to understand the whole patch, so I will simply ask. One block list (binfo array) can have a size from 4 KB up to 16 KB, so a block list can span several file system blocks. I have implemented and tested special functionality for this case. How is your patch prepared for the situation when a binfo array spans several blocks?
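To illustrate the multi-block case numerically: with the minimum 512-byte block size, a 16 KB block list occupies 32 file system blocks, and since the journal buffer is circular, a read may also wrap around its end. A sketch under my own naming (these helpers are illustrative, not from either patchset):

```c
#include <stdint.h>

/* How many blocks one block list occupies for a given block size
 * (the same rounding the kernel's DIV_ROUND_UP macro performs). */
static uint32_t blocks_per_list(uint32_t list_size, uint32_t block_size)
{
	return (list_size + block_size - 1) / block_size;
}

/* The journal buffer is circular: translate an offset that ran past the
 * end of the journal back to the start of the buffer, which begins
 * right after the journal header (hdr_size bytes in).  Semantics here
 * are my reading of TN1150's circular-buffer description. */
static uint64_t journal_wrap(uint64_t offset, uint64_t hdr_size,
			     uint64_t journal_size)
{
	if (offset >= journal_size)
		offset = hdr_size + (offset - journal_size);
	return offset;
}
```

So a replay loop that reads one block list must be prepared to issue several block reads, possibly split across the wrap point, before the binfo array is contiguous in memory.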
> +
>  /* time macros */
>  #define __hfsp_mt2ut(t)		(be32_to_cpu(t) - 2082844800U)
>  #define __hfsp_ut2mt(t)		(cpu_to_be32(t + 2082844800U))
> diff --git a/fs/hfsplus/hfsplus_raw.h b/fs/hfsplus/hfsplus_raw.h
> index 8ffb3a8..73ba1b6 100644
> --- a/fs/hfsplus/hfsplus_raw.h
> +++ b/fs/hfsplus/hfsplus_raw.h
> @@ -105,7 +105,7 @@ struct hfsplus_vh {
>  	__be16 version;
>  	__be32 attributes;
>  	__be32 last_mount_vers;
> -	u32 reserved;
> +	__be32 jrnl_info_block;
>
>  	__be32 create_date;
>  	__be32 modify_date;
> @@ -145,6 +145,69 @@ struct hfsplus_vh {
>  #define HFSPLUS_VOL_JOURNALED	(1 << 13)
>  #define HFSPLUS_VOL_SOFTLOCK	(1 << 15)
>
> +/* Journal info */
> +struct hfs_jrnl_info {
> +	__be32 flags;
> +	u32 unused[8];
> +	__be64 journal_offset;	/* Misaligned! relative to volume start */
> +	__be64 journal_size;	/* Misaligned! relative to volume start */
> +} __packed;
> +
> +/* Journal info flags */
> +#define HFSPLUS_JRNL_IN_FS		1
> +#define HFSPLUS_JRNL_ON_OTHER_DEVICE	2
> +#define HFSPLUS_JRNL_NEED_INIT		4
> +
> +/* Journal header */
> +struct hfs_jrnl_header {
> +	u32 magic;
> +	u32 endian;
> +	u64 data_begin;		/* relative to journal start */
> +	u64 data_end;		/* relative to journal start */
> +	u64 journal_size;
> +	u32 list_size;		/* size of hfs_jrnl_list */
> +	u32 checksum;
> +	u32 hdr_size;		/* Header size. Tells us where journal buffer
> +				 * begins. Also works as journal block size. */
> +	u32 last_seq_num;	/* transaction sequence number */
> +};
> +
> +#define HFSPLUS_JRNL_HDR_MAGIC	0x4a4e4c78
> +#define HFSPLUS_JRNL_HDR_ENDIAN	0x12345678
> +
> +/* Journal data run. Describes sequential data stored in the journal. */
> +struct hfs_jrnl_run {
> +	u64 first_block;	/* Block number the run begins at. Note that it
> +				 * is journal blocks, not filesystem blocks.
> +				 * Value ~0 means the run should be skipped. */
> +	u32 byte_len;		/* Data size in bytes. Multiple of
> +				 * journal block size. */
> +	u32 checksum;		/* Data checksum. */
> +};
> +
> +/*
> + * Journal list. Describes one or more journaled data runs.
> + * It is normally a single transaction.
> + */
> +struct hfs_jrnl_list {
> +	u16 reserved;
> +	u16 count;		/* number of runs plus 1 */
> +	u32 length_with_data;	/* length of the list and data */
> +	u32 checksum;		/* checksum of the first 32 bytes */
> +	u32 flags;		/* see possible flags below */
> +	u64 reserved1;		/* unused part of 1st fake run */
> +	u32 reserved2;		/* unused part of 1st fake run */
> +	u32 tr_end_or_seq_num;	/* Before sequence numbers introduction:
> +				   zero value means end of transaction.
> +				   After sequence numbers introduction:
> +				   a non-zero sequence number. */
> +	struct hfs_jrnl_run runs[0];	/* number of elements is count-1 */
> +};
> +

I really dislike the idea of: (1) naming structures really differently from Technical Note TN1150; (2) naming structure fields really differently from TN1150; (3) distorting a structure compared with TN1150 (I mean struct block_list_header, for example). In such a way you simply confuse everybody, because your code becomes hardly understandable and obscure. From another point of view, you added the last_seq_num field to the journal header, but you don't actually use this field to compare its value with the value in the last transaction. So the declarations in my patchset are much better.

Moreover, __le or __be types are usually used to declare on-disk layout types. Yes, I understand that an HFS+ journal can be either little-endian or big-endian. And I think the Intel architecture is the more frequent case now, so it makes sense to use __le types for the on-disk HFS+ journal layout declarations (the journal header, block list header and block info). I really dislike the idea of using u16, u32 and u64 types for on-disk layout declarations: it is really confusing and it provides no profit.
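Whatever fixed-endian types end up in the declarations, the runtime check reduces to one comparison against the endian magic stored in the journal header; a userspace sketch (the helper names are mine, not from either patchset):

```c
#include <stdint.h>

#define HFSPLUS_JOURNAL_ENDIAN_MAGIC 0x12345678u

/* Unconditional 32-bit byte swap. */
static uint32_t swab32_mock(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Decide whether the journal's multi-byte fields need byte-swapping:
 * if the endian field reads back as the magic, the journal matches the
 * host byte order; if it reads back byte-swapped, every multi-byte
 * field must be swapped before use.
 * Returns 1 = swap needed, 0 = native order, -1 = corrupted header. */
static int journal_needs_swap(uint32_t endian_field_as_read)
{
	if (endian_field_as_read == HFSPLUS_JOURNAL_ENDIAN_MAGIC)
		return 0;
	if (endian_field_as_read == swab32_mock(HFSPLUS_JOURNAL_ENDIAN_MAGIC))
		return 1;
	return -1;
}
```

That is, the four system/journal combinations below collapse at runtime into just "native" and "swapped".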
There are four possible principal situations: (1) LE system <-> LE journal; (2) BE system <-> LE journal; (3) LE system <-> BE journal; (4) BE system <-> BE journal. So you need to use the xx_to_cpu() and cpu_to_xx() families of methods anyway.

> diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
> index 80875aa..84b832e 100644
> --- a/fs/hfsplus/super.c
> +++ b/fs/hfsplus/super.c
> @@ -403,6 +403,7 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  		goto out_unload_nls;
>  	}
>
> +reread_vhdr:
>  	/* Grab the volume header */
>  	if (hfsplus_read_wrapper(sb)) {

OK, I have a bug here in my patchset. But it can easily be fixed by adopting your approach here.

>  		if (!silent)
> @@ -447,6 +448,31 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  	sb->s_op = &hfsplus_sops;
>  	sb->s_maxbytes = MAX_LFS_FILESIZE;
>
> +	if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
> +	    !(sb->s_flags & MS_RDONLY)) {
> +		bool journal_was_empty;
> +		int replay_error;
> +
> +		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
> +		sb->s_flags |= MS_RDONLY;
> +
> +		replay_error = hfsplus_jrnl_replay(sb,
> +			be32_to_cpu(vhdr->jrnl_info_block), &journal_was_empty);
> +		if (replay_error)
> +			pr_err("journal replay error %d\n", replay_error);
> +
> +		/*
> +		 * Reread volume header after replay, no matter if it succeeded
> +		 * or not. We've just set read-only flag, so no further replay
> +		 * attempts will happen.
> +		 */
> +		kfree(sbi->s_vhdr_buf);
> +		kfree(sbi->s_backup_vhdr_buf);
> +		sbi->s_vhdr_buf = 0;
> +		sbi->s_backup_vhdr_buf = 0;
> +		goto reread_vhdr;
> +	}
> +
>  	if (!(vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_UNMNT))) {
>  		pr_warn("Filesystem was not cleanly unmounted, running fsck.hfsplus is recommended.  mounting read-only.\n");
>  		sb->s_flags |= MS_RDONLY;
> @@ -455,10 +481,6 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
>  	} else if (vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_SOFTLOCK)) {
>  		pr_warn("Filesystem is marked locked, mounting read-only.\n");
>  		sb->s_flags |= MS_RDONLY;
> -	} else if ((vhdr->attributes & cpu_to_be32(HFSPLUS_VOL_JOURNALED)) &&
> -		   !(sb->s_flags & MS_RDONLY)) {
> -		pr_warn("write access to a journaled filesystem is not supported, use the force option at your own risk, mounting read-only.\n");
> -		sb->s_flags |= MS_RDONLY;

Simply removing these lines is wrong. Adding journal replay makes it necessary to rework the mount options, because the "force" mount option was suggested due to the absence of journal replay functionality. I reworked the mount options in my patchset by deleting the "force" option and adding a "norecovery" option. Otherwise, leaving the mount options untouched looks like schizophrenia.

[TO RESUME] My principal points are these:
(1) On-disk layout declarations should live in peace with Technical Note TN1150.
(2) Journal replay should be transaction based. I mean that transactions need to be checked and committed one by one.
(3) The "force" mount option should be changed to a "norecovery" mount option.
(4) Journal initialization should be implemented.

As a result, I think my patchset is the much better starting point for a joint implementation with you. I suggest:
(1) Fixing the bugs and shortcomings in my patchset by adopting some solutions from your patch (rereading the superblock, using get_unaligned_be64() and so on).
(2) Keeping journal replay transaction based. It then makes sense to optimize transaction processing by reading ahead a transaction's blocks before processing the transaction; such submission can be made by means of an optimized version of hfsplus_submit_bio() or a similar analog.
Moreover, it is possible to calculate a block's checksum in the bio->bi_end_io method for block sizes smaller than 8 KB (512 bytes, 1 KB, 2 KB, 4 KB).

Thanks,
Vyacheslav Dubeyko.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html