On Sat, May 8, 2010 at 7:25 PM, <tytso@xxxxxxx> wrote:
> On Sat, May 08, 2010 at 06:07:40PM +0200, Amir G. wrote:
>>
>> Next3 is another implementation of the extended f/s format.
>> Next3 is a superset of ext3 plus snapshots.
>
> As long as Next3 uses fields which have already assigned to ext4, this
> is a claim that you can not make correctly. Because, you see, the
> ext4 is also an implementation of the extended f/s format, and those
> field assignments have already been made.
>
>> All overlapping field issues can be resolved.
>
> As long as you are willing to say that, then sure, let's work towards
> that goal.
>

Let me state my case then:

Next3 uses 1 assigned field (i_version), but it does not "abuse" it.
You see, Next3 only tampers with i_version of snapshot files.
And by tamper I mean: set it to the next snapshot inode number on snapshot take.
And snapshot files are not modifiable by users (only by the f/s itself).
So if the f/s decides to assign an arbitrary value to i_version of
snapshot files, it doesn't break the extended f/s format, does it?

Next3 also uses 9 i_flags bits (0x1FF00000), in snapshot file inodes only,
some of which currently overlap flags recently assigned to Ext4 (you beat me to it).
There is a big waste in the i_flags bit space, for example, the 4 bits reserved
for compression, which are not in use by non-compressed files.
Snapshot files are never compressed, so I wouldn't mind reusing those 4 bits
for snapshot flags.
Overloading auxiliary bits with different meanings depending on some other bit
does not make this a different f/s format.
It simply makes more efficient use of expensive space.

>
> If you do the "move-on-write" trick, you just have to split the extent
> and do a COW of the extent tree and/or the inode. So for a single
> block, the performance hit the same, yes? But in the long-run, it's
> probably more efficient to do "move-on-write".
>

All metadata is COWed, inside the JBD hooks, so the extent tree and
inode are taken care of.
It is the data blocks which are being moved-on-write for efficiency.
The problem with splitting the extent is that when an application does a lot of
in-place writes to an extent mapped file, the file will eventually end up being
broken down into tiny extents or blocks, and that is a problem, right?

>> There is an important design decision to make here.
>
> Technically speaking, it's possible to do it both way, yes? I'm not
> sure why you consider this such a important design decision. We can
> even play games where for some files we might do copy-on-write, and
> for some files, we do move-on-write. It's always possible to check
> the COW bitmaps to decide what had happened.
>

Definitely yes! I never thought it would really have to come down to a
"decision", because there is a trade-off at hand.
Even in Next3, without extents, it makes sense to have a choice of
write performance vs. fragmentation per file.
The few applications that use random in-place writes (db, virtual disk)
would probably want to avoid the fragmentation.

> In any case, if this is all you have to do, I'm not sure why you said
> it was fundamentally impossible to support extents with the Next3
> design.
>

Wait just a minute! I said "not an easy task" and "break the design concepts",
but I never said (as far as I recall) "fundamentally impossible".
Well, perhaps "breaking the design concepts" was too harsh :-)

I quote from the Next3 wiki FAQ:
"Can Next3 snapshot support be applied to Ext4?
Most of the snapshot code can work on Ext4 as is, but the move-on-write
technique used for regular files data blocks will require additional work
before it can be applied to extent mapped files."

I would have to say that "considerable amount of time" is the main
obstacle for the merge task.

So my humble and biased suggestion is:
let's start working with Next3, get to know its strengths and weaknesses,
and then design the nExt4 merge together.

Amir.
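
To make the write performance vs. fragmentation trade-off discussed above concrete, here is a toy model in C. It is only a sketch under stated assumptions and is not Next3 or ext4 code: the names (toy_fs, toy_write, toy_run, POLICY_COW, POLICY_MOW), the bump allocator and the block numbers are all hypothetical. It models the first in-place write to a data block after a snapshot is taken, and counts how many contiguous extents the live file ends up with under each policy.

```c
#include <assert.h>

#define NBLOCKS  8      /* logical blocks in the toy file (assumption) */
#define NO_BLOCK (-1)   /* snapshot holds no private copy of this block */

/* Hypothetical per-file policy, as in "COW for some files, MOW for others". */
enum write_policy { POLICY_COW, POLICY_MOW };

struct toy_fs {
    int file_map[NBLOCKS];  /* logical -> physical block of the live file */
    int snap_map[NBLOCKS];  /* logical -> physical block kept for the snapshot */
    int next_free;          /* trivial bump allocator for new blocks */
};

static void toy_init(struct toy_fs *fs)
{
    for (int i = 0; i < NBLOCKS; i++) {
        fs->file_map[i] = 100 + i;  /* file starts as one contiguous extent */
        fs->snap_map[i] = NO_BLOCK;
    }
    fs->next_free = 500;
}

/* Model an in-place write to logical block lblk after a snapshot take. */
static void toy_write(struct toy_fs *fs, int lblk, enum write_policy policy)
{
    assert(lblk >= 0 && lblk < NBLOCKS);

    if (fs->snap_map[lblk] != NO_BLOCK)
        return;  /* old data already preserved; later writes go in place */

    int fresh = fs->next_free++;
    if (policy == POLICY_COW) {
        /* Copy-on-write: old data is copied to a fresh block for the
         * snapshot (extra I/O); the file keeps its physical location. */
        fs->snap_map[lblk] = fresh;
    } else {
        /* Move-on-write: the snapshot takes over the old block (no copy);
         * the file is remapped to a fresh block somewhere else. */
        fs->snap_map[lblk] = fs->file_map[lblk];
        fs->file_map[lblk] = fresh;
    }
}

/* Count runs of physically contiguous blocks in the live file. */
static int toy_file_extents(const struct toy_fs *fs)
{
    int extents = 1;
    for (int i = 1; i < NBLOCKS; i++)
        if (fs->file_map[i] != fs->file_map[i - 1] + 1)
            extents++;
    return extents;
}

/* Write logical blocks 1, 3 and 5 once each; return the extent count. */
static int toy_run(enum write_policy policy)
{
    struct toy_fs fs;
    toy_init(&fs);
    toy_write(&fs, 1, policy);
    toy_write(&fs, 3, policy);
    toy_write(&fs, 5, policy);
    return toy_file_extents(&fs);
}
```

In this model, toy_run(POLICY_COW) leaves the file as a single extent at the cost of one data copy per first write, while toy_run(POLICY_MOW) avoids the copies but splits the 8-block file into 7 extents after only three scattered writes. That is the reason a per-file choice looks attractive for random in-place writers such as databases and virtual disk images.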