Re: Introducing Next3 - built-in snapshots support for Ext3

"Amir G." <amir73il@xxxxxxxxxxxxxxxxxxxxx> · Sat, 8 May 2010 21:40:12 +0200

On Sat, May 8, 2010 at 7:25 PM,  <tytso@xxxxxxx> wrote:
> On Sat, May 08, 2010 at 06:07:40PM +0200, Amir G. wrote:
>>
>> Next3 is another implementation of the extended f/s format.
>> Next3 is a superset of ext3 plus snapshots.
>
> As long as Next3 uses fields which have already assigned to ext4, this
> is a claim that you can not make correctly.  Because, you see, the
> ext4 is also an implementation of the extended f/s format, and those
> field assignments have already been made.
>
>> All overlapping field issues can be resolved.
>
> As long as you are willing to say that, then sure, let's work towards
> that goal.
>

Let me state my case then:

Next3 uses one assigned field (i_version), but it does not "abuse" it.
You see, Next3 only tampers with the i_version of snapshot files.
By "tamper" I mean: set it to the next snapshot's inode number when a
snapshot is taken.
Snapshot files are not modifiable by users (only by the f/s itself).
So if the f/s decides to assign an arbitrary value to the i_version of
snapshot files, it doesn't break the extended f/s format, does it?
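
A minimal sketch of how such a value could be used, assuming the
i_version of each snapshot file links it to the snapshot taken before
it, so that snapshot inodes form a chain. The dictionary standing in
for on-disk inodes, the helper names, the inode numbers and the use of
0 as a terminator are all hypothetical and not taken from the Next3
code; the direction of the link is also an assumption.

```python
# Hypothetical in-memory stand-in for on-disk inodes:
# maps a snapshot inode number to the value stored in its i_version.
# Assumption: 0 terminates the chain.

def take_snapshot(inodes, head, new_ino):
    """Record a new snapshot inode that points at the previous head."""
    inodes[new_ino] = head   # i_version := inode number of the next snapshot
    return new_ino           # the new snapshot becomes the head of the chain

def walk_snapshots(inodes, head):
    """Follow the i_version links from the head until the terminator."""
    chain = []
    ino = head
    while ino != 0:
        chain.append(ino)
        ino = inodes[ino]
    return chain

inodes = {}
head = 0
for ino in (12, 13, 14):     # arbitrary illustrative inode numbers
    head = take_snapshot(inodes, head, ino)

print(walk_snapshots(inodes, head))
```

Only the f/s writes these values, so a regular file's i_version is
never touched by this scheme.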

Next3 also uses 9 i_flags bits (0x1FF00000), in snapshot file inodes only.
Some of them currently overlap flags recently assigned to Ext4 (you
beat me to it).
A lot of the i_flags bit space is wasted. For example, the 4 bits
reserved for compression are not in use by non-compressed files.
Snapshot files are never compressed, so I wouldn't mind reusing those
4 bits for snapshot flags.
Overloading auxiliary bits with different meanings depending on some
other bit does not make this a different f/s format.
It simply makes more efficient use of expensive space.
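
The overloading idea can be sketched as follows. Only the mask
0x1FF00000 comes from this message; the bit chosen to mark a snapshot
file and the position of the 4-bit auxiliary field are hypothetical
and are not the real on-disk assignments of ext3, Ext4 or Next3.

```python
SNAPSHOT_FLAGS_MASK = 0x1FF00000   # the 9 i_flags bits named above

# Hypothetical assignments, for illustration only.
SNAPFILE_FL = 0x00100000           # hypothetical: marks a snapshot file
AUX_SHIFT = 8                      # hypothetical position of a 4-bit field
AUX_MASK = 0xF << AUX_SHIFT

def decode_aux(i_flags):
    """Interpret the auxiliary bits according to the kind of file."""
    aux = (i_flags & AUX_MASK) >> AUX_SHIFT
    if i_flags & SNAPFILE_FL:
        return ("snapshot", aux)       # bits carry snapshot state
    return ("compression", aux)        # bits keep their usual meaning

# The mask covers exactly 9 bits (bits 20 through 28).
print(bin(SNAPSHOT_FLAGS_MASK).count("1"))

print(decode_aux(SNAPFILE_FL | (0x5 << AUX_SHIFT)))
print(decode_aux(0x5 << AUX_SHIFT))
```

The same stored bits decode differently depending on one
discriminating bit, which is the whole point of the argument.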

>
> If you do the "move-on-write" trick, you just have to split the extent
> and do a COW of the extent tree and/or the inode.  So for a single
> block, the performance hit the same, yes?  But in the long-run, it's
> probably more efficient to do "move-on-write".
>

All metadata is COWed, inside the JBD hooks, so the extent tree and
inode are taken care of.
It is the data blocks that are moved-on-write for efficiency.
The problem with splitting the extent is that when an application does
a lot of in-place writes to an extent-mapped file, the file will
eventually end up broken down into tiny extents or single blocks, and
that is a problem, right?
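
The fragmentation concern can be illustrated with a toy model. This is
a sketch under simplifying assumptions, not kernel code: an extent is
a (logical start, physical start, length) tuple, each in-place write
hits a block still referenced by a snapshot, and adjacent extents are
never merged back. All names and block numbers are hypothetical.

```python
def move_on_write(extents, lblk, new_pblk):
    """Remap one logical block to a new physical block, splitting its extent."""
    out = []
    for start, pstart, length in extents:
        if start <= lblk < start + length:
            off = lblk - start
            if off > 0:                       # part before the moved block
                out.append((start, pstart, off))
            out.append((lblk, new_pblk, 1))   # the moved block itself
            if off + 1 < length:              # part after the moved block
                out.append((lblk + 1, pstart + off + 1, length - off - 1))
        else:
            out.append((start, pstart, length))
    return out

extents = [(0, 1000, 64)]      # one 64-block extent
next_free = 5000               # hypothetical allocator cursor
for lblk in (10, 30, 50):      # three scattered in-place writes
    extents = move_on_write(extents, lblk, next_free)
    next_free += 1

print(len(extents))
```

In this model every write into the middle of an extent adds two
extents, so a file with many random in-place writes degrades towards
one extent per block.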

>> There is an important design decision to make here.
>
> Technically speaking, it's possible to do it both way, yes?  I'm not
> sure why you consider this such a important design decision.  We can
> even play games where for some files we might do copy-on-write, and
> for some files, we do move-on-write.  It's always possible to check
> the COW bitmaps to decide what had happened.
>

Definitely yes! I never thought it would really have to come down to a
"decision", because there is a trade-off at hand.
Even in Next3, without extents, it makes sense to offer a per-file
choice between write performance and fragmentation.
The few applications that do random in-place writes (databases, virtual
disks) would probably want to avoid the fragmentation.
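
A toy comparison of the two policies, as a sketch only. It assumes
that copy-on-write copies the old data to a new block owned by the
snapshot and leaves the file's mapping alone, while move-on-write
hands the old block to the snapshot and remaps the file to a new
block. The per-file "policy" field, the helper names and the block
numbers are hypothetical.

```python
import itertools

def make_file(policy, nblocks=8, first_pblk=100):
    """A contiguous file whose blocks are all referenced by a snapshot."""
    return {"policy": policy,
            "map": {i: first_pblk + i for i in range(nblocks)},
            "snapshot": {},
            "copies": 0}

def write_block(f, lblk, alloc):
    """Handle the first write to a block that a snapshot still references."""
    old = f["map"][lblk]
    if f["policy"] == "cow":
        f["snapshot"][lblk] = alloc()   # old data is copied to a new block
        f["copies"] += 1                # extra copy; file layout unchanged
    else:                               # "mow"
        f["snapshot"][lblk] = old       # snapshot inherits the old block
        f["map"][lblk] = alloc()        # file is remapped; no data copy

def fragments(f):
    """Count physically contiguous runs in the file's block map."""
    blocks = [f["map"][i] for i in sorted(f["map"])]
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)

results = {}
for policy in ("cow", "mow"):
    alloc = itertools.count(900).__next__   # hypothetical block allocator
    f = make_file(policy)
    for lblk in (2, 5):
        write_block(f, lblk, alloc)
    results[policy] = (f["copies"], fragments(f))

print(results)
```

In this model copy-on-write pays one extra block copy per write but
keeps the file contiguous, while move-on-write avoids the copy and
fragments the file, which is the trade-off a per-file choice would
expose.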

> In any case, if this is all you have to do, I'm not sure why you said
> it was fundamentally impossible to support extents with the Next3
> design.
>

Wait just a minute! I said "not an easy task" and "break the design
concepts", but I never said (as far as I recall) "fundamentally
impossible". Well, perhaps "breaking the design concepts" was too
harsh :-)

I quote from Next3 wiki FAQ:
"Can Next3 snapshot support be applied to Ext4?
Most of the snapshot code can work on Ext4 as is, but the
move-on-write technique used for regular files data blocks will
require additional work before it can be applied to extent mapped
files."

I would have to say that "considerable amount of time" is the main
obstacle for the merge task.

So my humble and biased suggestion is:
let's start working with Next3, get to know its strengths and weaknesses
and then design the nExt4 merge together.

Amir.