Re: Help on Implementation of EXT3 type Ordered Mode in EXT4

Kailas Joshi <kailas.joshi@xxxxxxxxx> · Thu, 11 Feb 2010 13:02:15 +0530

On 11 February 2010 12:31, Kailas Joshi <kailas.joshi@xxxxxxxxx> wrote:
>
> On 9 February 2010 23:11, <tytso@xxxxxxx> wrote:
>>
>> On Tue, Feb 09, 2010 at 05:05:22PM +0100, Jan Kara wrote:
>> >   Hi,
>> >
>> > > I recently found that in EXT4 with delayed block the Ordered mode does not
>> > > bahave same as in EXT3.
>> > > I found a patch for this at http://lwn.net/Articles/324023/, but it has some
>> > > journal block estimation problem resulting into deadlock.
>> > >
>> > > I would like to know if it has been solved.
>> > > If not, is it possible to solve it? What are the complexities involved?
>> >
>> > It has not been solved. The problem is that to commit data on
>> > transaction commit (which is what data=ordered mode has historically
>> > done), you have to allocate space for these blocks. But that
>> > allocation needs to modify a filesystem and thus journal more
>> > blocks... And that is tricky - we would have to reserve space in the
>> > current transaction for allocation of delayed data.  So it gets a
>> > bit messy...
>>
>> The dioread_nolock patches from Jiaying, which are currently in the
>> unstable portion of the tree, is a partial solution to the
>> data=ordered problem, although it solves it in a slightly different
>> way.
>>
>> As a side effect of trying to avoid locking on the direct I/O read
>> path, on the buffered I/O write path it changes things so the extent
>> tree is first changed so the blocks are allocated with the "extent
>> uninitialized" bit, and then only after the blocks hit the disk, via
>> the bh completion callback, do we set the extent so that it is marked
>> as containing initialized data.
>>
>> As a result, if you crash before the extent tree is updated, when you
>> read from the file, you will get all zero's, instead of the data, thus
>> preventing the security leak.
>>
>> It does mean that fsync() is slightly slower, since we now have to
>> flush the data blocks out, wait for the completion handler to fire and
>> update the extent in the same jbd2 transaction, and only then wait for
>> the barrier in the jbd2 transaction.  (And in fact, I'm not sure
>> fsync() is completely working correctly in the current patch in the
>> unstable patch stream, and there aren't race conditions where the
>> extent tree update slips into the next transaction.)  But it does
>> solve the problem.
>>
>> The other downside with this solution is that it only works for files
>> that are extent-mapped, and if you do this with a converted ext3 file
>> system, and there are files that are still mapped using
>> direct/indirect blocks, when you change the mount option to be
>> data=writeback,dioread_nolock, the block allocating writes to these
>> legacy files could result in data getting exposed after a crash.
>>
>> Depending on the workload the upside is that by using data=writeback
>> instead of data=ordered could far outweigh the downside of needing to
>> do an extra block I/O queue flush before the fsync, since it reduces
>> the number of entangled writes to only the metadata blocks, where
>> previously the entagled write problem affected metadata blocks plus
>> all freshly allocated blocks.
>>
>> Kalias, this is something that I plan to look in the near future; if
>> you are interested in helping to benchmark and characterize this
>> solution, I'd be very interested in working with you.  Can you tell me
>> a little more about your use case and requirements?
>>
>>                                      - Ted
>

Jan and Ted, thank you very much for detailed replies.

We are assessing the use of copy-on-write technique to provide data
level consistency in EXT3/EXT4. We have implemented this in EXT3 by
using the Ordered mode of operation. Benchmark results for IOZone and
Postmark are quiet good. We could get the consistency equivalent to
Journal mode with the overhead almost same as Ordered mode. However,
there are few cases(for example, file rewrite) where performance of
Journal mode is better than our technique. We think that in EXT4, with
the support for delayed block allocation and extents, these problems
can be removed.

However, Ordered mode with delayed block allocation in EXT4 does not
behave in the same way as in EXT3. It does not flush 'all' dirty
blocks to the disk as in EXT3. For implementing our technique in EXT4,
we need EXT3 style Ordered mode, that is
alloc_on_commit(http://lwn.net/Articles/324023/).

I understand that this is not required in EXT4 since the Ordered mode
is provided for security and not consistency. However, from the
discussions on blogs/post, it seems that developers expect Ordered
mode to provide (limited) data consistency as well.

Since the implementation of our technique heavily depends on EXT3
style Ordered mode, I would like to implement alloc_on_commit on EXT4.
I have designed following strategy to address credit reservation
problem in earlier patch. Please let me know your comments on it.

1. In Write path, the call to journal_start() for updating metadata
will reserve credits for delayed allocation also.
2. If the fs is mounted with alloc_on_commit, journal_stop() will not
return remaining credits to the journal (t_outstanding_credits will
not be changed).
3. In journal_commit() -
i. After LOCKing the current transaction, a new special handle will be
created by calling journal_start() with zero credits . Such a call to
journal_start() can be treated as a special case for creating handle
to use accumulated credits (in t_outstanding_credits) of currently
locked transaction.
ii. Before changing transaction state to FLUSH, callback will be used
to perform delayed block allocation for all inodes. This mechanism
will be same as in alloc_on_commit at http://lwn.net/Articles/324023/
, but it will be performed after changing the transaction to LOCKED
state. In the callback, specially created handle will passed to the
callback function and it will use that handle for performing delayed
block allocation.
iii. The special handle will be closed, outstanding credits for
transaction will be zeroed and the transaction flush will continue.

Regarding dioread_nolock work:
Ted, I am new in filesystem development. If this is fine and your
deadlines are not very critical, I will be very happy to work with you
on dioread_nolock even though its not directly related to our current
work. Please let me know more on this.

Thanks & Regards,
Kailas
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html