Re: [PATCH,RFC 7/7] ext4: move ext4_journal_start/stop to mpage_da_map_and_submit()

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 18 Feb 2011 12:42:13 +0200

On Fri, Feb 18, 2011 at 6:23 AM, Ted Ts'o <tytso@xxxxxxx> wrote:
> On Sat, Feb 12, 2011 at 07:15:57PM -0500, Theodore Ts'o wrote:
>> Previously, ext4_da_writepages() was responsible for calling
>> ext4_journal_start() and ext4_journal_stop().  If the blocks had
>> already been allocated (we don't support journal=data in
>> ext4_da_writepages), then there's no need to start a new journal
>> handle.
>>
>> By moving ext4_journal_start/stop calls to mpage_da_map_and_submit()
>> we should significantly reduce the cpu usage (and cache line bouncing)
>> if the journal is enabled.  This should (hopefully!) be especially
>> noticeable on large SMP systems.
>>
>> Signed-off-by: "Theodore Ts'o" <tytso@xxxxxxx>
>
> Argh, it turns out this doesn't work.  I was getting sporadic
> deadlocks and I finally figured out the problem.  If a process is
> holding page locks, it can't call ext4_journal_start() safely in
> data=ordered, since there's a chance that there won't be enough
> transaction credits and a new transaction will be started.  And at
> that point, in data=ordered mode, we may end up calling
> journal_submit_inode_data_buffers(), which could try to write back the
> inode pages in question --- which are already locked.
>
> This means that we need to start the journal handle long before we
> know whether or not we really need it.  Boo, hiss!
>
> The only way to solve this problem is to do what I've been planning
> all for a while, which is to add support in ext4_map_blocks() for a
> mode where it will allocate a region of blocks, but *not* update the
> extent map.  It will have to store the allocation in an in-memory
> cache, so that if other CPU's try to request a logical block, it will
> get the right answer.  However, the actual on-disk extent map can't be
> updated until *after* the data is safely written on disk (and the
> pages can thus be unlocked).
>
> Once we do that, we'll also be able to ditch ordered mode for good,
> since it means that there won't be any chance of stale data being
> revealed, without any of performance disasters involved with
> data=ordered mode.
>
> I have no idea what these changes will do to Amir's snapshot plans,
> but sorry, getting this right is going to be higher priority.

If anything, memory-only data allocations would be a great contribution
to extent data move-on-write :-)

It would allow me to split the extent in-memory and defer the decision,
whether to split the extent on-disk or wait for copy-on-write to complete,
to data writeback time.

By that time, async copy-on-write sequence may have already completed
and fragmentation can be avoided.

If you are looking for someone to execute your plan, or write some
experimental code, I think that Yongqiang would be up for the task
(hope that's OK with Yongqiang)

>
> I may end up submitting the rest of this patch series without this
> last patch, since it does clean up the code paths a lot, and it should
> result in a few small performance improvements --- the big performance
> improvement, found in this patch, we'll have to skip until we can fix
> up the writeback submission.
>
>                                       - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html