Re: [RFC] ext4: Semantics of delalloc,data=ordered

Andreas Dilger <adilger@xxxxxxx> · Mon, 16 Jun 2008 12:55:24 -0600

On Jun 16, 2008  17:05 +0200, Jan Kara wrote:
>   First, I'd like to see some short comment on what semantics
> delalloc,data=ordered is going to have. I can imagine at least
> two sensible approaches:
>   1) All we guarantee is that user is not going to see uninitialized data.
> We send writes to disk (and allocate blocks) whenever it fits our needs
> (usually when pdflush finds them).
>   2) We guarantee that when transaction commits, your data is on disk -
> i.e., we allocate actual blocks on transaction commit.
> 
>   Both these possibilities have their pros and cons. Most importantly,
> 1) gives better disk layout while 2) gives higher consistency
> guarantees. Note that with 1), it can under some circumstances happen,
> that after a crash you see block 1 and 3 of your 3-block-write on disk,
> while block 2 is still a hole. 1) is easy to implement (you mostly did
> it below), 2) is harder. I think there should be broader consensus on
> what the semantics should be (changed subject to catch more attention ;).

IMHO, the semantics should be (1) and not (2).  Applications don't understand
"when the transaction commits", so (2) doesn't provide any useful guarantee
to userspace.  If they actually need the data on disk (e.g. an MTA), then
they need to call fsync() to ensure this.
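
A minimal sketch of what "call fsync" means for such an application,
assuming plain POSIX I/O.  The helper name write_durably() is hypothetical
and used for illustration only; it is not taken from any MTA.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and return 0 only after
 * fsync() has reported success.  Returns -1 on any failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the write */
            close(fd);
            return -1;
        }
        p += n;                 /* handle short writes */
        len -= (size_t)n;
    }

    /* Do not rely on the journal commit: ask for the data explicitly. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

With semantics (1), only the fsync() return value tells the application
its data has reached disk.  For a newly created file, an fsync() on the
parent directory may also be needed to make the directory entry durable.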

While I agree it is theoretically possible to have the "hole in data
where there shouldn't be one" scenario, in real life these blocks would be
allocated together by delalloc+mballoc and this situation should not happen.

As for the "sync with heavy IO causing slowness" problem seen with Firefox,
I think that delalloc will help this noticeably, but I agree we can still get
into cases where a lot of dirty data was just allocated and now needs to be
flushed to disk to commit the transaction.

In the short term I don't think this can be completely fixed, but in the
long term I think it can be fixed by having mballoc do "reservations" of
space on disk, into which the dirty pages can be written.  Only after the
data is on disk does the "reservation" turn into an "allocation" in the
journal (i.e. filesystem buffers added to transaction and modified).
At that point a sync operation only has to write out the journal blocks,
because all of the data is on disk already.

I don't think it is a huge difference from what we have today, but I
also don't think it should be in the first implementation.  We would
need to split up handling of the in-memory block bitmaps so that only
the in-memory ones are updated first, then the on-disk bitmaps are
later marked in use in a transaction after the data blocks are on disk.
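
A toy model of the split described above, as a sketch only.  It assumes a
single 64-block group with one bit per block; struct toy_group and the
toy_*() functions are hypothetical names and do not correspond to mballoc
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS 64  /* toy group size, one bit per block */

struct toy_group {
    uint64_t mem_bitmap;   /* in-memory only: reserved or allocated */
    uint64_t disk_bitmap;  /* journaled on-disk bitmap: allocated only */
};

/* Bit mask covering blocks [start, start + len). */
static uint64_t range_mask(unsigned start, unsigned len)
{
    assert(len > 0 && start < NBLOCKS && len <= NBLOCKS - start);
    uint64_t m = (len == NBLOCKS) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << len) - 1);
    return m << start;
}

/* Step 1: reserve space for dirty pages.  No journal involvement. */
bool toy_reserve(struct toy_group *g, unsigned start, unsigned len)
{
    uint64_t m = range_mask(start, len);
    if (g->mem_bitmap & m)
        return false;           /* some block is already taken */
    g->mem_bitmap |= m;
    return true;
}

/*
 * Step 2: called once the data blocks are on disk.  In a real filesystem
 * this is where the bitmap buffer would be added to a transaction.
 */
void toy_commit_allocation(struct toy_group *g, unsigned start, unsigned len)
{
    uint64_t m = range_mask(start, len);
    assert((g->mem_bitmap & m) == m);   /* must have been reserved */
    g->disk_bitmap |= m;
}

/* Blocks reserved but not yet allocated on disk. */
uint64_t toy_pending(const struct toy_group *g)
{
    return g->mem_bitmap & ~g->disk_bitmap;
}
```

In this model a crash between the two steps loses only the in-memory
reservation, and a sync after step 2 has nothing but the journaled bitmap
update left to write.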

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
