On 2021/8/3 2:23, Jaegeuk Kim wrote:
On 08/02, Chao Yu wrote:
On 2021/8/2 12:39, Eric Biggers wrote:
On Fri, Jul 30, 2021 at 10:46:16PM -0400, Theodore Ts'o wrote:
On Fri, Jul 30, 2021 at 12:17:26PM -0700, Eric Biggers wrote:
Currently, non-overwrite DIO writes are fundamentally unsafe on f2fs as
they require preallocating blocks, but f2fs doesn't support unwritten
blocks and therefore has to preallocate the blocks as regular blocks.
f2fs has no way to reliably roll back such preallocations, so as a
result, f2fs will leak uninitialized blocks to users if a DIO write
doesn't fully complete.
There's another way of solving this problem which doesn't require
supporting unwritten blocks. What a file system *could* do is to
allocate the blocks, but *not* update the on-disk data structures ---
so the allocation happens in memory only, so you know that the
physical blocks won't get used for other files, and then issue the
data block writes. On the block I/O completion, trigger a workqueue
function which updates the on-disk metadata to assign physical blocks
to the inode.
That way if you crash before the data I/O has a chance to complete,
the on-disk logical block -> physical block map hasn't been updated
yet, and so you don't need to worry about leaking uninitialized blocks.
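To make the ordering above concrete, here is a minimal userspace sketch, not f2fs code. Everything in it (struct toy_fs, toy_reserve(), toy_commit(), toy_crash()) is a hypothetical name made up for illustration. It assumes a toy filesystem whose persistent state is just a block bitmap plus a logical -> physical map, and models a crash as losing the in-memory reservations only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BLOCKS 16
#define NO_BLOCK  (-1)

/* Hypothetical toy model for illustration; these are not f2fs structures. */
struct toy_fs {
	bool reserved[NR_BLOCKS];   /* in-memory only: lost on crash */
	bool disk_used[NR_BLOCKS];  /* persisted allocation bitmap */
	int  disk_map[NR_BLOCKS];   /* persisted logical -> physical map */
};

static void toy_init(struct toy_fs *fs)
{
	for (size_t i = 0; i < NR_BLOCKS; i++) {
		fs->reserved[i] = false;
		fs->disk_used[i] = false;
		fs->disk_map[i] = NO_BLOCK;
	}
}

/* Step 1: reserve a physical block in memory; nothing persistent changes. */
static int toy_reserve(struct toy_fs *fs)
{
	for (int p = 0; p < NR_BLOCKS; p++) {
		if (!fs->disk_used[p] && !fs->reserved[p]) {
			fs->reserved[p] = true;
			return p;
		}
	}
	return NO_BLOCK;
}

/* Step 2: run after the data write completed; only now touch "disk" state. */
static void toy_commit(struct toy_fs *fs, int lblk, int pblk)
{
	assert(lblk >= 0 && lblk < NR_BLOCKS);
	assert(pblk >= 0 && pblk < NR_BLOCKS && fs->reserved[pblk]);
	fs->reserved[pblk] = false;
	fs->disk_used[pblk] = true;
	fs->disk_map[lblk] = pblk;
}

/* A crash drops all in-memory reservations; persisted state survives. */
static void toy_crash(struct toy_fs *fs)
{
	for (size_t i = 0; i < NR_BLOCKS; i++)
		fs->reserved[i] = false;
}
```

If toy_crash() runs between toy_reserve() and toy_commit(), disk_map still holds NO_BLOCK for that logical block and the physical block is simply free again, so no stale data becomes reachable through the file.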
Thanks for your suggestion, I think it makes sense.
Cheers,
- Ted
Jaegeuk and Chao, any idea how feasible it would be for f2fs to do this?
Firstly, note that the following metadata is touched during the DIO
preallocation flow:
- log header
- sit bitmap/count
- free seg/sec bitmap/count
- dirty seg/sec bitmap/count
And there is one case we need to be concerned about: checkpoint() can be
triggered at any time between dio_preallocate() and dio_end_io(), so we should
not persist any DIO preallocation related metadata during checkpoint();
otherwise, a sudden power-cut after the checkpoint will corrupt the filesystem.
So we need to cleanly separate two kinds of metadata updates:
a) those belonging to DIO preallocation
b) the rest
After that, the checkpoint() flow simplifies to flushing only metadata b); other
flows, like GC and data/node allocation, need to query/update the combination of
metadata a) and b).
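A rough sketch of that a)/b) split, purely as an illustration: it assumes one bit per block in a 32-block segment, and struct seg_meta, blk_in_use(), dio_prealloc(), dio_complete() and toy_checkpoint() are hypothetical names, not existing f2fs helpers. The real SIT, free-segment and dirty-segment structures are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per block, 32 blocks. Not f2fs structures. */
struct seg_meta {
	uint32_t dio_bits;   /* a) blocks preallocated for in-flight DIO */
	uint32_t main_bits;  /* b) everything else */
	uint32_t disk_bits;  /* what the last checkpoint persisted */
};

/* Allocators and GC must look at a) and b) combined. */
static bool blk_in_use(const struct seg_meta *m, unsigned blk)
{
	assert(blk < 32);
	return ((m->dio_bits | m->main_bits) >> blk) & 1u;
}

/* DIO preallocation only touches the a) set. */
static void dio_prealloc(struct seg_meta *m, unsigned blk)
{
	assert(!blk_in_use(m, blk));
	m->dio_bits |= UINT32_C(1) << blk;
}

/* DIO completion: the block moves from a) to b). */
static void dio_complete(struct seg_meta *m, unsigned blk)
{
	assert(blk < 32);
	uint32_t bit = UINT32_C(1) << blk;

	assert(m->dio_bits & bit);
	m->dio_bits &= ~bit;
	m->main_bits |= bit;
}

/* Checkpoint persists b) only; in-flight DIO preallocations stay in memory. */
static void toy_checkpoint(struct seg_meta *m)
{
	m->disk_bits = m->main_bits;
}
```

With this split, a checkpoint taken between dio_prealloc() and dio_complete() records nothing about the in-flight block, while blk_in_use() still keeps other allocators away from it.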
In addition, there is an existing in-memory log header framework in f2fs;
based on this framework, it's very easy for us to add a new in-memory log
header for DIO preallocation.
So it seems feasible to me so far...
Jaegeuk, any other concerns about the implementation details?
Hmm, I'm still trying to deal with this as a corner case where the writes
haven't completed due to an error. How about keeping the preallocated block
offsets and releasing them if we get an error? We need to handle EIO, right?
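As a sketch of the bookkeeping being asked about here (again a userspace toy, with struct dio_ctx, dio_prealloc_blk() and dio_finish() as made-up names and a fixed-size offset list as a simplifying assumption): remember which block offsets this write preallocated, and unmap exactly those if the write ends in an error such as -EIO. It covers only the error path within a running system.

```c
#include <assert.h>
#include <errno.h>   /* callers pass e.g. -EIO to dio_finish() */
#include <stdbool.h>
#include <stddef.h>

#define MAX_PREALLOC 8
#define FILE_BLOCKS  32

/* Hypothetical per-write context; not an f2fs structure. */
struct dio_ctx {
	bool   mapped[FILE_BLOCKS];     /* toy per-file block map */
	size_t prealloc[MAX_PREALLOC];  /* offsets preallocated by this write */
	size_t nr_prealloc;
};

/*
 * Make sure block 'off' is mapped before the write is issued.
 * Already-mapped blocks are overwrites and are not recorded.
 */
static bool dio_prealloc_blk(struct dio_ctx *c, size_t off)
{
	if (off >= FILE_BLOCKS)
		return false;
	if (c->mapped[off])
		return true;
	if (c->nr_prealloc == MAX_PREALLOC)
		return false;
	c->mapped[off] = true;
	c->prealloc[c->nr_prealloc++] = off;
	return true;
}

/* err is 0 on success or a negative errno; on error, undo our preallocations. */
static void dio_finish(struct dio_ctx *c, int err)
{
	assert(c->nr_prealloc <= MAX_PREALLOC);
	if (err) {
		for (size_t i = 0; i < c->nr_prealloc; i++)
			c->mapped[c->prealloc[i]] = false;
	}
	c->nr_prealloc = 0;
}
```

Blocks that were already mapped before the write are left alone on error, since only the offsets recorded in prealloc[] are released.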
What about the case of CP + SPO (a checkpoint followed by a sudden power-off)
after DIO preallocation? The user will encounter uninitialized blocks after
recovery.
Thanks,
Thanks,
- Eric