hi,
On Wed 23-01-19 01:27:38, Theodore Y. Ts'o wrote:
On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
With "nodelalloc", blocks are allocated at the time of writing, and with
"dioread_nolock", these allocated blocks are marked as unwritten as well,
so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
I've been looking at your patches, and it seems that a simpler way,
perhaps more maintainable approach in the long term is to change how
we write to newly allocated blocks. Today, we have two ways of doing
this:
1) In the dioread_nolock case, we allocate blocks, insert an entry in
the extent tree with the blocks marked uninitialized, write the
blocks, and then mark the blocks initialized.
2) In the !dioread_nolock case, we allocate blocks, insert an entry to
the extent tree, write the blocks --- and if we start a commit, we
write out all dirty pages associated with that inode (in the default
data=writeback case) to avoid stale writes.
So what if we change the dioread_nolock case to do write the blocks
first, and *then* insert the entry into the extent tree? This avoids
stale data getting exposed, either by a direct I/O read, or after a
crash (which means we avoid needing to do the force write-out).
So what we would need to do is to pass a flag to ext4_map_blocks()
which causes it to *not* make any on-disk changes. Instead, it would
track the fact that blocks have be reserved in the buddy bitmap (this
is how we prevent blocks from being preallocated after they are
deleted, but before the transaction has been committed), and the
location of the assigned blocks in the extent_status tree. Since no
on-disk changes are being made, we wouldn't need to hold the
transaction open.
Then in the callback after the blocks are written, using the starting
logical block number stored in the io_end structure, we either convert
the unwritten extents or actually insert the newly allocated blocks in
the extent tree and update the on-disk bitmap allocation bitmaps.
Once we get this working, it should be easy to make dioread_nolock for
1k block sizes; it keeps the time that the handle open very short; and
it completely obviates the need for data=writeback.
What do folks think?
Hum, so there is the problem that adding extent to the extent tree may need
some block allocations for metadata. So we'd have to carry over delalloc
reservations upto the io-end time. But that should be doable, just needs
some work. Also in the dioread_nolock case we don't have problems with
page lock & page writeback vs transaction start deadlocks as I've described
in my another email regarding ext4_writepages(). So I don't see any hole in
this and the performance should be good. I like this!
First sorry for late response, I was busy with some internal work.
I really like your ideas, thanks both of you. Writing dirty pages while holding
open jbd2 handle will result in some issues, for example, system load will be high,
many tasks in the D state, some of them is stuck in wait_transaction_locked().
Currently I have tried some methods to fix this issue, for example, change
MAX_WRITEPAGES_EXTENT_LEN to 256, then we can make sure that one bio will
be generated at most, and this bio will be submitted after journal stop:
if (!ext4_handle_valid(handle) || handle->h_sync == 0) {
ext4_journal_stop(handle);
handle = NULL;
mpd.do_map = 0;
}
/* Submit prepared bio */
ext4_io_submit(&mpd.io_submit);
After this change, the number of tasks in D state will decrease much,
Of course, my method is not good, I just say writing diry pages while holding
open jbd2 handle is not good :) and I believe that your ideas can fix this kind
of problems, thanks again.
Regards,
Xiaoguang Wang
Regards,
Xiaoguang Wang
Honza