On Tue, Jan 22, 2019 at 10:30 PM Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
>
> On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
> > With "nodelalloc", blocks are allocated at the time of writing, and with
> > "dioread_nolock", these allocated blocks are marked as unwritten as well,
> > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
>
> I've been looking at your patches, and it seems that a simpler way,
> perhaps more maintainable approach in the long term is to change how
> we write to newly allocated blocks. Today, we have two ways of doing
> this:
>
> 1) In the dioread_nolock case, we allocate blocks, insert an entry in
>    the extent tree with the blocks marked uninitialized, write the
>    blocks, and then mark the blocks initialized.
>
> 2) In the !dioread_nolock case, we allocate blocks, insert an entry to
>    the extent tree, write the blocks --- and if we start a commit, we
>    write out all dirty pages associated with that inode (in the default
>    data=writeback case) to avoid stale writes.
>
> So what if we change the dioread_nolock case to do write the blocks
> first, and *then* insert the entry into the extent tree? This avoids
> stale data getting exposed, either by a direct I/O read, or after a
> crash (which means we avoid needing to do the force write-out).
>
> So what we would need to do is to pass a flag to ext4_map_blocks()
> which causes it to *not* make any on-disk changes. Instead, it would
> track the fact that blocks have be reserved in the buddy bitmap (this
> is how we prevent blocks from being preallocated after they are
> deleted, but before the transaction has been committed), and the
> location of the assigned blocks in the extent_status tree. Since no
> on-disk changes are being made, we wouldn't need to hold the
> transaction open.
>
> Then in the callback after the blocks are written, using the starting
> logical block number stored in the io_end structure, we either convert
> the unwritten extents or actually insert the newly allocated blocks in
> the extent tree and update the on-disk bitmap allocation bitmaps.
>
> Once we get this working, it should be easy to make dioread_nolock for
> 1k block sizes; it keeps the time that the handle open very short; and
> it completely obviates the need for data=writeback.
>
> What do folks think?
>

So that ordering (reserve, write, insert extent records) is basically
what btrfs does, and I feel it will work better than the current
approach.

My only concern is performance: the metadata reservation for delalloc
becomes larger and has to be carried until endio, so a performance
spike could appear if the foreground writer has to wait for dirty
pages to be flushed in order to reclaim metadata credits.

thanks,
liubo
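
For reference, the (reserve, write, insert extent records) ordering
discussed above can be sketched as a toy user-space model. This is
Python, not kernel code: ToyFs and every field and method on it are
hypothetical names invented for illustration, not ext4 or btrfs
interfaces, and journaling, locking, and metadata credits are left out
entirely.

```python
class ToyFs:
    """Toy model of 'reserve, write, then insert the extent record'.

    All names here are hypothetical; nothing maps to real ext4 structures.
    """

    def __init__(self, nblocks):
        self.bitmap_free = set(range(nblocks))   # stands in for the on-disk bitmap
        self.reserved = set()                    # in-memory reservations only
        self.extent_tree = {}                    # "on-disk" logical -> physical map
        self.pending = {}                        # in-memory logical -> physical map
        # Every physical block starts out holding stale data from an old file.
        self.disk = {p: b"stale" for p in range(nblocks)}

    def reserve(self, lblk):
        """Step 1: choose a block and record it in memory; no 'on-disk' change.

        Raises ValueError if no free, unreserved block is left.
        """
        pblk = min(self.bitmap_free - self.reserved)
        self.reserved.add(pblk)
        self.pending[lblk] = pblk
        return pblk

    def write(self, lblk, data):
        """Step 2: write the data to the reserved physical block."""
        self.disk[self.pending[lblk]] = data

    def end_io(self, lblk):
        """Step 3: after the write completes, publish the mapping 'on disk'."""
        pblk = self.pending.pop(lblk)
        self.reserved.discard(pblk)
        self.bitmap_free.discard(pblk)
        self.extent_tree[lblk] = pblk

    def read(self, lblk):
        """Readers only see blocks that are in the extent tree."""
        pblk = self.extent_tree.get(lblk)
        return None if pblk is None else self.disk[pblk]


fs = ToyFs(4)
fs.reserve(0)
before = fs.read(0)      # mapping not published yet, so this reads as a hole
fs.write(0, b"new")
fs.end_io(0)
after = fs.read(0)       # mapping published only after the data was written
```

In this model a reader can never observe the stale contents of a newly
assigned block, because the mapping is not visible until end_io() runs.
The cost is that the in-memory reservation has to be held from
reserve() until end_io(), which is the window the performance concern
above is about.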