Re: [PATCH v2 2/2] ext4: fix slow writeback under dioread_nolock and nodelalloc

On Tue, Jan 22, 2019 at 10:30 PM Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
>
> On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
> > With "nodelalloc", blocks are allocated at the time of writing, and with
> > "dioread_nolock", these allocated blocks are marked as unwritten as well,
> > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
>
> I've been looking at your patches, and it seems that a simpler, and
> perhaps more maintainable, approach in the long term is to change how
> we write to newly allocated blocks.  Today, we have two ways of doing
> this:
>
> 1) In the dioread_nolock case, we allocate blocks, insert an entry in
> the extent tree with the blocks marked uninitialized, write the
> blocks, and then mark the blocks initialized.
>
> 2) In the !dioread_nolock case, we allocate blocks, insert an entry into
> the extent tree, write the blocks --- and if we start a commit, we
> write out all dirty pages associated with that inode (in the default
> data=ordered case) to avoid stale writes.
>
> So what if we change the dioread_nolock case to write the blocks
> first, and *then* insert the entry into the extent tree?  This avoids
> stale data getting exposed, either by a direct I/O read, or after a
> crash (which means we avoid needing to do the force write-out).
>
> So what we would need to do is to pass a flag to ext4_map_blocks()
> which causes it to *not* make any on-disk changes.  Instead, it would
> track the fact that blocks have been reserved in the buddy bitmap (this
> is how we prevent blocks from being preallocated after they are
> deleted, but before the transaction has been committed), and the
> location of the assigned blocks in the extent_status tree.  Since no
> on-disk changes are being made, we wouldn't need to hold the
> transaction open.
>
> Then in the callback after the blocks are written, using the starting
> logical block number stored in the io_end structure, we either convert
> the unwritten extents or actually insert the newly allocated blocks in
> the extent tree and update the on-disk allocation bitmaps.
>
> Once we get this working, it should be easy to make dioread_nolock work
> for 1k block sizes; it keeps the time that the handle is open very short;
> and it completely obviates the need for data=ordered.
>
> What do folks think?
>

So that (reserve, write, insert extent records) is basically what
btrfs is doing and I feel like it will work better than the current
way.
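
To make the ordering concrete, here is a toy userspace model of the
reserve / write / insert sequence. It is only a sketch and not ext4 or
btrfs code: every name in it (toy_fs, toy_reserve, toy_write,
toy_end_io, toy_read) is hypothetical, the arrays merely stand in for
the buddy bitmap, the extent_status tree, the on-disk bitmap and the
extent tree, and locking, journaling and error handling are left out.

```c
/* Toy model only: NOT kernel code, all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS  16
#define NO_BLOCK (-1)

struct toy_fs {
    bool in_core_bitmap[NBLOCKS]; /* stands in for the buddy bitmap */
    bool on_disk_bitmap[NBLOCKS]; /* stands in for the on-disk bitmap */
    int  extent_status[NBLOCKS];  /* logical -> physical, memory only */
    int  extent_tree[NBLOCKS];    /* logical -> physical, "on disk" */
    char disk[NBLOCKS];           /* one byte of data per block */
};

void toy_init(struct toy_fs *fs)
{
    memset(fs, 0, sizeof(*fs));
    memset(fs->disk, 'S', sizeof(fs->disk)); /* 'S' = stale contents */
    for (int i = 0; i < NBLOCKS; i++) {
        fs->extent_status[i] = NO_BLOCK;
        fs->extent_tree[i] = NO_BLOCK;
    }
}

/* Step 1: reserve a physical block in memory; nothing on disk changes. */
int toy_reserve(struct toy_fs *fs, int lblk)
{
    for (int p = 0; p < NBLOCKS; p++) {
        if (!fs->in_core_bitmap[p]) {
            fs->in_core_bitmap[p] = true;
            fs->extent_status[lblk] = p;
            return p;
        }
    }
    return NO_BLOCK; /* no free block left */
}

/* Step 2: write the data to the reserved block. */
void toy_write(struct toy_fs *fs, int pblk, char data)
{
    fs->disk[pblk] = data;
}

/* Step 3: I/O completion; only now publish the on-disk metadata. */
void toy_end_io(struct toy_fs *fs, int lblk)
{
    int p = fs->extent_status[lblk];

    assert(p != NO_BLOCK);
    fs->on_disk_bitmap[p] = true;
    fs->extent_tree[lblk] = p;
}

/* A reader that trusts only the extent tree; 0 means "hole". */
char toy_read(const struct toy_fs *fs, int lblk)
{
    int p = fs->extent_tree[lblk];

    return p == NO_BLOCK ? 0 : fs->disk[p];
}
```

In this model, calling toy_read() for a logical block at any point
before toy_end_io() has run returns a hole rather than the stale 'S'
byte, which is the property the proposal is after; a crash before
step 3 would likewise leave no on-disk mapping to the block.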

My only concern is performance: the metadata reservation for delalloc
becomes larger and has to be carried until endio, so a latency spike
could appear if the foreground writer has to wait for dirty pages to
be flushed in order to reclaim metadata credits.

thanks,
liubo


