On Tue, Sep 19, 2017 at 7:38 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> On Tue, Sep 19, 2017 at 07:29:18PM +0300, Avi Kivity wrote:
>>
>> On 09/19/2017 03:27 PM, Brian Foster wrote:
>> > On Tue, Sep 19, 2017 at 10:50:51AM +0200, Tomasz Grabiec wrote:
>> > > Hi,
>> > >
>> > > On some systems we are seeing one of our tests trigger io_submit()
>> > > calls which block for on the order of 100ms when submitting writes
>> > > [1]. This is problematic, because we rely heavily on io_submit()
>> > > being async.
>> > >
>> > > Workload: open, (ftruncate, append*)*, close.
>> > >
>> > > Kernel version: 4.12.9-300.fc26.x86_64
>> > > mount: /dev/nvme0n1p3 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
>> > >
>> > > The blocking happens in the following places:
>> > >
>> > > (1)
>> > >
>> > > 7fff9287472f __schedule ([kernel.kallsyms])
>> > > 7fff92874d16 schedule ([kernel.kallsyms])
>> > > 7fff92878d42 schedule_timeout ([kernel.kallsyms])
>> > > 7fff92876478 wait_for_completion ([kernel.kallsyms])
>> > > 7fffc05bf231 xfs_buf_submit_wait ([kernel.kallsyms])
>> > > 7fffc05bf3d3 _xfs_buf_read ([kernel.kallsyms])
>> > > 7fffc05bf4e4 xfs_buf_read_map ([kernel.kallsyms])
>> > > 7fffc05f53ca xfs_trans_read_buf_map ([kernel.kallsyms])
>> > > 7fffc058b432 xfs_btree_read_buf_block.constprop.34 ([kernel.kallsyms])
>> > > 7fffc058b504 xfs_btree_lookup_get_block ([kernel.kallsyms])
>> > > 7fffc058f6ad xfs_btree_lookup ([kernel.kallsyms])
>> > > 7fffc0570919 xfs_alloc_lookup_eq ([kernel.kallsyms])
>> > > 7fffc0570c59 xfs_alloc_fixup_trees ([kernel.kallsyms])
>> > > 7fffc0573a2d xfs_alloc_ag_vextent_near ([kernel.kallsyms])
>> > > 7fffc0573db1 xfs_alloc_ag_vextent ([kernel.kallsyms])
>> > > 7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>> > > 7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>> > > 7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>> > > 7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>> > > 7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>> > > 7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>> > > 7fff922d46ca iomap_apply ([kernel.kallsyms])
>> > > 7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>> > > 7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>> > > 7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>> > > 7fff922bc5d3 aio_write ([kernel.kallsyms])
>> > > 7fff922bcec1 do_io_submit ([kernel.kallsyms])
>> > > 7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>> > > 7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>> > > 687 io_submit (/usr/lib64/libaio.so.1.0.1)
>> > > 112373 seastar::reactor::flush_pending_aio (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
>> > >
>> > So you have a direct I/O write that requires block allocation. Block
>> > allocation requires reading free space btree blocks to identify and
>> > fix up the remaining free extent records based on the allocation.
>>
>> Will an fallocate() call before the write in another thread help?
>>
> Preallocating the file (or largish ranges) should help. I'm not sure
> preallocating the range of each and every write will have the behavior
> you want.
>
>> Will a write to a previously fallocate()d extent get blocked while
>> fallocate()ing a new extent?
>>
> Any dio can most likely block behind an fallocate call due to locking
> (just like any write that requires allocation can block behind another
> such write).
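FWIW, the preallocation scheme we'd try based on the above would look
roughly like this -- a minimal sketch with error handling elided; the
1GiB chunk size, the prealloc_end bookkeeping and the helper name are
made up for illustration, not taken from our actual code:

    /* Extend a preallocated region ahead of the append position in
     * large chunks, so that most O_DIRECT appends land in blocks that
     * are already allocated and avoid the allocation path in (1). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    #define PREALLOC_CHUNK (1ULL << 30)     /* 1 GiB, arbitrary */

    static off_t prealloc_end;              /* end of preallocated range */

    static int ensure_preallocated(int fd, off_t pos, off_t len)
    {
            if (pos + len <= prealloc_end)
                    return 0;               /* already covered */

            /* FALLOC_FL_KEEP_SIZE leaves i_size at the written length,
             * so the file does not appear longer than its contents. */
            if (fallocate(fd, FALLOC_FL_KEEP_SIZE, prealloc_end,
                          PREALLOC_CHUNK) < 0)
                    return -1;

            prealloc_end += PREALLOC_CHUNK;
            return 0;
    }

Given your point about dio blocking behind fallocate() due to locking,
we'd presumably have to call this well ahead of the writes rather than
from the submission path itself.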
>
>> >
>> > > (2)
>> > >
>> > > 7fff9287472f __schedule ([kernel.kallsyms])
>> > > 7fff92874d16 schedule ([kernel.kallsyms])
>> > > 7fffc05e6265 _xfs_log_force ([kernel.kallsyms])
>> > > 7fffc05c2518 xfs_extent_busy_flush ([kernel.kallsyms])
>> > > 7fffc0572ccd xfs_alloc_ag_vextent_size ([kernel.kallsyms])
>> > > 7fffc0573d91 xfs_alloc_ag_vextent ([kernel.kallsyms])
>> > > 7fffc05749cb xfs_alloc_vextent ([kernel.kallsyms])
>> > > 7fffc0585ba8 xfs_bmap_btalloc ([kernel.kallsyms])
>> > > 7fffc058605e xfs_bmap_alloc ([kernel.kallsyms])
>> > > 7fffc0586d6d xfs_bmapi_write ([kernel.kallsyms])
>> > > 7fffc05cedd1 xfs_iomap_write_direct ([kernel.kallsyms])
>> > > 7fffc05cf0ec xfs_file_iomap_begin ([kernel.kallsyms])
>> > > 7fff922d46ca iomap_apply ([kernel.kallsyms])
>> > > 7fff922d4dfb iomap_dio_rw ([kernel.kallsyms])
>> > > 7fffc05c4091 xfs_file_dio_aio_write ([kernel.kallsyms])
>> > > 7fffc05c456d xfs_file_write_iter ([kernel.kallsyms])
>> > > 7fff922bc5d3 aio_write ([kernel.kallsyms])
>> > > 7fff922bcec1 do_io_submit ([kernel.kallsyms])
>> > > 7fff922bdd40 sys_io_submit ([kernel.kallsyms])
>> > > 7fff9287a6b7 entry_SYSCALL_64_fastpath ([kernel.kallsyms])
>> > > 687 io_submit (/usr/lib64/libaio.so.1.0.1)
>> > > 112373 seastar::reactor::flush_pending_aio (/home/tgrabiec/src/scylla/build/release/tests/perf/perf_fast_forward_g)
>> > >
>> > Another dio write that requires allocation. The allocation finds a
>> > busy extent, which means the extent was recently freed but the
>> > associated freeing transaction has not yet made it to the on-disk
>> > log. As such it cannot be safely reused, so the allocator flushes
>> > the log and retries, to clear the busy state and find a usable
>> > extent.
>>
>> Is that because the disk is nearly full and there are no known flushed
>> extents, or because the allocator doesn't prioritize known-flushed
>> extents? From your comments below I gather you may not know for sure.
>>
> I'm not sure without digging further into it. Hence the question around
> free space availability.

The file system (165GB) was between 90% and 95% full during the test.
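In case anyone wants to reproduce the measurement: a timed wrapper
around io_submit() is one way to spot the stalls. A stripped-down
sketch, assuming plain libaio; the wrapper name and the 100ms threshold
are illustrative, the threshold just mirrors what we reported above:

    #include <libaio.h>
    #include <stdio.h>
    #include <time.h>

    static long long now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    static int timed_io_submit(io_context_t ctx, long nr,
                               struct iocb **ios)
    {
            long long t0 = now_ns();
            int ret = io_submit(ctx, nr, ios);
            long long dt = now_ns() - t0;

            if (dt > 100 * 1000000LL)   /* stalls of ~100ms or more */
                    fprintf(stderr, "io_submit blocked %lld ms\n",
                            dt / 1000000);
            return ret;
    }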