Re: sleeps and waits during io_submit

Hi Dave


On Mon, Nov 30, 2015 at 6:10 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote:
>> Hello my dear XFSers,
>>
>> For those of you who don't know, we at ScyllaDB produce a modern NoSQL
>> data store that, at the moment, runs on top of XFS only. We deal
>> exclusively with asynchronous and direct IO, due to our
>> thread-per-core architecture. Due to that, we avoid issuing any
>> operation that will sleep.
>>
>> While debugging an extreme case of bad performance (most likely
>> related to a not-so-great disk), I have found a variety of cases in
>> which XFS blocks. To find those, I have used perf record -e
>> sched:sched_switch -p <pid_of_db>, and I am attaching the perf report
>> as xfs-sched_switch.log. Please note that this doesn't tell me for how
>> long we block, but as mentioned before, blocking operations outside
>> our control are detrimental to us regardless of the elapsed time.
>>
>> For those who are not acquainted with our internals, please ignore
>> everything in that file but the xfs functions. For the xfs symbols,
>> there are two kinds of events: the ones that are a children of
>> io_submit, where we don't tolerate blocking, and the ones that are
>> children of our helper IO thread, to where we push big operations that
>> we know will block until we can get rid of them all. We care about the
>> former and ignore the latter.
>>
>> Please allow me to ask you a couple of questions about those findings.
>> If we are doing anything wrong, advise on best practices is truly
>> welcome.
>>
>> 1) xfs_buf_lock -> xfs_log_force.
>>
>> I've started wondering what would make xfs_log_force sleep. But then I
>> have noticed that xfs_log_force will only be called when a buffer is
>> marked stale. Most of the time, a buffer seems to be marked stale
>> due to errors. Although that is not my case (more on that), it got me
>> thinking that maybe the right thing to do would be to avoid hitting
>> this case altogether?
>
> The buffer is stale because it has recently been freed, and we
> cannot re-use it until the transaction that freed it has been
> committed to the journal. e.g. this trace:
>
>    --3.15%-- _xfs_log_force
>              xfs_log_force
>              xfs_buf_lock
>              _xfs_buf_find
>              xfs_buf_get_map
>              xfs_trans_get_buf_map
>              xfs_btree_get_bufl
>              xfs_bmap_extents_to_btree
>              xfs_bmap_add_extent_hole_real
>              xfs_bmapi_write
>              xfs_iomap_write_direct
>              __xfs_get_blocks
>              xfs_get_blocks_direct
>              do_blockdev_direct_IO
>              __blockdev_direct_IO
>              xfs_vm_direct_IO
>              xfs_file_dio_aio_write
>              xfs_file_write_iter
>              aio_run_iocb
>              do_io_submit
>              sys_io_submit
>              entry_SYSCALL_64_fastpath
>              io_submit
>              0x46d98a
>
> implies something like this has happened:
>
> truncate
>   free extent
>     extent list now fits inline in inode
>     btree-to-extent format change
>       free last bmbt block X
>         mark block contents stale
>         add block to busy extent list
>         place block on AGFL
>
> AIO write
>   allocate extent
>     inline extents full
>     extent-to-btree conversion
>         allocate bmbt block
>           grab block X from free list
>           get locked buffer for block X
>             xfs_buf_lock
>               buffer stale
>                 log force to commit previous free transaction to disk
>                 .....
>                 log force completes
>                   buffer removed from busy extent list
>                   buffer no longer stale
>          add bmbt record to new block
>          update btree indexes
>          ....
>
>
> And, looking at the trace you attached, we see this:
>
> dump_stack
> xfs_buf_stale
> xfs_trans_binval
> xfs_bmap_btree_to_extents               <<<<<<<<<
> xfs_bunmapi
> xfs_itruncate_extents
>
> So I'd say this is a pretty clear indication that we're immediately
> recycling freed blocks from the free list here....
>
>> The file example-stale.txt contains a backtrace of the case where we
>> are being marked as stale. It seems to be happening when we convert
>> the inode's extents from unwritten to real. Can this case be
>> avoided? I won't pretend I know the intricacies of this, but couldn't
>> we be keeping extents from the very beginning to avoid creating stale
>> buffers?
>>
>> 2) xfs_buf_lock -> down
>> This is one I truly don't understand. What can be causing contention
>> in this lock? We never have two different cores writing to the same
>> buffer, nor should we have the same core doing so.
>
> As Brian pointed out, this is probably AGF or AGI contention -
> attempting to allocate/free extents or inodes in the same AG at the
> same time will show this sort of pattern. This trace shows AGF
> contention:
>
>   down
>   xfs_buf_lock
>   _xfs_buf_find
>   xfs_buf_get_map
>   xfs_buf_read_map
>   xfs_trans_read_buf_map
>   xfs_read_agf
>   .....
>
> This trace shows AGI contention:
>
>   down
>   xfs_buf_lock
>   _xfs_buf_find
>   xfs_buf_get_map
>   xfs_buf_read_map
>   xfs_trans_read_buf_map
>   xfs_read_agi
>   ....
>

Great. I will take a look at how we can mitigate those on our side. I
will need some time to understand all of that better, so for now I'd
just leave you guys with a big thank you.

>> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time
>
> This trace?:
>
>  rwsem_down_write_failed
>  call_rwsem_down_write_failed
>  xfs_ilock
>  xfs_vn_update_time
>  file_update_time
>  xfs_file_aio_write_checks
>  xfs_file_dio_aio_write
>  xfs_file_write_iter
>  aio_run_iocb
>  do_io_submit
>  sys_io_submit
>  entry_SYSCALL_64_fastpath
>  io_submit
>  0x46d98a
>
> Which is an mtime timestamp update racing with another operation
> that takes the internal metadata lock (e.g. block mapping/allocation
> for that inode).
>
>> You guys seem to have an interface to avoid that, by setting the
>> FMODE_NOCMTIME flag.
>
> We've talked about exposing this through open() for Ceph.
>
> http://www.kernelhub.org/?p=2&msg=744325
> https://lkml.org/lkml/2015/5/15/671
>
> Read the first thread for why it's problematic to expose this to
> userspace - I won't repeat it all here.
>
> As it is, there was some code recently hacked into ext4 to reduce
> mtime overhead - the MS_LAZYTIME superblock option. What it does is
> prevent the inode from being marked dirty when timestamps are
> updated and hence the timestamps are never journalled or written
> until something else marks the inode metadata dirty (e.g. block
> allocation). ext4 gets away with this because it doesn't actually
> journal timestamp changes - they get captured in the journal by
> other modifications that are journalled, but still rely on the inode
> being marked dirty for fsync, writeback and inode cache eviction
> doing the right thing.
>
> The ext4 implementation looks like the timestamp updates can be
> thrown away, as the inodes are not marked dirty and so on memory
> pressure they will simply be reclaimed without writing back the
> updated timestamps that are held in memory. I suspect fsync will
> also have problems on ext4 as the inode is not metadata dirty or
> journalled, and hence the timestamp changes will never get written
> back.
>
> And, IMO, the worst part of the ext4 implementation is that the
> inode buffer writeback code in ext4 now checks to see if any of the
> other inodes in the buffer being written back need to have their
> inode timestamps updated. IOWs, ext4 now does writeback of
> /unjournalled metadata/ to inodes that are purely timestamp dirty.
>
> We /used/ to do shit like this in XFS. We got rid of it in
> preference of journalling everything because the corner cases in log
> recovery meant that after a crash the inodes were in inconsistent
> states, and that meant we had unexpected, unpredictable recovery
> behaviour where files weren't the expected size and/or didn't
> contain the expected data. Hence going back to the bad old days of
> hacking around the journal "for speed" doesn't exactly fill me with
> joy.
>
> Let me have a think about how we can implement lazytime in a sane
> way, such that fsync() works correctly, we don't throw away
> timestamp changes in memory reclaim and we don't write unlogged
> changes to the on-disk locations....
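For anyone following along: the lazytime behaviour discussed above is exposed as a mount option (ext4-only as of Linux 4.0; XFS does not honour it at the time of writing). Device and mountpoint below are placeholders:

```shell
# lazytime is a generic VFS mount option, but only ext4 implements it
# as of Linux 4.0; shown here for reference only.
mount -o lazytime /dev/sdb1 /mnt/data
# or retrofit an already-mounted filesystem:
mount -o remount,lazytime /mnt/data
```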

I trust you fully for matters related to speed.

Keep in mind, though, that at least for us the fact that it blocks is
a lot worse than the fact that it is slow. We can work around slow,
but blocking basically means that we won't have any more work to push
- since we don't do threading. The processor that stalls just sits
idle until the lock is released. So any non-blocking solution to this
would already be a win for us.



>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
