On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote:
> Hello my dear XFSers,
>
> For those of you who don't know, we at ScyllaDB produce a modern NoSQL
> data store that, at the moment, runs on top of XFS only. We deal
> exclusively with asynchronous and direct IO, due to our thread-per-core
> architecture. Because of that, we avoid issuing any operation that will
> sleep.
>
> While debugging an extreme case of bad performance (most likely related
> to a not-so-great disk), I have found a variety of cases in which XFS
> blocks. To find those, I have used
>
>   perf record -e sched:sched_switch -p <pid_of_db>
>
> and I am attaching the perf report as xfs-sched_switch.log. Please note
> that this doesn't tell me for how long we block, but as mentioned
> before, blocking operations outside our control are detrimental to us
> regardless of the elapsed time.
>
> For those who are not acquainted with our internals, please ignore
> everything in that file except the xfs functions. For the xfs symbols,
> there are two kinds of events: the ones that are children of io_submit,
> where we don't tolerate blocking, and the ones that are children of our
> helper IO thread, to which we push big operations that we know will
> block until we can get rid of them all. We care about the former and
> ignore the latter.
>
> Please allow me to ask you a couple of questions about those findings.
> If we are doing anything wrong, advice on best practices is truly
> welcome.
>
> 1) xfs_buf_lock -> xfs_log_force.
>
> I've started wondering what would make xfs_log_force sleep. But then I
> noticed that xfs_log_force will only be called when a buffer is marked
> stale. Most of the time a buffer is marked stale it seems to be due to
> errors. Although that is not my case (more on that below), it got me
> thinking that maybe the right thing to do would be to avoid hitting
> this case altogether?
The buffer is stale because it has recently been freed, and we cannot
re-use it until the transaction that freed it has been committed to the
journal. e.g. this trace:

  --3.15%-- _xfs_log_force
            xfs_log_force
            xfs_buf_lock
            _xfs_buf_find
            xfs_buf_get_map
            xfs_trans_get_buf_map
            xfs_btree_get_bufl
            xfs_bmap_extents_to_btree
            xfs_bmap_add_extent_hole_real
            xfs_bmapi_write
            xfs_iomap_write_direct
            __xfs_get_blocks
            xfs_get_blocks_direct
            do_blockdev_direct_IO
            __blockdev_direct_IO
            xfs_vm_direct_IO
            xfs_file_dio_aio_write
            xfs_file_write_iter
            aio_run_iocb
            do_io_submit
            sys_io_submit
            entry_SYSCALL_64_fastpath
            io_submit
            0x46d98a

implies something like this has happened:

  truncate
    free extent
    extent list now fits inline in inode
    btree-to-extent format change
      free last bmbt block X
      mark block contents stale
      add block to busy extent list
      place block on AGFL

  AIO write
    allocate extent
    inline extents full
    extent-to-btree conversion
      allocate bmbt block
      grab block X from free list
      get locked buffer for block X
        xfs_buf_lock
          buffer stale
          log force to commit previous free transaction to disk
          .....
          log force completes
          buffer removed from busy extent list
          buffer no longer stale
      add bmbt record to new block
      update btree indexes
      ....

And, looking at the trace you attached, we see this:

  dump_stack
  xfs_buf_stale
  xfs_trans_binval
  xfs_bmap_btree_to_extents  <<<<<<<<<
  xfs_bunmapi
  xfs_itruncate_extents

So I'd say this is a pretty clear indication that we're immediately
recycling freed blocks from the free list here....

> The file example-stale.txt contains a backtrace of the case where we
> are being marked as stale. It seems to be happening when we convert
> the inode's extents from unwritten to real. Can this case be avoided?
> I won't pretend I know the intricacies of this, but couldn't we be
> keeping extents from the very beginning to avoid creating stale
> buffers?
>
> 2) xfs_buf_lock -> down
>
> This is one I truly don't understand. What can be causing contention
> in this lock?
> We never have two different cores writing to the same buffer, nor
> should we have the same core doing so.

As Brian pointed out, this is probably AGF or AGI contention -
attempting to allocate/free extents or inodes in the same AG at the
same time will show this sort of pattern.

This trace shows AGF contention:

  down
  xfs_buf_lock
  _xfs_buf_find
  xfs_buf_get_map
  xfs_buf_read_map
  xfs_trans_read_buf_map
  xfs_read_agf
  .....

This trace shows AGI contention:

  down
  xfs_buf_lock
  _xfs_buf_find
  xfs_buf_get_map
  xfs_buf_read_map
  xfs_trans_read_buf_map
  xfs_read_agi
  ....

> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time

This trace?:

  rwsem_down_write_failed
  call_rwsem_down_write_failed
  xfs_ilock
  xfs_vn_update_time
  file_update_time
  xfs_file_aio_write_checks
  xfs_file_dio_aio_write
  xfs_file_write_iter
  aio_run_iocb
  do_io_submit
  sys_io_submit
  entry_SYSCALL_64_fastpath
  io_submit
  0x46d98a

Which is an mtime timestamp update racing with another operation that
takes the internal metadata lock (e.g. block mapping/allocation for
that inode).

> You guys seem to have an interface to avoid that, by setting the
> FMODE_NOCMTIME flag.

We've talked about exposing this through open() for Ceph:

http://www.kernelhub.org/?p=2&msg=744325
https://lkml.org/lkml/2015/5/15/671

Read the first thread for why it's problematic to expose this to
userspace - I won't repeat it all here.

As it is, some code was recently hacked into ext4 to reduce mtime
overhead - the MS_LAZYTIME superblock option. What it does is prevent
the inode from being marked dirty when timestamps are updated, and
hence the timestamps are never journalled or written back until
something else marks the inode metadata dirty (e.g. block allocation).
ext4 gets away with this because it doesn't actually journal timestamp
changes - they get captured in the journal by other modifications that
are journalled - but it still relies on the inode being marked dirty
for fsync, writeback and inode cache eviction to do the right thing.
The ext4 implementation looks like the timestamp updates can be thrown
away, as the inodes are not marked dirty, and so on memory pressure they
will simply be reclaimed without writing back the updated timestamps
that are held in memory. I suspect fsync will also have problems on
ext4, as the inode is not metadata dirty or journalled, and hence the
timestamp changes will never get written back.

And, IMO, the worst part of the ext4 implementation is that the inode
buffer writeback code in ext4 now checks to see if any of the other
inodes in the buffer being written back need to have their timestamps
updated. IOWs, ext4 now does writeback of /unjournalled metadata/ to
inodes that are purely timestamp dirty.

We /used/ to do shit like this in XFS. We got rid of it in preference
to journalling everything, because the corner cases in log recovery
meant that after a crash the inodes were in inconsistent states, and
that meant we had unexpected, unpredictable recovery behaviour where
files weren't the expected size and/or didn't contain the expected
data. Hence going back to the bad old days of hacking around the
journal "for speed" doesn't exactly fill me with joy.

Let me have a think about how we can implement lazytime in a sane way,
such that fsync() works correctly, we don't throw away timestamp
changes in memory reclaim, and we don't write unlogged changes to the
on-disk locations....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs