On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote:
> Hello my dear XFSers,
>
> For those of you who don't know, we at ScyllaDB produce a modern NoSQL
> data store that, at the moment, runs on top of XFS only. We deal
> exclusively with asynchronous and direct IO, due to our thread-per-core
> architecture. Because of that, we avoid issuing any operation that will
> sleep.
>
> While debugging an extreme case of bad performance (most likely related
> to a not-so-great disk), I have found a variety of cases in which XFS
> blocks. To find those, I have used
>
>   perf record -e sched:sched_switch -p <pid_of_db>
>
> and I am attaching the perf report as xfs-sched_switch.log. Please note
> that this doesn't tell me for how long we block, but as mentioned
> before, blocking operations outside our control are detrimental to us
> regardless of the elapsed time.
>
> For those who are not acquainted with our internals, please ignore
> everything in that file except the xfs functions. For the xfs symbols,
> there are two kinds of events: the ones that are children of io_submit,
> where we don't tolerate blocking, and the ones that are children of our
> helper IO thread, to which we push big operations that we know will
> block until we can get rid of them all. We care about the former and
> ignore the latter.
>
> Please allow me to ask you a couple of questions about those findings.
> If we are doing anything wrong, advice on best practices is truly
> welcome.
>
> 1) xfs_buf_lock -> xfs_log_force.
>
> I've started wondering what would make xfs_log_force sleep. But then I
> noticed that xfs_log_force will only be called when a buffer is marked
> stale. Most of the time a buffer is marked stale it seems to be due to
> errors. Although that is not my case (more on that below), it got me
> thinking that maybe the right thing to do would be to avoid hitting
> this case altogether?
The buffer is stale because it has recently been freed, and we cannot
re-use it until the transaction that freed it has been committed to the
journal. e.g. this trace:

  --3.15%-- _xfs_log_force
            xfs_log_force
            xfs_buf_lock
            _xfs_buf_find
            xfs_buf_get_map
            xfs_trans_get_buf_map
            xfs_btree_get_bufl
            xfs_bmap_extents_to_btree
            xfs_bmap_add_extent_hole_real
            xfs_bmapi_write
            xfs_iomap_write_direct
            __xfs_get_blocks
            xfs_get_blocks_direct
            do_blockdev_direct_IO
            __blockdev_direct_IO
            xfs_vm_direct_IO
            xfs_file_dio_aio_write
            xfs_file_write_iter
            aio_run_iocb
            do_io_submit
            sys_io_submit
            entry_SYSCALL_64_fastpath
            io_submit
            0x46d98a

implies something like this has happened:

  truncate
    free extent
    extent list now fits inline in inode
    btree-to-extent format change
      free last bmbt block X
      mark block contents stale
      add block to busy extent list
      place block on AGFL

  AIO write
    allocate extent
    inline extents full
    extent-to-btree conversion
      allocate bmbt block
      grab block X from free list
      get locked buffer for block X
        xfs_buf_lock
          buffer stale
          log force to commit previous free transaction to disk
          .....
          log force completes
          buffer removed from busy extent list
          buffer no longer stale
      add bmbt record to new block
      update btree indexes
      ....

And, looking at the trace you attached, we see this:

  dump_stack
  xfs_buf_stale
  xfs_trans_binval
  xfs_bmap_btree_to_extents  <<<<<<<<<
  xfs_bunmapi
  xfs_itruncate_extents

So I'd say this is a pretty clear indication that we're immediately
recycling freed blocks from the free list here....

> The file example-stale.txt contains a backtrace of the case where we
> are being marked as stale. It seems to be happening when we convert
> the inode's extents from unwritten to real. Can this case be avoided?
> I won't pretend I know the intricacies of this, but couldn't we be
> keeping extents from the very beginning to avoid creating stale
> buffers?
>
> 2) xfs_buf_lock -> down
>
> This is one I truly don't understand. What can be causing contention
> in this lock?
> We never have two different cores writing to the same buffer, nor
> should we have the same core doing so.

As Brian pointed out, this is probably AGF or AGI contention -
attempting to allocate/free extents or inodes in the same AG at the
same time will show this sort of pattern.

This trace shows AGF contention:

  down
  xfs_buf_lock
  _xfs_buf_find
  xfs_buf_get_map
  xfs_buf_read_map
  xfs_trans_read_buf_map
  xfs_read_agf
  .....

This trace shows AGI contention:

  down
  xfs_buf_lock
  _xfs_buf_find
  xfs_buf_get_map
  xfs_buf_read_map
  xfs_trans_read_buf_map
  xfs_read_agi
  ....

> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time

This trace?:

  rwsem_down_write_failed
  call_rwsem_down_write_failed
  xfs_ilock
  xfs_vn_update_time
  file_update_time
  xfs_file_aio_write_checks
  xfs_file_dio_aio_write
  xfs_file_write_iter
  aio_run_iocb
  do_io_submit
  sys_io_submit
  entry_SYSCALL_64_fastpath
  io_submit
  0x46d98a

Which is an mtime timestamp update racing with another operation that
takes the internal metadata lock (e.g. block mapping/allocation for
that inode).

> You guys seem to have an interface to avoid that, by setting the
> FMODE_NOCMTIME flag.

We've talked about exposing this through open() for Ceph:

http://www.kernelhub.org/?p=2&msg=744325
https://lkml.org/lkml/2015/5/15/671

Read the first thread for why it's problematic to expose this to
userspace - I won't repeat it all here.

As it is, some code was recently hacked into ext4 to reduce mtime
overhead - the MS_LAZYTIME superblock option. What it does is prevent
the inode from being marked dirty when timestamps are updated, and
hence the timestamps are never journalled or written back until
something else marks the inode metadata dirty (e.g. block allocation).
ext4 gets away with this because it doesn't actually journal timestamp
changes - they get captured in the journal by other modifications that
are journalled - but it still relies on the inode being marked dirty
for fsync, writeback and inode cache eviction to do the right thing.
The ext4 implementation looks like the timestamp updates can be thrown
away, as the inodes are not marked dirty, and so on memory pressure they
will simply be reclaimed without writing back the updated timestamps
that are held in memory. I suspect fsync will also have problems on
ext4, as the inode is not metadata dirty or journalled, and hence the
timestamp changes will never get written back.

And, IMO, the worst part of the ext4 implementation is that the inode
buffer writeback code in ext4 now checks to see if any of the other
inodes in the buffer being written back need to have their timestamps
updated. IOWs, ext4 now does writeback of /unjournalled metadata/ to
inodes that are purely timestamp dirty.

We /used/ to do shit like this in XFS. We got rid of it in preference
to journalling everything, because the corner cases in log recovery
meant that after a crash the inodes were in inconsistent states, and
that meant we had unexpected, unpredictable recovery behaviour where
files weren't the expected size and/or didn't contain the expected
data. Hence going back to the bad old days of hacking around the
journal "for speed" doesn't exactly fill me with joy.

Let me have a think about how we can implement lazytime in a sane way,
such that fsync() works correctly, we don't throw away timestamp
changes in memory reclaim, and we don't write unlogged changes to the
on-disk locations....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs