Hi Dave On Mon, Nov 30, 2015 at 6:10 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote: >> Hello my dear XFSers, >> >> For those of you who don't know, we at ScyllaDB produce a modern NoSQL >> data store that, at the moment, runs on top of XFS only. We deal >> exclusively with asynchronous and direct IO, due to our >> thread-per-core architecture. Due to that, we avoid issuing any >> operation that will sleep. >> >> While debugging an extreme case of bad performance (most likely >> related to a not-so-great disk), I have found a variety of cases in >> which XFS blocks. To find those, I have used perf record -e >> sched:sched_switch -p <pid_of_db>, and I am attaching the perf report >> as xfs-sched_switch.log. Please note that this doesn't tell me for how >> long we block, but as mentioned before, blocking operations outside >> our control are detrimental to us regardless of the elapsed time. >> >> For those who are not acquainted to our internals, please ignore >> everything in that file but the xfs functions. For the xfs symbols, >> there are two kinds of events: the ones that are a children of >> io_submit, where we don't tolerate blocking, and the ones that are >> children of our helper IO thread, to where we push big operations that >> we know will block until we can get rid of them all. We care about the >> former and ignore the latter. >> >> Please allow me to ask you a couple of questions about those findings. >> If we are doing anything wrong, advise on best practices is truly >> welcome. >> >> 1) xfs_buf_lock -> xfs_log_force. >> >> I've started wondering what would make xfs_log_force sleep. But then I >> have noticed that xfs_log_force will only be called when a buffer is >> marked stale. Most of the times a buffer is marked stale seems to be >> due to errors. Although that is not my case (more on that), it got me >> thinking that maybe the right thing to do would be to avoid hitting >> this case altogether? > > The buffer is stale because it has recently been freed, and we > cannot re-use it until the transaction that freed it has been > committed to the journal. e.g. this trace: > > --3.15%-- _xfs_log_force > xfs_log_force > xfs_buf_lock > _xfs_buf_find > xfs_buf_get_map > xfs_trans_get_buf_map > xfs_btree_get_bufl > xfs_bmap_extents_to_btree > xfs_bmap_add_extent_hole_real > xfs_bmapi_write > xfs_iomap_write_direct > __xfs_get_blocks > xfs_get_blocks_direct > do_blockdev_direct_IO > __blockdev_direct_IO > xfs_vm_direct_IO > xfs_file_dio_aio_write > xfs_file_write_iter > aio_run_iocb > do_io_submit > sys_io_submit > entry_SYSCALL_64_fastpath > io_submit > 0x46d98a > > implies something like this has happened: > > truncate > free extent > extent list now fits inline in inode > btree-to-extent format change > free last bmbt block X > mark block contents stale > add block to busy extent list > place block on AGFL > > AIO write > allocate extent > inline extents full > extent-to-btree conversion > allocate bmbt block > grab block X from free list > get locked buffer for block X > xfs_buf_lock > buffer stale > log force to commit previous free transaction to disk > ..... > log force completes > buffer removed from busy extent list > buffer no longer stale > add bmbt record to new block > update btree indexes > .... > > > And, looking at the trace you attached, we see this: > > dump_stack > xfs_buf_stale > xfs_trans_binval > xfs_bmap_btree_to_extents <<<<<<<<< > xfs_bunmapi > xfs_itruncate_extents > > So I'd say this is a pretty clear indication that we're immediately > recycling freed blocks from the free list here.... > >> The file example-stale.txt contains a backtrace of the case where we >> are being marked as stale. It seems to be happening when we convert >> the the inode's extents from unwritten to real. Can this case be >> avoided? I won't pretend I know the intricacies of this, but couldn't >> we be keeping extents from the very beginning to avoid creating stale >> buffers? >> >> 2) xfs_buf_lock -> down >> This is one I truly don't understand. What can be causing contention >> in this lock? We never have two different cores writing to the same >> buffer, nor should we have the same core doingCAP_FOWNER so. > > As Brian pointed out, this is probably AGF or AGI contention - > attempting to allocate/free extents or inodes in the same AG at the > same time will show this sort of pattern. This trace shows AGF > contention: > > down > xfs_buf_lock > _xfs_buf_find > xfs_buf_get_map > xfs_buf_read_map > xfs_trans_read_buf_map > xfs_read_agf > ..... > > This trace shows AGI contention: > > down > xfs_buf_lock > _xfs_buf_find > xfs_buf_get_map > xfs_buf_read_map > xfs_trans_read_buf_map > xfs_read_agi > .... > Great. I will take a look at how can we mitigate those on our side. I will need some time to understand all of that better, so for now I'd just leave you guys with a big thank you. >> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time > > This trace?: > > rwsem_down_write_failed > call_rwsem_down_write_failed > xfs_ilock > xfs_vn_update_time > file_update_time > xfs_file_aio_write_checks > xfs_file_dio_aio_write > xfs_file_write_iter > aio_run_iocb > do_io_submit > sys_io_submit > entry_SYSCALL_64_fastpath > io_submit > 0x46d98a > > Which is an mtime timestamp update racing with another operation > that takes the internal metadata lock (e.g. block mapping/allocation > for that inode). > >> You guys seem to have an interface to avoid that, by setting the >> FMODE_NOCMTIME flag. > > We've talked about exposing this through open() for Ceph. > > http://www.kernelhub.org/?p=2&msg=744325 > https://lkml.org/lkml/2015/5/15/671 > > Read the first thread for why it's problematic to expose this to > userspace - I won't repeat it all here. > > As it is, there was some code recently hacked into ext4 to reduce > mtime overhead - the MS_LAZYTIME superblock option. What it does it > prevent the inode from being marked dirty when timestamps are > updated and hence the timestamps are never journalled or written > until something else marks the inode metadata dirty (e.g. block > allocation). ext4 gets away with this because it doesn't actually > journal timestamp changes - they get captured in the journal by > other modifications that are journalled, but still rely on the inod > being marked dirty for fsync, writeback and inode cache eviction > doing the right thing. > > The ext4 implementation looks like the timestamp updates can be > thrown away, as the inodes are not marked dirty and so on memory > pressure they will simply be reclaimed without writing back the > updated timestamps that are held in memory. I suspect fsync will > also have problems on ext4 as the inode is not metadata dirty or > journalled, and hence the timestamp changes will never get written > back. > > And, IMO, the worst part of the ext4 implementation is that the > inode buffer writeback code in ext4 now checks to see if any of the > other inodes in the buffer being written back need to have their > inode timestamps updated. IOWs, ext4 now does writeback of > /unjournalled metadata/ to inodes that are purely timestamp dirty. > > We /used/ to do shit like this in XFS. We got rid of it in > preference of journalling everything because the corner cases in log > recovery meant that after a crash the inodes were in inconsistent > states, and that meant we had unexpected, unpredictable recovery > behaviour where files weren't the expected size and/or didn't > contain the expected data. Hence going back to the bad old days of > hacking around the journal "for speed" doesn't exactly fill me with > joy. > > Let me have a think about how we can implement lazytime in a sane > way, such that fsync() works correctly, we don't throw away > timstamp changes in memory reclaim and we don't write unlogged > changes to the on-disk locations.... I trust you fully for matters related to speed. Keep in mind, though, that at least for us the fact that it blocks is a lot worse than the fact that it is slow. We can work around slow, but blocking basically means that we won't have any more work to push - since we don't do threading. The processor that stales just sits idle until the lock is released. So any non-blocking solution to this would already be a win for us. > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs