On Thu, Feb 08, 2018 at 10:24:11AM +0200, Avi Kivity wrote: > On 02/08/2018 01:33 AM, Dave Chinner wrote: > >On Wed, Feb 07, 2018 at 07:20:17PM +0200, Avi Kivity wrote: > >>As usual, I'm having my lovely io_submit()s sleeping. This time some > >>detailed traces. 4.14.15. [....] > >>Forcing the log, so sleeping with ILOCK taken. > >Because it's trying to reallocate an extent that is pinned in the > >log and is marked stale. i.e. we are reallocating a recently freed > >metadata extent that hasn't been committed to disk yet. IOWs, it's > >the metadata form of the "log force to clear a busy extent so we can > >re-use it" condition.... > > > >There's nothing you can do to reliably avoid this - it's a sign that > >you're running low on free space in an AG because it's recycling > >recently freed space faster than the CIL is being committed to disk. > > > >You could speed up background journal syncs to try to reduce the > >async checkpoint latency that allows busy extents to build up > >(/proc/sys/fs/xfs/xfssyncd_centisecs) but that also impacts on > >journal overhead and IO latency, etc. > > Perhaps xfs should auto-tune this variable. That's not a fix. That's a nasty hack that attempts to hide the underlying problem of selecting AGs and/or free space that requires a log force to be used instead of finding other, un-encumbered freespace present in the filesystem. i.e. We know what the underlying problem is here, and there isn't a quick fix for it. You're just going to have to live with that until we work out a reliable, robust way of avoiding having busy extents block allocations. > >>reactor-29 3336 [029] 3580.420137: xfs:xfs_ilock: dev 9:0 ino > >>0x50000a9f flags ILOCK_EXCL caller kretprobe_trampoline > >> 7fffa0858572 xfs_ilock ([kernel.kallsyms]) > >> 7fff8105f5a0 kretprobe_trampoline ([kernel.kallsyms]) > >> 7fff8127314c file_update_time ([kernel.kallsyms]) > >> 7fffa0849ab9 xfs_file_aio_write_checks ([kernel.kallsyms]) > >> 7fffa0849bce xfs_file_dio_aio_write ([kernel.kallsyms]) > >> 7fffa084a13f xfs_file_write_iter ([kernel.kallsyms]) > >> 7fff812ab24f aio_write ([kernel.kallsyms]) > >> 7fff812ab90e do_io_submit ([kernel.kallsyms]) > >> 7fff812ac4c0 sys_io_submit ([kernel.kallsyms]) > >> 7fff81005917 do_syscall_64 ([kernel.kallsyms]) > >> 7fff81802115 return_from_SYSCALL_64 ([kernel.kallsyms]) > >>io_submit() stalls. Later, task 8146 relinquishes the lock, which > >>(after passing through another worker thread), is finally acquired > >>by io_submit() which the continues. > >Yup, so it's the timestamp updates at IO submission that are > >blocking waiting for the *ILOCK* lock. > > > >[.....] > > > >>The other is to check that ILOCK can be taken exclusively in > >>xfs_file_dio_aio_write, if IOCB_NOWAIT. If this is the way to go, I > >>might venture a patch. > >You're getting your locks mixed up - xfs_file_dio_aio_write() > >doesn't ever take the ILOCK directly. It takes the _IOLOCK_ > >directly, and that's what IOCB_NOWAIT acts on. > > It also takes ILOCK by calling file_update_time(), as seen in the > trace above. I've seen that it does a _nowait acquisition of the > IOLOCK, but that's not enough. > > If we could pass the iocb to file_update_time() then > xfs_vn_update_time, then it could perform a _nowait acquisition of > ILOCK. Otherwise, we can just check if ILOCK is acquirable in > xfs_file_aio_write_checks(). That's racy, but better than nothing. Sure, but that's not the issue with such suggestions. I'm not about to propose a patch to hack an iocb through generic infrastructure that doesn't need an iocb. Doing stuff like that will only get me shouted at because it's a massive layering violation and I should (and do!) know better.... > >require the _ILOCK_ to be taken exclusively, and there's no way > >of knowing if that is going to be needed at the level of > >xfs_file_dio_aio_write(). > > > >There are hooks in xfs_file_aio_write_checks() to avoid timestamp > >updates on write (FMODE_NOCMTIME) but there is no way to set this > >from userspace. It's used exclusively by the XFS open-by-handle > >code so that utilities like online defrag can write data into > >files without changing their user-visible timestamps. > > > > xfs_file_aio_write_checks() knows its going to call > file_update_time(), and it can guess that file_update_time() will > call xfs_vn_update_time(). So just before that call, check that > ILOCK is free and EAGAIN if not. Maybe that patch won't win any > beauty contests but it will reduce some of my pain. Again, grose, unmaintable layering violations. Expeditious solution to cater for your immediate need - you're welcome to do this with your own kernels, but it's not a solution we can really carry upstream. And that brings me to the real issue here: If your requirement really is "never block io_submit(), ever", then the correct change to make is to the aio code so that it will /always queue IO/ and never attempt to submit it directly. If you get that done, then we can stop playing this tiresome whack-a-mole game in XFS..... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html