On 1/14/17 3:25 PM, Thorvald Natvig wrote:
> Hi,
>
> We've run into a somewhat unexpected condition. Under high memory
> pressure and high I/O write pressure on slow media, when doing network
> calls, we have a call chain that looks like:
>
> .. -> tcp_recvmsg -> .. -> do_page_fault -> .. ->
> __alloc_pages_slowpath -> try_to_free_pages -> .. -> shrink_slab ->
> super_cache_scan -> xfs_fs_free_cached_objects ->
> xfs_reclaim_inodes_nr -> xfs_reclaim_inodes_ag -> mutex_lock ->
> __mutex_lock_slowpath
>
> And it stays stuck there. This causes the network traffic to stall,
> which causes applications (in this case Ceph OSDs) to fail basic
> health checks.

You might also want to take a look at the thread

  [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim

from this past mid-October, and also Dave's patch sent mid-November in
that same thread. (I need to re-read it too.)

-Eric

> This particular call chain is due to the code at the end of
> xfs_reclaim_inodes_ag:
>
>     if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
>         trylock = 0;
>         goto restart;
>     }
>
> The code first tries to reclaim with a trylock on each per-AG mutex,
> but if it fails to release a sufficient number of items and there were
> allocation groups it failed to lock, it tries again with blocking
> locks. If another kernel thread holds the mutexes for any reason (such
> as currently flushing the group), we essentially make kernel memory
> allocation wait for disk I/O.
>
> On this particular system, we have 30 other XFS filesystems mounted,
> and there are also a lot of non-XFS caches that could be reclaimed to
> meet this memory request. There's about 100GB of other caches that
> could be released, so why block?
>
> We've worked around this with the following SystemTap probe:
>
> probe module("xfs").function("xfs_reclaim_inodes_ag").call {
>     printf("%s -> %s: %d %s [%s]\n", thread_indent(0),
>            probefunc(), kernel_int($nr_to_scan),
>            kernel_string($mp->m_fsname), $$parms)
>     print_backtrace()
>     $flags = $flags & 2
> }
>
> In other words, we remove the SYNC_WAIT flag from the call. This
> causes the slab shrinker to move on to the next candidate for
> releasing memory. So far, this seems to fix all the problems we've
> seen. The probe could probably be improved to only do this for call
> chains that reach xfs_reclaim_inodes_ag from shrink_slab.
>
> Is there a better way to fix this problem?
>
> - Thorvald
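
The trylock-then-block pattern Thorvald describes lives in the per-AG
loop of xfs_reclaim_inodes_ag(). The following is a paraphrased sketch
of that loop's locking from fs/xfs/xfs_icache.c in kernels of this era,
trimmed to the relevant branches; it is illustrative, not a verbatim
excerpt:

    if (trylock) {
        if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
            /* First pass: another thread holds this AG's reclaim
             * lock, so count it as skipped rather than block. */
            skipped++;
            continue;
        }
    } else {
        /* Restart pass (SYNC_WAIT): block until the current holder
         * (possibly mid-flush) drops the lock. This is the
         * mutex_lock in the stalled call chain above. */
        mutex_lock(&pag->pag_ici_reclaim_lock);
    }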
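
The "$flags = $flags & 2" line in the probe works because of the SYNC
flag values XFS uses (0x0001 for SYNC_WAIT, 0x0002 for SYNC_TRYLOCK in
fs/xfs/xfs_icache.h of this vintage): masking with 2 clears SYNC_WAIT
while preserving SYNC_TRYLOCK. A minimal standalone C program showing
the arithmetic, with the flag values copied on that assumption:

    #include <stdio.h>

    /* Flag values as defined in fs/xfs/xfs_icache.h (4.x-era kernels) */
    #define SYNC_WAIT    0x0001  /* wait for I/O to complete */
    #define SYNC_TRYLOCK 0x0002  /* only try to lock inodes */

    int main(void)
    {
        /* The shrinker path passes both flags down, per the call
         * chain and restart logic quoted above. */
        int flags = SYNC_TRYLOCK | SYNC_WAIT;

        flags &= 2;  /* the probe's "$flags = $flags & 2" */

        printf("SYNC_WAIT:    %s\n", (flags & SYNC_WAIT) ? "set" : "clear");
        printf("SYNC_TRYLOCK: %s\n", (flags & SYNC_TRYLOCK) ? "set" : "clear");
        return 0;
    }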
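
The in-kernel equivalent of the workaround, and roughly the shape of
the mid-October RFC referenced above, is to stop passing SYNC_WAIT on
the shrinker path. A sketch against the 4.x-era xfs_reclaim_inodes_nr(),
written from memory rather than taken from the posted patch, so details
may differ:

    /* fs/xfs/xfs_icache.c (sketch, not the posted patch) */
    long
    xfs_reclaim_inodes_nr(
        struct xfs_mount    *mp,
        int                 nr_to_scan)
    {
        /* Kick background reclaim and push the AIL so reclaim still
         * makes progress asynchronously. */
        xfs_reclaim_work_queue(mp);
        xfs_ail_push_all(mp->m_ail);

        /* SYNC_TRYLOCK only: skip contended AGs instead of blocking,
         * so the slab shrinker can move on to other caches. */
        return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
    }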