On Wed, Sep 08, 2010 at 06:49:23PM +1000, Dave Chinner wrote:
> On Tue, Sep 07, 2010 at 07:50:23AM +1000, Dave Chinner wrote:
> > On Mon, Sep 06, 2010 at 09:40:15AM +0100, Mel Gorman wrote:
> > > On Mon, Sep 06, 2010 at 02:02:43PM +1000, Dave Chinner wrote:
> > > > I just went to grab the CAL counters, and found the system in
> > > > another livelock. This time I managed to start the sysrq-trigger
> > > > dump while the livelock was in progress - I basically got one shot
> > > > at a command before everything stopped responding. Now I'm waiting
> > > > for the livelock to pass.... 5min.... the fs_mark workload
> > > > has stopped (ctrl-c finally responded), still livelocked....
> > > > 10min.... 15min.... 20min.... OK, back now.
> > > >
> > > > Interesting - all the fs_mark processes are in D state waiting on IO
> > > > completion processing.
> > >
> > > Very interesting, maybe they are all stuck in congestion_wait() this
> > > time? There are a few sources where that is possible.
> >
> > No, they are waiting on log IO completion, not doing allocation or
> > in the VM at all. They're stuck in xlog_get_iclog_state() waiting for
> > all the log IO buffers to be processed, and those are stuck behind the
> > inode buffer IO completions in the kworker threads that I posted.
> >
> > This is potentially caused by the kworker thread consolidation - log
> > IO completion processing used to be in a separate workqueue for
> > processing latency and deadlock prevention reasons - the data and
> > metadata IO completion can block, whereas we need the log IO
> > completion to occur as quickly as possible. I've seen one deadlock
> > that the separate work queues solved w.r.t. loop devices, and I
> > suspect that part of the problem here is that transaction completion
> > cannot occur (and free the memory it and the CIL holds) because log IO
> > completion processing is being delayed significantly by metadata IO
> > completion...
> .....
> > > > Which shows that this wasn't an IPI storm that caused this
> > > > particular livelock.
> > >
> > > No, but it's possible we got stuck somewhere like too_many_isolated()
> > > or in congestion_wait(). One thing at a time though, would you mind
> > > testing the following patch? I haven't tested this *at all* but it
> > > should reduce the number of times drain_all_pages() is called while
> > > not eliminating the calls entirely.
> >
> > Ok, I'll try it later today, but first I think I need to do some
> > deeper investigation on the kworker thread behaviour....
>
> Ok, so an update is needed here. I have confirmed that the above
> livelock was caused by the kworker thread consolidation, and I have
> a fix for it (make the log IO completion processing queue WQ_HIGHPRI
> so it gets queued ahead of the data/metadata IO completions), and
> I've been able to create over a billion inodes now without a
> livelock occurring. See the thread titled "[2.6.36-rc3] Workqueues,
> XFS, dependencies and deadlock" if you want more details.
>

Good stuff. I read through the thread and it seemed reasonable.
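For anyone else skimming the thread: as I understand it, the fix gives
the XFS log IO completion work its own high-priority workqueue so it is
no longer queued behind data/metadata IO completion work on the shared
kworker pool. A minimal sketch of the idea - the queue name, flags and
max_active below are my assumptions, not necessarily what Dave's patch
actually does:

	#include <linux/init.h>
	#include <linux/errno.h>
	#include <linux/workqueue.h>

	/*
	 * Sketch only. WQ_HIGHPRI makes work queued here run ahead of
	 * normal-priority work items, and WQ_RESCUER provides a rescuer
	 * thread so the queue can make forward progress under memory
	 * pressure. max_active of 1 preserves the old single-threaded
	 * completion ordering.
	 */
	static struct workqueue_struct *xfslogd_workqueue;

	static int __init xfs_logd_init(void)
	{
		xfslogd_workqueue = alloc_workqueue("xfslogd",
					WQ_RESCUER | WQ_HIGHPRI, 1);
		return xfslogd_workqueue ? 0 : -ENOMEM;
	}

Log IO completion handlers would then be queued to it with
queue_work(xfslogd_workqueue, ...) rather than to the default queue.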
> To confirm that I've been seeing two different livelocks, I removed
> Mel's series from my tree (which still contained the above workqueue
> fix), and I started seeing short memory allocation livelocks (10-15s
> at most) with abnormal increases in CAL counts indicating an
> increase in IPIs during the short livelocks. IOWs, the livelock
> wasn't as severe as before the workqueue fix, but it was still
> present.
>
> Hence the workqueue issue was definitely a contributing factor to
> the severity of the memory-allocation-triggered issue.
>

Good. Considering that this class of bugs in either the page allocator
or page reclaim can come down to timing, it makes sense that a big
change in the ordering of events could compound problems in the VM.

> It is clear that there have been two different livelocks with
> different causes triggered by the same test, which has led to a lot
> of confusion in this thread. It appears that Mel's patch series as
> originally posted in this thread is all that is necessary to avoid
> the memory allocation livelock issue I was seeing. The workqueue
> fix solves the other livelock I was seeing once Mel's patches were
> in place.
>
> Thanks to everyone for helping me track these livelocks down and
> providing lots of suggestions for things to try. I'll keep testing
> and looking for livelocks, but my confidence is increasing that
> we've got to the root of them now.
>

It has been pointed out that the fix potentially increases the number
of IPIs sent. On larger machines, I worry that the resulting delays
could be severe and that we'll see other problems down the line.
Hence, I'd like to reduce the number of calls to drain_all_pages()
without eliminating them entirely. I'm currently in the process of
testing the following patch, but can you try it as well please? In
particular, I am curious to see whether the performance of fs_mark
improves at all and whether the interrupt counts drop as a result of
the patch.

Thanks.
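As background on why the IPI count matters so much: drain_all_pages()
is effectively an IPI broadcast. Paraphrasing what mm/page_alloc.c
does (close to the real thing, but not a verbatim copy):

	/*
	 * drain_local_pages() returns the calling CPU's per-cpu free
	 * pages to the buddy lists. on_each_cpu() runs it on every
	 * online CPU via IPI, so each call to drain_all_pages()
	 * interrupts every CPU in the machine. Frequent callers on a
	 * big machine turn that into an IPI storm.
	 */
	void drain_all_pages(void)
	{
		on_each_cpu(drain_local_pages, NULL, 1);
	}

So every call we avoid saves one interrupt per online CPU.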
==== CUT HERE ====
mm: page allocator: Reduce the instances where drain_all_pages() is called

When a page allocation fails after direct reclaim, the per-cpu lists
are drained and another attempt is made to allocate. On larger systems,
this can cause IPI storms in low-memory situations, with latencies
increasing the more CPUs there are on the system. In extreme
situations, it is suspected it could cause livelock-like conditions.

This patch restores the older behaviour of calling drain_all_pages()
after direct reclaim fails only for high-order allocations. As there
is an expectation that lower orders will free naturally, the drain
only occurs for order > PAGE_ALLOC_COSTLY_ORDER. The reasoning is
that such an allocation is already expected to be expensive and rare,
so there will not be a resulting IPI storm.

Calls to drain_all_pages() are not eliminated entirely, as it is
still the case that an allocation can fail because the necessary
pages are pinned in the per-cpu lists. After this patch, the lists
are only drained as a last resort before calling the OOM killer.

Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
---
 mm/page_alloc.c |   23 ++++++++++++++++++++---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 750e1dc..16f516c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1737,6 +1737,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	int migratetype)
 {
 	struct page *page;
+	bool drained = false;

 	/* Acquire the OOM killer lock for the zones in zonelist */
 	if (!try_set_zonelist_oom(zonelist, gfp_mask)) {
@@ -1744,6 +1745,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}

+retry:
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
@@ -1773,6 +1775,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
+
+	/*
+	 * If an allocation failed, it could be because pages are pinned on
+	 * the per-cpu lists. Before resorting to the OOM killer, try
+	 * draining
+	 */
+	if (!drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	/* Exhausted what can be done so it's blamo time */
 	out_of_memory(zonelist, gfp_mask, order, nodemask);

@@ -1876,10 +1890,13 @@ retry:
 					migratetype);

 	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * If a high-order allocation failed after direct reclaim, it could
+	 * be because pages are pinned on the per-cpu lists. However, only
+	 * do it for PAGE_ALLOC_COSTLY_ORDER as the cost of the IPI needed
+	 * to drain the pages is itself high. Assume that lower orders
+	 * will naturally free without draining.
 	 */
-	if (!page && !drained) {
+	if (!page && !drained && order > PAGE_ALLOC_COSTLY_ORDER) {
 		drain_all_pages();
 		drained = true;
 		goto retry;
--
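P.S. For reference, PAGE_ALLOC_COSTLY_ORDER is 3:

	/* include/linux/mmzone.h */
	#define PAGE_ALLOC_COSTLY_ORDER 3

so with this patch the drain after a failed direct reclaim only fires
for order-4 and larger allocations; order-3 and below now only drain
on the OOM path above.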