On Wed, Oct 20, 2010 at 11:01 AM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> On Wed, Oct 20, 2010 at 03:25:49PM +0800, Torsten Kaiser wrote:
>> On Wed, Oct 20, 2010 at 7:57 AM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>> > On Tue, Oct 19, 2010 at 06:06:21PM +0800, Torsten Kaiser wrote:
>> >> swap_writepage() uses get_swap_bio(), which uses bio_alloc() to get one
>> >> bio. That bio is then submitted, but the submit path seems to get into
>> >> make_request from raid1.c, and that allocates a second bio from
>> >> bio_alloc() via bio_clone().
>> >>
>> >> I am seeing this pattern (swap_writepage calling
>> >> md_make_request/make_request and then getting stuck in mempool_alloc)
>> >> more than 5 times in the SysRq+T output...
>> >
>> > I bet the root cause is the failure of pool->alloc(__GFP_NORETRY)
>> > inside mempool_alloc(), which can be fixed by this patch.
>>
>> No. I tested the patch (on top of Neil's fix and your patch regarding
>> too_many_isolated()), but the system got stuck the same way on the
>> first try to fill the tmpfs.
>> I think the basic problem is that the mempool that should guarantee
>> progress is exhausted because the raid1 device is stacked between the
>> pageout code and the disks, and so the "use only 1 bio" rule gets
>> violated.
>
> The mempool gets exhausted because pool->alloc() failed at least 2
> times. But there is no such high memory pressure, except for some
> parallel reclaimers. It seems the below patch does not completely
> stop the page allocation failures, hence does not stop the deadlock.
>
> As you and KOSAKI said, the root cause is BIO_POOL_SIZE being smaller
> than the total possible allocations in the IO stack. Then why not
> bump BIO_POOL_SIZE up to something like 64? That would be large enough
> to allow multiple stacked IO layers.
>
> And the larger value will allow more concurrent in-flight IOs for better
> IO throughput in such a situation. Commit 5972511b7 lowered it from 256
> to 2 because it believed that pool->alloc() would only fail in a near-OOM
> situation. However, the truth is that __GFP_NORETRY allocations fail much
> more easily in _normal_ operation (whenever there are multiple
> concurrent page reclaimers). We have to be able to perform better in
> such a situation. The __GFP_NORETRY patch to reduce failures is one
> option; increasing BIO_POOL_SIZE is another.
>
> So would you try this fix?

While it seems to fix the hang (just vanilla -rc8, even without Neil's
fix to raid1.c, did not hang during multiple runs of my test script), I
believe this is not a fix for the problem.

To quote the comment above bio_alloc() in fs/bio.c:
 * If %__GFP_WAIT is set, then bio_alloc will always be able to allocate
 * a bio. This is due to the mempool guarantees. To make this work, callers
 * must never allocate more than 1 bio at a time from this pool. Callers
 * that need to allocate more than 1 bio must always submit the previously
 * allocated bio for IO before attempting to allocate a new one. Failure to
 * do so can cause livelocks under memory pressure.

So it seems that limiting fs_bio_set to only 2 entries was intended. And
the commit message of the change that reduced this from 256 to 2 even
said that this pool is only intended as a last resort to sustain
progress.

And I think in my testcase the important thing is not good performance,
but only to make sure the system does not hang. Both of the situations
that cause the hang for me were more cases of "what not to do" than
anything that should perform well.
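As an aside, here is a small userspace model of how the "1 bio" rule gets
violated by the stacking. It is purely illustrative, not kernel code: a
semaphore stands in for the fs_bio_set reserve, the names are made up, and
each "writer" grabs one reserved element for the swap-layer bio and a second
one for the clone made inside raid1 before releasing either of them.

/*
 * Illustrative model only, NOT kernel code: the semaphore stands in for
 * the BIO_POOL_SIZE (= 2) reserve of fs_bio_set, and each writer takes
 * one element for swap_writepage()'s bio and a second one for the clone
 * made inside the stacked raid1 layer before releasing either of them.
 * Build with: gcc -pthread model.c
 */
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

static sem_t pool;                      /* models the 2-element bio mempool */
static pthread_barrier_t both_hold_one; /* force the worst-case interleaving */

static int pool_get(int timeout_sec)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += timeout_sec;
	return sem_timedwait(&pool, &ts); /* -1/ETIMEDOUT == "stuck" */
}

static void *writer(void *arg)
{
	long id = (long)arg;

	pool_get(5);                      /* bio_alloc() in swap_writepage() */
	printf("writer %ld: got first bio\n", id);
	pthread_barrier_wait(&both_hold_one);

	/* bio_clone() in raid1's make_request(), first bio still held */
	if (pool_get(5) == -1 && errno == ETIMEDOUT)
		printf("writer %ld: stuck waiting for second bio\n", id);
	else
		sem_post(&pool);          /* release the clone */

	sem_post(&pool);                  /* release the first bio */
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	long i;

	sem_init(&pool, 0, 2);            /* BIO_POOL_SIZE == 2 */
	pthread_barrier_init(&both_hold_one, NULL, 2);
	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, writer, (void *)i);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}

With two requests in flight across the stacked layers and only two reserved
bios, the reserve alone can never make progress once the slab allocations
start failing - which is exactly what the "must never allocate more than 1
bio at a time from this pool" rule above is warning about.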
In the original case the system started too many gcc's because I'm too
lazy to figure out a better way to organize the parallelization of
independent single-threaded compiles and parallel makes. The Gentoo
package manager tries to use the load average for that, but this is
foiled if a compile first has a single-threaded part (like configure)
and only later switches to parallel compiles. So portage started 4
package compilations because the load was low during configure, and then
the system had to deal with 20 gcc's (make -j5) eating all of its memory.
And even that seemed to happen only during one part of the compiles, as
in the non-hanging cases the swapping soon stopped. So the system would
just have to survive the initial overallocation with the small mempool
and everything would be fine.

My reduced testcase is even less representative of a real load: I'm just
using multiple dd's to fill a tmpfs as fast as I can, to see if
raid1.c::make_request() breaks under memory pressure. And here too, the
only goal is that the kernel survives this abuse.

Sorry, but increasing BIO_POOL_SIZE just looks like papering over the
real problem...

Torsten

> --- linux-next.orig/include/linux/bio.h	2010-10-20 16:55:57.000000000 +0800
> +++ linux-next/include/linux/bio.h	2010-10-20 16:56:54.000000000 +0800
> @@ -286,7 +286,7 @@ static inline void bio_set_completion_cp
>   * These memory pools in turn all allocate from the bio_slab
>   * and the bvec_slabs[].
>   */
> -#define BIO_POOL_SIZE 2
> +#define BIO_POOL_SIZE 64
>  #define BIOVEC_NR_POOLS 6
>  #define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
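For context on the hunk above: as far as I remember, BIO_POOL_SIZE is simply
the reserve size handed to bioset_create() when fs_bio_set is set up at boot,
so the patch only enlarges the emergency mempool behind bio_alloc()/bio_clone()
and leaves the normal slab path alone. Roughly like this (paraphrased from
memory, not a verbatim quote of fs/bio.c):

/* fs/bio.c, simplified and from memory -- check the actual tree */
static int __init init_bio(void)
{
	/*
	 * fs_bio_set backs bio_alloc() and bio_clone(); BIO_POOL_SIZE is
	 * the number of bios the mempool keeps in reserve so that forward
	 * progress is possible when the slab allocation fails.
	 */
	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
	if (!fs_bio_set)
		panic("bio: can't allocate bios\n");

	return 0;
}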