Re: [PATCH] mm: disallow direct reclaim page writeback

Chris Mason <chris.mason@xxxxxxxxxx> · Wed, 14 Apr 2010 07:20:15 -0400

On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> Chris Mason <chris.mason@xxxxxxxxxx> writes:
> >
> > Huh, 912 bytes...for select, really?  From poll.h:
> >
> > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> >    additional memory. */
> > #define MAX_STACK_ALLOC 832
> > #define FRONTEND_STACK_ALLOC    256
> > #define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
> > #define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
> > #define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > #define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> >
> > So, select is intentionally trying to use that much stack.  It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
> 
> There are lots of other call chains which use multiple KB by themselves,
> so why not give select() that measly 832 bytes?
> 
> You think only file systems are allowed to use stack? :)

Grin, most definitely.

> 
> Basically if you cannot tolerate 1K (or more likely more) of stack
> used before your fs is called you're toast in lots of other situations
> anyways.

Well, on a 4K stack kernel, 832 bytes is a very large percentage for
just one function.
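
To put rough numbers on that, here is a minimal sketch of the
arithmetic.  The macros are the ones quoted from poll.h above; the
4096 byte stack size and both helper names are my assumptions for
illustration only, and it ignores whatever else already lives on the
stack.

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted from poll.h earlier in this thread. */
#define MAX_STACK_ALLOC      832
#define FRONTEND_STACK_ALLOC 256
#define WQUEUES_STACK_ALLOC  (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)

/* Assumption: a "4K stack" means 4096 usable bytes, no other overhead. */
#define STACK_SIZE_4K 4096

/* Hypothetical helper: share of the stack taken by `bytes`,
 * in tenths of a percent, rounded down. */
static unsigned int stack_share_permille(size_t bytes, size_t stack_size)
{
	assert(stack_size > 0);
	return (unsigned int)(bytes * 1000 / stack_size);
}

/* Hypothetical helper: bytes left for everything further down the chain. */
static size_t stack_left(size_t used, size_t stack_size)
{
	assert(used <= stack_size);
	return stack_size - used;
}
```

With those assumptions, stack_share_permille(MAX_STACK_ALLOC,
STACK_SIZE_4K) comes out at 203, so about 20% of the stack, and
stack_left() says 3264 bytes remain for every other frame in the chain.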

Direct reclaim is a problem because it splices parts of the kernel that
normally aren't connected together.  The people that code in select see
832 bytes and say that's teeny, I should have taken 3832 bytes.

But they don't realize their function can dive down into ecryptfs, then
the filesystem, then maybe the loop device, and then perhaps raid6 on
top of a network block device.

> 
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
> 
> It does this for large inputs, but the whole point of the stack fast
> path is to avoid it for common cases when a small number of fds is
> only needed.
> 
> It's significantly slower to go to any external allocator.

Yeah, but since the call chain does eventually go into the allocator,
this function needs to be more stack friendly.
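
For anyone following along, the pattern being discussed looks roughly
like the userspace sketch below: a small on-stack buffer for the common
case and a fall back to the allocator for large inputs.  This is not
the select code; sum_with_scratch() and INLINE_BYTES are made-up names,
and malloc() stands in for a kernel allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: mirrors the 256 byte FRONTEND_STACK_ALLOC quoted above. */
#define INLINE_BYTES 256

/*
 * Hypothetical helper: copy n longs into scratch space and sum them.
 * Small inputs use the on-stack buffer, larger ones use the heap.
 * Returns 0 on success, -1 if the heap allocation fails.
 * *used_heap reports which path was taken.
 * Overflow of n * sizeof(long) is not checked in this sketch.
 */
static int sum_with_scratch(const long *in, size_t n, long *sum,
			    int *used_heap)
{
	long stack_buf[INLINE_BYTES / sizeof(long)];
	long *buf = stack_buf;
	size_t i;

	assert(in != NULL && sum != NULL && used_heap != NULL);

	*used_heap = 0;
	if (n > sizeof(stack_buf) / sizeof(stack_buf[0])) {
		/* Slow path: too big for the inline buffer. */
		buf = malloc(n * sizeof(*buf));
		if (!buf)
			return -1;
		*used_heap = 1;
	}

	memcpy(buf, in, n * sizeof(*buf));
	*sum = 0;
	for (i = 0; i < n; i++)
		*sum += buf[i];

	if (buf != stack_buf)
		free(buf);
	return 0;
}
```

The trade-off is visible in the sizes: the bigger the inline buffer,
the more callers stay on the fast path, and the more stack every caller
pays for whether the chain below it is deep or not.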

I do agree that we can't really solve this with noinline_for_stack pixie
dust, the long call chains are going to be a problem no matter what.

Reading through all the comments so far, I think the short summary is:

Cleaning pages in direct reclaim helps the VM because it is able to make
sure that lumpy reclaim finds adjacent pages.  This isn't a fast
operation; it has to wait for IO (infinitely slow compared to the CPU).

Will it be good enough for the VM if we add a hint to the bdi writeback
threads to work on a general area of the file?  The filesystem will get
writepages(), the VM will get the IO it needs started.

I know Mel mentioned before he wasn't interested in waiting for helper
threads, but I don't see how we can work without them.

-chris