Re: [PATCH -v2 -mm] add extra free kbytes tunable

KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> · Thu, 13 Oct 2011 17:09:07 +0900

On Thu, 13 Oct 2011 16:33:21 +0900
Minchan Kim <minchan.kim@xxxxxxxxx> wrote:

> On Fri, Sep 02, 2011 at 12:31:14PM -0400, Satoru Moriya wrote:
> > On 09/01/2011 05:58 PM, Andrew Morton wrote:
> > > On Thu, 1 Sep 2011 15:26:50 -0400
> > > Rik van Riel <riel@xxxxxxxxxx> wrote:
> > > 
> > >> Add a userspace visible knob
> > > 
> > > argh.  Fear and hostility at new knobs which need to be maintained for 
> > > ever, even if the underlying implementation changes.
> > > 
> > > Unfortunately, this one makes sense.
> > > 
> > >> to tell the VM to keep an extra amount of memory free, by increasing 
> > >> the gap between each zone's min and low watermarks.
> > >>
> > >> This is useful for realtime applications that call system calls and 
> > >> have a bound on the number of allocations that happen in any short 
> > >> time period.  In this application, extra_free_kbytes would be left at 
> > >> an amount equal to or larger than the maximum number of 
> > >> allocations that happen in any burst.
> > > 
> > > _is_ it useful?  Proof?
> > > 
> > > Who is requesting this?  Have they tested it?  Results?
> > 
> > This is interesting for me.
> > 
> > Some of our customers have realtime applications and they are concerned 
> > the fact that Linux uses free memory as pagecache. It means that
> > when their application allocate memory, Linux kernel tries to reclaim
> > memory at first and then allocate it. This may make memory allocation
> > latency bigger.
> > 
> > In many cases this is not a big issue because Linux has kswapd for
> > background reclaim and it is fast enough not to enter direct reclaim
> > path if there are a lot of clean cache. But under some situations -
> > e.g. Application allocates a lot of memory which is larger than delta
> > between watermark_low and watermark_min in a short time and kswapd
> > can't reclaim fast enough due to dirty page reclaim, direct reclaim
> > is executed and causes big latency.
> > 
> > We can avoid the issue above by using preallocation and mlock.
> > But it can't cover kmalloc used in systemcall. So I'd like to use
> > this patch with mlock to avoid memory allocation latency issue as
> > low as possible. It may not be a perfect solution but it is important
> > for customers in enterprise area to configure the amount of free
> > memory at their own risk.
> 
> I agree needs for such feature but don't like such primitive interface
> exporting to user.
> 
> As Satoru said, we can reserve free pages for user through preallocation and mlocking.
> The thing is free pages for kernel itself.
> Most desirable thing is we have to avoid syscall in critical realtime section.
> But if we can't avoid, my crazy idea is to use memcg for kernel pages.
> Of course, we should implement it and not simple stuff but AFAIK, memcg people
> always consider it and finally will do it. :)
> Recently, Glauber try "Basic kernel memory functionality" but I don't have reviewed
> it yet. I am not sure we can reuse it, anyway. Kame?
> 

I reviewed it and it seems good. It adds kmem.limit_in_bytes then we're ready
to go forward to kernel memory cgroup.
But it adds only interfaces now.

I think  Greg Thelen <gthelen@xxxxxxxxxx> has some idea.

> My simple idea is as follows,
> 
> We can assign basic revered page pool and/or size of user-determined pages pool
> for each task registred at memcg-slab.

Hmm, memcg-mempool ?

> The application have to notify start of RT section to memcg before it goes to
> RT section. So, memcg could fill up page pool if it is short. In this case,
> application can stuck but it's okay as it doesn't go to RT section yet.
> The applicatoin have to notify end of RT section to memcg, too so that memcg
> could try to fill up reserved page pool in case of shortage.
> 

That 'notification' doesn't sounds good to me. When application died/moved to
other group without notification, memcg will be unstable.
It should be task's state rather than memcg's state.

> Why we need such notification is kswapd high prioiry, new knob and others never
> can meet application's deadline requirement in some situations(ex,
> there are so many dirty pages in LRU or fill up anon pages in non-swap case and so on)
> so that application might end up stuck at some point. The somepoint must be out of RT
> section of the task.
> 
> For implemenation, we might need new watermark setting for each memcg or/and
> kswapd prioirity promotion like thing for hurry reclaiming.
> Anyway, they are just implementaions and we could enhance/add further more through
> various techniques as time goes by.
> 
> Personally, I think it could a valuable featue.
> 

Hmm. For avoid latency at allocation, what we can do is only pre-allocation before it's
required. But the problem is that applications cannot forecast when the 'burst' allocation
happens and we need to prepare memory pool always.

I think we need 2 implemenations.

1. free-page mempool for a memcg.
2. a background reclaim thread for a memcg. This is triggered by mempool.
   Prioritity of this thread should be able to controlled by some ways.

If we take care of memcg's limit, watermark should trigger background reclaim.

?
But the memory reclaim routine should never be in sleep...

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>