Re: [PATCH 1/2] vmpressure: in-kernel notifications

Glauber Costa <glommer@xxxxxxxxxxxxx> · Wed, 24 Apr 2013 12:36:00 +0400

On 04/24/2013 11:21 AM, Greg Thelen wrote:
> On Tue, Apr 23 2013, Glauber Costa wrote:
> 
>> From: Glauber Costa <glommer@xxxxxxxxxxxxx>
>>
>> During the past weeks, it became clear to us that the shrinker interface
>> we have right now works very well for some particular types of users,
>> but not that well for others. The later are usually people interested in
>> one-shot notifications, that were forced to adapt themselves to the
>> count+scan behavior of shrinkers. To do so, they had no choice than to
>> greatly abuse the shrinker interface producing little monsters all over.
>>
>> During LSF/MM, one of the proposals that popped out during our session
>> was to reuse Anton Voronstsov's vmpressure for this. They are designed
>> for userspace consumption, but also provide a well-stablished,
>> cgroup-aware entry point for notifications.
>>
>> This patch extends that to also support in-kernel users. Events that
>> should be generated for in-kernel consumption will be marked as such,
>> and for those, we will call a registered function instead of triggering
>> an eventfd notification.
>>
>> Please note that due to my lack of understanding of each shrinker user,
>> I will stay away from converting the actual users, you are all welcome
>> to do so.
>>
>> Signed-off-by: Glauber Costa <glommer@xxxxxxxxxx>
>> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
>> Cc: Anton Vorontsov <anton.vorontsov@xxxxxxxxxx>
>> Cc: John Stultz <john.stultz@xxxxxxxxxx>
>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Cc: Joonsoo Kim <js1304@xxxxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxx>
>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>> ---
>>  include/linux/vmpressure.h |  6 ++++++
>>  mm/vmpressure.c            | 48 ++++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 50 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
>> index 76be077..1862012 100644
>> --- a/include/linux/vmpressure.h
>> +++ b/include/linux/vmpressure.h
>> @@ -19,6 +19,9 @@ struct vmpressure {
>>  	/* Have to grab the lock on events traversal or modifications. */
>>  	struct mutex events_lock;
>>  
>> +	/* false if only kernel users want to be notified, true otherwise */
>> +	bool notify_userspace;
>> +
>>  	struct work_struct work;
>>  };
>>  
>> @@ -36,6 +39,9 @@ extern struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css);
>>  extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>  				     struct eventfd_ctx *eventfd,
>>  				     const char *args);
>> +
>> +extern int vmpressure_register_kernel_event(struct cgroup *cg,
>> +					    void (*fn)(void));
>>  extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
>>  					struct eventfd_ctx *eventfd);
>>  #else
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index 736a601..8d77ad0 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -135,8 +135,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>  }
>>  
>>  struct vmpressure_event {
>> -	struct eventfd_ctx *efd;
>> +	union {
>> +		struct eventfd_ctx *efd;
>> +		void (*fn)(void);
>> +	};
>>  	enum vmpressure_levels level;
>> +	bool kernel_event;
>>  	struct list_head node;
>>  };
>>  
>> @@ -152,7 +156,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>  	mutex_lock(&vmpr->events_lock);
>>  
>>  	list_for_each_entry(ev, &vmpr->events, node) {
>> -		if (level >= ev->level) {
>> +		if (ev->kernel_event)
>> +			ev->fn();
>> +		else if (vmpr->notify_userspace && (level >= ev->level)) {
>>  			eventfd_signal(ev->efd, 1);
>>  			signalled = true;
>>  		}
>> @@ -227,7 +233,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>  	 * we account it too.
>>  	 */
>>  	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
>> -		return;
>> +		goto schedule;
>>  
>>  	/*
>>  	 * If we got here with no pages scanned, then that is an indicator
>> @@ -238,14 +244,16 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>  	 * through vmpressure_prio(). But so far, keep calm.
>>  	 */
>>  	if (!scanned)
>> -		return;
>> +		goto schedule;
> 
> If goto schedule is taken here then scanned==0.  Then
> scanned<vmpressure_win (below), so this function would always simply
> return.  So this change seems like a no-op.  Should the schedule: below
> be just before schedule_work(&vmpr->work)?  But this wouldn't do much
> either because vmpressure_work_fn() would immediately return if
> vmpr->scanned==0.  Presumable the idea is to avoid notifying user space
> or kernel callbacks if lru pages are not scanned - at least until
> vmpressure_prio() is called with a priority more desperate than
> vmpressure_level_critical_prio at which time this function's scanned!=0.
> 

Yes, the idea is to avoid calling the callbacks. I can just return at
this point if you prefer. I figured that jumping to the common entry
point would be more consistent, only that. I don't care either way.

>>  
>>  	mutex_lock(&vmpr->sr_lock);
>>  	vmpr->scanned += scanned;
>>  	vmpr->reclaimed += reclaimed;
>> +	vmpr->notify_userspace = true;
>>  	scanned = vmpr->scanned;
>>  	mutex_unlock(&vmpr->sr_lock);
>>  
>> +schedule:
>>  	if (scanned < vmpressure_win || work_pending(&vmpr->work))
>>  		return;
>>  	schedule_work(&vmpr->work);
>> @@ -328,6 +336,38 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>  }
>>  
>>  /**
>> + * vmpressure_register_kernel_event() - Register kernel-side notification
>> + * @cg:		cgroup that is interested in vmpressure notifications
>> + * @fn:		function to be called when pressure happens
>> + *
>> + * This function register in-kernel users interested in receiving notifications
>> + * about pressure conditions. Pressure notifications will be triggered at the
>> + * same time as userspace notifications (with no particular ordering relative
>> + * to it).
>> + *
>> + * Pressure notifications are a alternative method to shrinkers and will serve
>> + * well users that are interested in a one-shot notification, with a
>> + * well-defined cgroup aware interface.
>> + */
>> +int vmpressure_register_kernel_event(struct cgroup *cg, void (*fn)(void))
> 
> It seems useful to include the "struct cgroup *" as a parameter to fn.
> This would allow for fn to shrink objects it's caching in the cgroup.
> 
> Also, why not allow level specification for kernel events?
> 
Because I don't want to overdesign. This is a in-kernel API, so we can
change it if we want to. There is only one user, and that is called from
the root cgroup, without level distinction.

The cgroup argument makes sense, but I would rather leave it as is for
now. As for levels, it might make sense as well, but I would much rather
leave the implementation to someone actually using them - specially
since this is not a simple parameter passing.

> It might be neat if vmpressure_register_event() used
> vmpressure_register_kernel_event() with a callback function calls
> eventfd_signal().  This would allow for a uniform event notification
> type which is agnostic of user vs kernel.  However, as proposed there
> are different signaling conditions.  So I'm not sure it's worth the time
> to combine the even types.  So feel free to ignore this paragraph.
> 

I don't think it is worth it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>