Re: [PATCH 1/2] vmpressure: in-kernel notifications

Greg Thelen <gthelen@xxxxxxxxxx> · Wed, 24 Apr 2013 12:35:21 -0700

On Wed, Apr 24 2013, Glauber Costa wrote:

> On 04/24/2013 11:21 AM, Greg Thelen wrote:
>> On Tue, Apr 23 2013, Glauber Costa wrote:
>> 
>>> From: Glauber Costa <glommer@xxxxxxxxxxxxx>
>>>
>>> During the past weeks, it became clear to us that the shrinker interface
>>> we have right now works very well for some particular types of users,
>>> but not that well for others. The later are usually people interested in
>>> one-shot notifications, that were forced to adapt themselves to the
>>> count+scan behavior of shrinkers. To do so, they had no choice than to
>>> greatly abuse the shrinker interface producing little monsters all over.
>>>
>>> During LSF/MM, one of the proposals that popped out during our session
>>> was to reuse Anton Voronstsov's vmpressure for this. They are designed
>>> for userspace consumption, but also provide a well-stablished,
>>> cgroup-aware entry point for notifications.
>>>
>>> This patch extends that to also support in-kernel users. Events that
>>> should be generated for in-kernel consumption will be marked as such,
>>> and for those, we will call a registered function instead of triggering
>>> an eventfd notification.
>>>
>>> Please note that due to my lack of understanding of each shrinker user,
>>> I will stay away from converting the actual users, you are all welcome
>>> to do so.
>>>
>>> Signed-off-by: Glauber Costa <glommer@xxxxxxxxxx>
>>> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
>>> Cc: Anton Vorontsov <anton.vorontsov@xxxxxxxxxx>
>>> Cc: John Stultz <john.stultz@xxxxxxxxxx>
>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>> Cc: Joonsoo Kim <js1304@xxxxxxxxx>
>>> Cc: Michal Hocko <mhocko@xxxxxxx>
>>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>>> ---
>>>  include/linux/vmpressure.h |  6 ++++++
>>>  mm/vmpressure.c            | 48 ++++++++++++++++++++++++++++++++++++++++++----
>>>  2 files changed, 50 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
>>> index 76be077..1862012 100644
>>> --- a/include/linux/vmpressure.h
>>> +++ b/include/linux/vmpressure.h
>>> @@ -19,6 +19,9 @@ struct vmpressure {
>>>  	/* Have to grab the lock on events traversal or modifications. */
>>>  	struct mutex events_lock;
>>>  
>>> +	/* false if only kernel users want to be notified, true otherwise */
>>> +	bool notify_userspace;
>>> +
>>>  	struct work_struct work;
>>>  };
>>>  
>>> @@ -36,6 +39,9 @@ extern struct vmpressure *css_to_vmpressure(struct cgroup_subsys_state *css);
>>>  extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>>  				     struct eventfd_ctx *eventfd,
>>>  				     const char *args);
>>> +
>>> +extern int vmpressure_register_kernel_event(struct cgroup *cg,
>>> +					    void (*fn)(void));
>>>  extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft,
>>>  					struct eventfd_ctx *eventfd);
>>>  #else
>>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>>> index 736a601..8d77ad0 100644
>>> --- a/mm/vmpressure.c
>>> +++ b/mm/vmpressure.c
>>> @@ -135,8 +135,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>>  }
>>>  
>>>  struct vmpressure_event {
>>> -	struct eventfd_ctx *efd;
>>> +	union {
>>> +		struct eventfd_ctx *efd;
>>> +		void (*fn)(void);
>>> +	};
>>>  	enum vmpressure_levels level;
>>> +	bool kernel_event;
>>>  	struct list_head node;
>>>  };
>>>  
>>> @@ -152,7 +156,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>>  	mutex_lock(&vmpr->events_lock);
>>>  
>>>  	list_for_each_entry(ev, &vmpr->events, node) {
>>> -		if (level >= ev->level) {
>>> +		if (ev->kernel_event)
>>> +			ev->fn();
>>> +		else if (vmpr->notify_userspace && (level >= ev->level)) {
>>>  			eventfd_signal(ev->efd, 1);
>>>  			signalled = true;
>>>  		}
>>> @@ -227,7 +233,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>>  	 * we account it too.
>>>  	 */
>>>  	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
>>> -		return;
>>> +		goto schedule;
>>>  
>>>  	/*
>>>  	 * If we got here with no pages scanned, then that is an indicator
>>> @@ -238,14 +244,16 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>>  	 * through vmpressure_prio(). But so far, keep calm.
>>>  	 */
>>>  	if (!scanned)
>>> -		return;
>>> +		goto schedule;
>> 
>> If goto schedule is taken here then scanned==0.  Then
>> scanned<vmpressure_win (below), so this function would always simply
>> return.  So this change seems like a no-op.  Should the schedule: below
>> be just before schedule_work(&vmpr->work)?  But this wouldn't do much
>> either because vmpressure_work_fn() would immediately return if
>> vmpr->scanned==0.  Presumable the idea is to avoid notifying user space
>> or kernel callbacks if lru pages are not scanned - at least until
>> vmpressure_prio() is called with a priority more desperate than
>> vmpressure_level_critical_prio at which time this function's scanned!=0.
>> 
>
> Yes, the idea is to avoid calling the callbacks. I can just return at
> this point if you prefer. I figured that jumping to the common entry
> point would be more consistent, only that. I don't care either way.

Leave it as is.  I don't really care either way.

>>>  	mutex_lock(&vmpr->sr_lock);
>>>  	vmpr->scanned += scanned;
>>>  	vmpr->reclaimed += reclaimed;
>>> +	vmpr->notify_userspace = true;
>>>  	scanned = vmpr->scanned;
>>>  	mutex_unlock(&vmpr->sr_lock);
>>>  
>>> +schedule:
>>>  	if (scanned < vmpressure_win || work_pending(&vmpr->work))
>>>  		return;
>>>  	schedule_work(&vmpr->work);
>>> @@ -328,6 +336,38 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>>  }
>>>  
>>>  /**
>>> + * vmpressure_register_kernel_event() - Register kernel-side notification
>>> + * @cg:		cgroup that is interested in vmpressure notifications
>>> + * @fn:		function to be called when pressure happens
>>> + *
>>> + * This function register in-kernel users interested in receiving notifications
>>> + * about pressure conditions. Pressure notifications will be triggered at the
>>> + * same time as userspace notifications (with no particular ordering relative
>>> + * to it).
>>> + *
>>> + * Pressure notifications are a alternative method to shrinkers and will serve
>>> + * well users that are interested in a one-shot notification, with a
>>> + * well-defined cgroup aware interface.
>>> + */
>>> +int vmpressure_register_kernel_event(struct cgroup *cg, void (*fn)(void))
>> 
>> It seems useful to include the "struct cgroup *" as a parameter to fn.
>> This would allow for fn to shrink objects it's caching in the cgroup.
>> 
>> Also, why not allow level specification for kernel events?
>> 
> Because I don't want to overdesign. This is a in-kernel API, so we can
> change it if we want to. There is only one user, and that is called from
> the root cgroup, without level distinction.
>
> The cgroup argument makes sense, but I would rather leave it as is for
> now. As for levels, it might make sense as well, but I would much rather
> leave the implementation to someone actually using them - specially
> since this is not a simple parameter passing.

If there's going to be a single global listener for now, then I agree.
Leave your change as-is.

>> It might be neat if vmpressure_register_event() used
>> vmpressure_register_kernel_event() with a callback function calls
>> eventfd_signal().  This would allow for a uniform event notification
>> type which is agnostic of user vs kernel.  However, as proposed there
>> are different signaling conditions.  So I'm not sure it's worth the time
>> to combine the even types.  So feel free to ignore this paragraph.
>> 
>
> I don't think it is worth it.

Fine with me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>