Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg

Shakeel Butt <shakeelb@xxxxxxxxxx> · Tue, 20 Feb 2018 11:20:15 -0800

On Tue, Feb 20, 2018 at 4:43 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Mon 19-02-18 21:07:28, Amir Goldstein wrote:
>> On Mon, Feb 19, 2018 at 3:50 PM, Jan Kara <jack@xxxxxxx> wrote:
>> [...]
>> > For fanotify without FAN_UNLIMITED_QUEUE the situation is similar as for
>> > inotify - IMO low practical impact, apps should generally handle queue
>> > overflow so I don't see a need for any opt in (more accurate memcg charging
>> > takes precedense over possibly broken apps).
>> >
>> > For fanotify with FAN_UNLIMITED_QUEUE the situation is somewhat different -
>> > firstly there is a practical impact (memory consumption is not limited by
>> > anything else) and secondly there are higher chances of the application
>> > breaking (no queue overflow expected) and also that this breakage won't be
>> > completely harmless (e.g., the application participates in securing the
>> > system). I've been thinking about this "conflict of interests" for some
>> > time and currently I think that the best handling of this is that by
>> > default events for FAN_UNLIMITED_QUEUE groups will get allocated with
>> > GFP_NOFAIL - such groups can be created only by global CAP_SYS_ADMIN anyway
>> > so it is reasonably safe against misuse (and since the allocations are
>> > small it is in fact equivalent to current status quo, just more explicit).
>> > That way application won't see unexpected queue overflow. The process
>> > generating event may be looping in the allocator but that is the case
>> > currently as well. Also the memcg with the consumer of events will have
>> > higher chances of triggering oom-kill if events consume too much memory but
>> > I don't see how this is not a good thing by default - and if such reaction
>> > is not desirable, there's memcg's oom_control to tune the OOM behavior
>> > which has capabilities far beyond of what we could invent for fanotify...
>> >
>> > What do you think Amir?
>> >
>>
>> If I followed all your reasoning correctly, you propose to change behavior to
>> always account events to group memcg and never fail event allocation,
>> without any change of API and without opting-in for new behavior?
>> I think it makes sense. I can't point at any expected breakage,
>> so overall, this would be a good change.
>>
>> I just feel sorry about passing an opportunity to improve functionality.
>> The fact that fanotify does not have a way for defining the events queue
>> size is a deficiency IMO, one which I had to work around in the past.
>> I find that assigning group to memgc and configure memcg to desired
>> memory limit and getting Q_OVERFLOW on failure to allocate event
>> is going to be a proper way of addressing this deficiency.
>
> So if you don't pass FAN_Q_UNLIMITED, you will get queue with a fixed size
> and will get Q_OVERFLOW if that is exceeded. So is your concern that you'd
> like some other fixed limit? Larger one or smaller one and for what
> reason?
>
>> But if you don't think we should bind these 2 things together,
>> I'll let Shakeel decide if he want to pursue the Q_OVERFLOW change
>> or not.
>
> So if there is still some uncovered use case for finer tuning of event
> queue length than setting or not setting FAN_Q_UNLIMITED (+ possibly
> putting the task to memcg to limit memory usage), we can talk about how to
> address that but at this point I don't see a strong reason to bind this to
> whether / how events are accounted to memcg...
>
> And we still need to make sure we properly do ENOMEM -> Q_OVERFLOW
> translation and use GFP_NOFAIL for FAN_Q_UNLIMITED groups before merging
> Shakeel's memcg accounting patches. But Shakeel does not have to be the one
> implementing that (although if you want to, you are welcome Shakeel :) -
> otherwise I hope I'll get to it reasonably soon).
>

Thanks Jan & Amir for the help and explanation. I think, Jan, you can
implement the "ENOMEM -> Q_OVERFLOW" and GFP_NOFAIL changes better
than me. I will send out my patches with minor changes based on
feedback but I will let Andrew know to keep my patches in mm tree and
not send for upstream merge. Once Jan has added his patches, I will
Andrew know to go forward with my patches.

thanks,
Shakeel