On Sat, Dec 19, 2020 at 4:31 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote: > > On Sat, Dec 19, 2020 at 1:48 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > On Sat, Dec 19, 2020 at 12:11 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote: > > > > > > Currently the fs sysctl inotify/max_user_instances is used to limit the > > > number of inotify instances on the system. For systems running multiple > > > workloads, the per-user namespace sysctl max_inotify_instances can be > > > used to further partition inotify instances. However there is no easy > > > way to set a sensible system level max limit on inotify instances and > > > further partition it between the workloads. It is much easier to charge > > > the underlying resource (i.e. memory) behind the inotify instances to > > > the memcg of the workload and let their memory limits limit the number > > > of inotify instances they can create. > > > > Not that I have a problem with this patch, but what problem does it try to > > solve? > > I am aiming for the simplicity to not set another limit which can > indirectly be limited by memcg limits. I just want to set the memcg > limit on our production environment which runs multiple workloads on a > system and not think about setting a sensible value to > max_user_instances in production. I would prefer to set > max_user_instances to max int and let the memcg limits of the > workloads limit their inotify usage. > understood. and I guess the multiple workloads cannot run each in their own userns? because then you wouldn't need to change max_user_instances limit. > > Are you concerned of users depleting system memory by creating > > userns's and allocating 128 * (struct fsnotify_group) at a time? > > > > IMO, that is not what max_user_instances was meant to protect against. > > There are two reasons I can think of to limit user instances: > > 1. Pre-memgc, user allocation of events is limited to > > <max_user_instances>*<max_queued_events> > > 2. Performance penalty. User can place <max_user_instances> > > watches on the same "hot" directory, that will cause any access to > > that directory by any task on the system to pay the penalty of traversing > > <max_user_instances> marks and attempt to queue <max_user_instances> > > events. That cost, including <max_user_instances> inotify_merge() loops > > could be significant > > > > #1 is not a problem anymore, since you already took care of accounting events > > to the user's memcg. > > #2 is not addressed by your patch. > > Yes, I am not addressing #2. Our workloads in prod have their own > private filesystems, so this is not an issue we observed. > > > > > > > > > Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx> > > > --- > > > fs/notify/group.c | 14 ++++++++++++-- > > > fs/notify/inotify/inotify_user.c | 5 +++-- > > > include/linux/fsnotify_backend.h | 2 ++ > > > 3 files changed, 17 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/notify/group.c b/fs/notify/group.c > > > index a4a4b1c64d32..fab3cfdb4d9e 100644 > > > --- a/fs/notify/group.c > > > +++ b/fs/notify/group.c > > > @@ -114,11 +114,12 @@ EXPORT_SYMBOL_GPL(fsnotify_put_group); > > > /* > > > * Create a new fsnotify_group and hold a reference for the group returned. > > > */ > > > -struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops) > > > +struct fsnotify_group *fsnotify_alloc_group_gfp(const struct fsnotify_ops *ops, > > > + gfp_t gfp) > > > { > > > struct fsnotify_group *group; > > > > > > - group = kzalloc(sizeof(struct fsnotify_group), GFP_KERNEL); > > > + group = kzalloc(sizeof(struct fsnotify_group), gfp); > > > if (!group) > > > return ERR_PTR(-ENOMEM); > > > > > > @@ -139,6 +140,15 @@ struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops) > > > > > > return group; > > > } > > > +EXPORT_SYMBOL_GPL(fsnotify_alloc_group_gfp); > > > + > > > +/* > > > + * Create a new fsnotify_group and hold a reference for the group returned. > > > + */ > > > +struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops) > > > +{ > > > + return fsnotify_alloc_group_gfp(ops, GFP_KERNEL); > > > +} > > > EXPORT_SYMBOL_GPL(fsnotify_alloc_group); > > > > > > int fsnotify_fasync(int fd, struct file *file, int on) > > > diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c > > > index 59c177011a0f..7cb528c6154c 100644 > > > --- a/fs/notify/inotify/inotify_user.c > > > +++ b/fs/notify/inotify/inotify_user.c > > > @@ -632,11 +632,12 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events) > > > struct fsnotify_group *group; > > > struct inotify_event_info *oevent; > > > > > > - group = fsnotify_alloc_group(&inotify_fsnotify_ops); > > > + group = fsnotify_alloc_group_gfp(&inotify_fsnotify_ops, > > > + GFP_KERNEL_ACCOUNT); > > > if (IS_ERR(group)) > > > return group; > > > > > > - oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL); > > > + oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL_ACCOUNT); > > > if (unlikely(!oevent)) { > > > fsnotify_destroy_group(group); > > > return ERR_PTR(-ENOMEM); > > > > Any reason why you did not include fanotify in this patch? > > The motivation was inotify's max_user_instances but we can charge > fsnotify_group for fanotify as well. Though I would prefer that to be > a separate patch. Let me know what you prefer? > I would prefer to add the helper fsnotify_alloc_user_group() that will use the GFP_KERNEL_ACCOUNT allocation flags internally. fsnotify_alloc_group() is used by all backends that initialize a single group instance for internal use and fsnotify_alloc_user_group() will be used by inotify/fanotify when users create instances. I see no reason to separate that to two patches. Thanks, Amir.