On Tue, Feb 16, 2021 at 5:02 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Tue 02-02-21 18:20:05, Amir Goldstein wrote:
> > In order to improve event merge performance, support sharding the
> > notification queue to several notification lists, hashed by an event
> > merge key.
> >
> > The fanotify event merge key is the object id reduced to 32bit hash.
> >
> > At the moment, both inotify and fanotify still use a single notification
> > list.
> >
> > At the moment, reading events from the hashed queue is not by event
> > insertion order. A non-empty list is read until it is empty.
> >
> > The max events limit is accounted for total events in all lists.
> >
> > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
>
> Style nit: Please try to stay within 80 columns...
>
> Also I think this change would be more comprehensible if it was merged with
> the following patch 3/8. Having code with TODOs is kind of awkward and
> makes it more difficult to verify the result is actually correct.
>

Ok.

> > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > index 8ff27d92d32c..dee12d927f8d 100644
> > --- a/fs/notify/fanotify/fanotify_user.c
> > +++ b/fs/notify/fanotify/fanotify_user.c
> > @@ -635,6 +635,7 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
> >  {
> >         struct fsnotify_group *group;
> >         struct fsnotify_event *fsn_event;
> > +       struct list_head *list;
> >         void __user *p;
> >         int ret = -ENOTTY;
> >         size_t send_len = 0;
> > @@ -646,8 +647,15 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
> >         switch (cmd) {
> >         case FIONREAD:
> >                 spin_lock(&group->notification_lock);
> > -               list_for_each_entry(fsn_event, &group->notification_list, list)
> > -                       send_len += FAN_EVENT_METADATA_LEN;
> > +               list = fsnotify_first_notification_list(group);
> > +               /*
> > +                * With multi queue, send_len will be a lower bound
> > +                * on total events size.
> > +                */
> > +               if (list) {
> > +                       list_for_each_entry(fsn_event, list, list)
> > +                               send_len += FAN_EVENT_METADATA_LEN;
> > +               }
>
> IMO we should just walk all the lists here. Otherwise reported length would
> be just odd. That being said the returned value already is odd because we
> don't properly account for different event sizes (unrelated problem). So if
> we want to keep it simple for now, we can just return group->num_events *
> FAN_EVENT_METADATA_LEN, can't we?

Sure. Makes sense.

>
> > diff --git a/fs/notify/group.c b/fs/notify/group.c
> > index ffd723ffe46d..b41ed68f9ff9 100644
> > --- a/fs/notify/group.c
> > +++ b/fs/notify/group.c
> > @@ -111,12 +116,20 @@ void fsnotify_put_group(struct fsnotify_group *group)
> >  }
> >  EXPORT_SYMBOL_GPL(fsnotify_put_group);
> >
> > -static struct fsnotify_group *__fsnotify_alloc_group(
> > +static struct fsnotify_group *__fsnotify_alloc_group(unsigned int q_hash_bits,
> >                                 const struct fsnotify_ops *ops, gfp_t gfp)
> >  {
> >         struct fsnotify_group *group;
> > +       int i;
> > +
> > +#ifdef FSNOTIFY_HASHED_QUEUE
> > +       if (WARN_ON_ONCE(q_hash_bits > FSNOTIFY_HASHED_QUEUE_MAX_BITS))
> > +               q_hash_bits = FSNOTIFY_HASHED_QUEUE_MAX_BITS;
> > +#else
> > +       q_hash_bits = 0;
> > +#endif
>
> Why the check for q_hash_bits? We have in-kernel users only so I would not
> be that paranoid :) Maybe I'd just let the group specify whether it wants
> hashed queue or not and for hashed queues always set number of buckets to
> 128. So far we don't need anything more flexible and if we ever do, it's
> easy enough to change the in-kernel API...

Ok.
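So for v2 I am thinking of something along these lines (untested sketch; the
bool argument and the FSNOTIFY_HASHED_QUEUE_BITS name are just placeholders,
not from the posted patch):

/* Untested sketch: callers only say whether they want a hashed queue */
#define FSNOTIFY_HASHED_QUEUE_BITS     7       /* 128 notification lists */

struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops,
                                                 bool hashed_queue)
{
        unsigned int q_hash_bits = hashed_queue ? FSNOTIFY_HASHED_QUEUE_BITS : 0;

        return __fsnotify_alloc_group(q_hash_bits, ops, GFP_KERNEL_ACCOUNT);
}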
>
> Also, honestly, I find these ifdefs ugly. I'd just leave hashed queues
> unconditionally compiled in.
>

Ok.

> > -       group = kzalloc(sizeof(struct fsnotify_group), gfp);
> > +       group = kzalloc(fsnotify_group_size(q_hash_bits), gfp);
> >         if (!group)
> >                 return ERR_PTR(-ENOMEM);
> >
> > @@ -126,8 +139,12 @@ static struct fsnotify_group *__fsnotify_alloc_group(
> >         atomic_set(&group->user_waits, 0);
> >
> >         spin_lock_init(&group->notification_lock);
> > -       INIT_LIST_HEAD(&group->notification_list);
> >         init_waitqueue_head(&group->notification_waitq);
> > +       /* Initialize one or more notification lists */
> > +       group->q_hash_bits = q_hash_bits;
> > +       group->max_bucket = (1 << q_hash_bits) - 1;
> > +       for (i = 0; i <= group->max_bucket; i++)
> > +               INIT_LIST_HEAD(&group->notification_list[i]);
> >         group->max_events = UINT_MAX;
> >
> >         mutex_init(&group->mark_mutex);
> > @@ -139,20 +156,24 @@ static struct fsnotify_group *__fsnotify_alloc_group(
> >  }
> >
> >  /*
> > - * Create a new fsnotify_group and hold a reference for the group returned.
> > + * Create a new fsnotify_group with no events queue.
>
> How come "no events queue"? There will be always at least one queue...
>

Bad phrasing. The backends that use fsnotify_alloc_group() do not use the
events queue. I considered not allocating any event queue for them, but
decided it is not worth the risk of introducing bugs.

> > + * Hold a reference for the group returned.
> >   */
> >  struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops)
> >  {
> > -       return __fsnotify_alloc_group(ops, GFP_KERNEL);
> > +       return __fsnotify_alloc_group(0, ops, GFP_KERNEL);
> >  }
> >  EXPORT_SYMBOL_GPL(fsnotify_alloc_group);
> >
> >  /*
> > - * Create a new fsnotify_group and hold a reference for the group returned.
> > + * Create a new fsnotify_group with an events queue.
> > + * If @q_hash_bits > 0, the queue is sharded into several notification lists.
> > + * Hold a reference for the group returned.
> >   */
> > -struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops)
> > +struct fsnotify_group *fsnotify_alloc_user_group(unsigned int q_hash_bits,
> > +                                                const struct fsnotify_ops *ops)
> >  {
> > -       return __fsnotify_alloc_group(ops, GFP_KERNEL_ACCOUNT);
> > +       return __fsnotify_alloc_group(q_hash_bits, ops, GFP_KERNEL_ACCOUNT);
> >  }
> >  EXPORT_SYMBOL_GPL(fsnotify_alloc_user_group);
> >
> > diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> > index d8830be60a9b..1c476b9485dc 100644
> > --- a/fs/notify/inotify/inotify_user.c
> > +++ b/fs/notify/inotify/inotify_user.c
> > @@ -288,6 +288,7 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> >  {
> >         struct fsnotify_group *group;
> >         struct fsnotify_event *fsn_event;
> > +       struct list_head *list;
> >         void __user *p;
> >         int ret = -ENOTTY;
> >         size_t send_len = 0;
> > @@ -300,10 +301,16 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> >         switch (cmd) {
> >         case FIONREAD:
> >                 spin_lock(&group->notification_lock);
> > -               list_for_each_entry(fsn_event, &group->notification_list,
> > -                                   list) {
> > -                       send_len += sizeof(struct inotify_event);
> > -                       send_len += round_event_name_len(fsn_event);
> > +               list = fsnotify_first_notification_list(group);
> > +               /*
> > +                * With multi queue, send_len will be a lower bound
> > +                * on total events size.
> > +                */
> > +               if (list) {
> > +                       list_for_each_entry(fsn_event, list, list) {
> > +                               send_len += sizeof(struct inotify_event);
> > +                               send_len += round_event_name_len(fsn_event);
> > +                       }
>
> As I write below IMO we should enable hashed queues also for inotify (is
> there good reason not to?)

I see your perception of inotify_merge() is the same as mine was when I wrote
a patch to support hashed queues for inotify. Only after writing it did I
realize that inotify_merge() only ever merges with the last event, so I
dropped that patch. I see no reason to change this long-standing behavior.

> and here we should just walk through all the lists.

Ok.

>
> >                 }
> >                 spin_unlock(&group->notification_lock);
> >                 ret = put_user(send_len, (int __user *) p);
> > @@ -631,7 +638,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
> >         struct fsnotify_group *group;
> >         struct inotify_event_info *oevent;
> >
> > -       group = fsnotify_alloc_user_group(&inotify_fsnotify_ops);
> > +       group = fsnotify_alloc_user_group(0, &inotify_fsnotify_ops);
> >         if (IS_ERR(group))
> >                 return group;
> >
> > diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> > index fcac2d72cbf5..58c8f6c1be1b 100644
> > --- a/fs/notify/notification.c
> > +++ b/fs/notify/notification.c
> ...
> > @@ -74,10 +67,75 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
> >         group->ops->free_event(event);
> >  }
> >
> > +/* Check and fix inconsistencies in hashed queue */
> > +static void fsnotify_queue_check(struct fsnotify_group *group)
> > +{
> > +#ifdef FSNOTIFY_HASHED_QUEUE
> > +       struct list_head *list;
> > +       int i, nbuckets = 0;
> > +       bool first_empty, last_empty;
> > +
> > +       assert_spin_locked(&group->notification_lock);
> > +
> > +       pr_debug("%s: group=%p events: num=%u max=%u buckets: first=%u last=%u max=%u\n",
> > +                __func__, group, group->num_events, group->max_events,
> > +                group->first_bucket, group->last_bucket, group->max_bucket);
> > +
> > +       if (fsnotify_notify_queue_is_empty(group))
> > +               return;
> > +
> > +       first_empty = list_empty(&group->notification_list[group->first_bucket]);
> > +       last_empty = list_empty(&group->notification_list[group->last_bucket]);
> > +
> > +       list = &group->notification_list[0];
> > +       for (i = 0; i <= group->max_bucket; i++, list++) {
> > +               if (list_empty(list))
> > +                       continue;
> > +               if (nbuckets++)
> > +                       continue;
> > +               if (first_empty)
> > +                       group->first_bucket = i;
> > +               if (last_empty)
> > +                       group->last_bucket = i;
> > +       }
> > +
> > +       pr_debug("%s: %u non-empty buckets\n", __func__, nbuckets);
> > +
> > +       /* All buckets are empty, but non-zero num_events? */
> > +       if (WARN_ON_ONCE(!nbuckets && group->num_events))
> > +               group->num_events = 0;
> > +#endif
>
> Hum, what is this function about? I understand you might want to have this
> around for debugging when developing the patch but is there a legitimate
> reason for this in production?
>
> I understand current patch uses this function for searching for next
> non-empty list but after this patch is merged with the next one, is there
> still a need for this?

It is only needed for debugging (added by the last patch). I wanted to make
sure that the hash is balanced. I'll keep this code around only for the
debug print on queue overflow.
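Roughly what I mean to keep is something like this (untested sketch; the
function name is a placeholder) that only counts non-empty buckets for a
debug print when the queue overflows:

/* Untested sketch: count non-empty buckets for a debug print on overflow */
static void fsnotify_queue_overflow_debug(struct fsnotify_group *group)
{
        struct list_head *list = &group->notification_list[0];
        unsigned int i, nbuckets = 0;

        assert_spin_locked(&group->notification_lock);

        for (i = 0; i <= group->max_bucket; i++, list++) {
                if (!list_empty(list))
                        nbuckets++;
        }

        pr_debug("%s: group=%p events: num=%u max=%u non-empty buckets: %u/%u\n",
                 __func__, group, group->num_events, group->max_events,
                 nbuckets, group->max_bucket + 1);
}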
> > > @@ -147,17 +228,21 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group, > > struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group) > > { > > struct fsnotify_event *event; > > + struct list_head *list; > > > > assert_spin_locked(&group->notification_lock); > > > > - if (fsnotify_notify_queue_is_empty(group)) > > + list = fsnotify_first_notification_list(group); > > + if (!list) > > return NULL; > > > > - pr_debug("%s: group=%p\n", __func__, group); > > + pr_debug("%s: group=%p bucket=%u\n", __func__, group, group->first_bucket); > > > > - event = list_first_entry(&group->notification_list, > > - struct fsnotify_event, list); > > + event = list_first_entry(list, struct fsnotify_event, list); > > Perhaps the above could just reuse fsnotify_peek_first_event()? It is now > complex enough to be worth it I guess. > > > fsnotify_remove_queued_event(group, event); > > + /* > > + * TODO: update group->first_bucket to next_bucket in first event. > > + */ > > return event; > > } > > > > @@ -167,13 +252,15 @@ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group) > > */ > > struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group) > > { > > + struct list_head *list; > > + > > assert_spin_locked(&group->notification_lock); > > > > - if (fsnotify_notify_queue_is_empty(group)) > > + list = fsnotify_first_notification_list(group); > > + if (!list) > > return NULL; > > > > - return list_first_entry(&group->notification_list, > > - struct fsnotify_event, list); > > + return list_first_entry(list, struct fsnotify_event, list); > > } > > > > /* > > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h > > index e5409b83e731..b2a80bc00108 100644 > > --- a/include/linux/fsnotify_backend.h > > +++ b/include/linux/fsnotify_backend.h > > @@ -160,6 +160,11 @@ struct fsnotify_ops { > > void (*free_mark)(struct fsnotify_mark *mark); > > }; > > > > +#ifdef CONFIG_FANOTIFY > > +#define FSNOTIFY_HASHED_QUEUE > > +#define FSNOTIFY_HASHED_QUEUE_MAX_BITS 8 > > +#endif > > + > > Would not inotify benefit from this work as well? See above. inotify is not doing linear search. > Also I'd just compile in > hashed queues unconditionally. The ifdefs are ugly and IMHO not worth it. > OK. > ... > > > +static inline size_t fsnotify_group_size(unsigned int q_hash_bits) > > +{ > > + return sizeof(struct fsnotify_group) + (sizeof(struct list_head) << q_hash_bits); > > +} > > + > > +static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group, > > + struct fsnotify_event *event) > > +{ > > + /* High bits are better for hash */ > > + return (event->key >> (32 - group->q_hash_bits)) & group->max_bucket; > > +} > > Why not use hash_32() here? IMHO better than just stripping bits... See hash_ptr(). There is a reason to use the highest bits. > > > + > > +static inline struct list_head *fsnotify_event_notification_list( > > + struct fsnotify_group *group, > > + struct fsnotify_event *event) > > +{ > > + return &group->notification_list[fsnotify_event_bucket(group, event)]; > > +} > > + > > Any reason for this to be in a header? Will it ever have other user than > fsnotify_add_event()? I guess not. I went thought several alternative patches. It may have made sense in another variant. Thanks, Amir.