On Tue, Jul 14, 2020 at 5:54 AM Francesco Ruggeri <fruggeri@xxxxxxxxxx> wrote:
>
> We are getting this soft lockup in fanotify_read.
> The reason is that this code does not seem to scale to cases where there
> are big bursts of events generated by fanotify_handle_event.
> fanotify_read acquires group->notification_lock for each event.
> fanotify_handle_event uses the lock to add one event, which also involves
> fanotify_merge, which scans the whole list trying to find an event to
> merge the new one with.

Yes, that is a terribly inefficient merge algorithm.
If it helps I am carrying a quick brown paper bag fix for this issue in my tree:

@@ -65,6 +74,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 {
 	struct fsnotify_event *test_event;
 	struct fanotify_event *new;
+	int limit = 128;
+	int i = 0;

 	pr_debug("%s: list=%p event=%p\n", __func__, list, event);
 	new = FANOTIFY_E(event);
@@ -78,6 +89,9 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 		return 0;

 	list_for_each_entry_reverse(test_event, list, list) {
+		/* Event merges are expensive so should be limited */
+		if (++i > limit)
+			break;
 		if (should_merge(test_event, event)) {

It's somewhere down my TODO list to fix this properly with a hash table.

> In our case fanotify_read is invoked with a buffer big enough for 200
> events, and what happens is that every time fanotify_read dequeues an
> event and releases the lock, fanotify_handle_event adds several more,
> scanning a longer and longer list. This causes fanotify_read to wait
> longer and longer for the lock, and the soft lockup happens before
> fanotify_read can reach 200 events.
> Is it intentional for fanotify_read to acquire the lock for each event,
> rather than batching together a user buffer worth of events?

I think it is meant to allow for multiple reader threads to read events
with fairness, but not sure.
Even if it were fine to read a batch of events on every spinlock acquire,
making the code in the fanotify_read() loop behave well when an event fails
after a bunch of good events have already been read looks challenging, but
I didn't try.

Anyway, the root cause of the issue seems to be the inefficient merge and
not the spinlock taken per one event read.

Thanks,
Amir.
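
P.S. To make the effect of the limit concrete, here is a minimal userspace
model of a bounded merge scan. It is only a sketch under stated assumptions:
struct model_event, struct model_queue, queue_add(), QUEUE_CAP and the
last_scanned counter are all made up for illustration and are not the
kernel's fsnotify structures, and matching on object_id merely stands in
for should_merge(). The cap of 128 is taken from the quick fix quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model only. These types and names are hypothetical; they are
 * not struct fsnotify_event or fanotify_merge() from the kernel.
 */
#define QUEUE_CAP   1024
#define MERGE_LIMIT 128	/* same cap as the quick fix quoted above */

struct model_event {
	uint64_t object_id;	/* stands in for "same object, mergeable" */
	uint32_t mask;		/* event bits, OR-ed together on merge */
};

struct model_queue {
	struct model_event ev[QUEUE_CAP];
	size_t len;
	size_t last_scanned;	/* entries examined by the latest queue_add() */
};

/* Returns 1 if merged into a queued entry, 0 if appended, -1 if full. */
static int queue_add(struct model_queue *q, uint64_t object_id, uint32_t mask)
{
	size_t scanned = 0;

	assert(q != NULL);

	/* Walk from newest to oldest, giving up after MERGE_LIMIT entries. */
	for (size_t i = q->len; i > 0 && scanned < MERGE_LIMIT; i--) {
		scanned++;
		if (q->ev[i - 1].object_id == object_id) {
			q->ev[i - 1].mask |= mask;
			q->last_scanned = scanned;
			return 1;
		}
	}
	q->last_scanned = scanned;

	if (q->len == QUEUE_CAP)
		return -1;

	/* No merge candidate inside the scan window: queue a new entry. */
	q->ev[q->len].object_id = object_id;
	q->ev[q->len].mask = mask;
	q->len++;
	return 0;
}
```

In this model a burst of distinct events costs at most MERGE_LIMIT
comparisons per queue_add() no matter how long the queue gets. The price is
that an event whose match lies outside the scan window is not merged and is
queued a second time.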