On Thu, Aug 5, 2021 at 3:15 PM Jan Kara <jack@xxxxxxx> wrote: > > On Wed 04-08-21 12:06:07, Gabriel Krisman Bertazi wrote: > > Wire up FAN_FS_ERROR in the fanotify_mark syscall. The event can only > > be requested for the entire filesystem, thus it requires the > > FAN_MARK_FILESYSTEM. > > > > FAN_FS_ERROR has to be handled slightly differently from other events > > because it needs to be submitted in an atomic context, using > > preallocated memory. This patch implements the submission path by only > > storing the first error event that happened in the slot (userspace > > resets the slot by reading the event). > > > > Extra error events happening when the slot is occupied are merged to the > > original report, and the only information keep for these extra errors is > > an accumulator counting the number of events, which is part of the > > record reported back to userspace. > > > > Reporting only the first event should be fine, since when a FS error > > happens, a cascade of error usually follows, but the most meaningful > > information is (usually) on the first erro. > > > > The event dequeueing is also a bit special to avoid losing events. Since > > event merging only happens while the event is queued, there is a window > > between when an error event is dequeued (notification_lock is dropped) > > until it is reset (.free_event()) where the slot is full, but no merges > > can happen. > > > > The proposed solution is to replace the event in the slot with a new > > structure, prior to dropping the lock. This way, if a new event arrives > > in the time between the event was dequeued and the time it resets, the > > new errors will still be logged and merged in the new slot. > > > > Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx> > > The splitting of the patches really helped. Now I think I can grok much > more details than before :) Thanks! Some comments below. > > > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c > > index 0678d35432a7..4e9e271a4394 100644 > > --- a/fs/notify/fanotify/fanotify.c > > +++ b/fs/notify/fanotify/fanotify.c > > @@ -681,6 +681,42 @@ static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info) > > return fsid; > > } > > > > +static int fanotify_merge_error_event(struct fsnotify_group *group, > > + struct fsnotify_event *event) > > +{ > > + struct fanotify_event *fae = FANOTIFY_E(event); > > + struct fanotify_error_event *fee = FANOTIFY_EE(fae); > > + > > + /* > > + * When err_count > 0, the reporting slot is full. Just account > > + * the additional error and abort the insertion. > > + */ > > + if (fee->err_count) { > > + fee->err_count++; > > + return 1; > > + } > > + > > + return 0; > > +} > > + > > +static void fanotify_insert_error_event(struct fsnotify_group *group, > > + struct fsnotify_event *event, > > + const void *data) > > +{ > > + const struct fs_error_report *report = (struct fs_error_report *) data; > > + struct fanotify_event *fae = FANOTIFY_E(event); > > + struct fanotify_error_event *fee; > > + > > + /* This might be an unexpected type of event (i.e. overflow). */ > > + if (!fanotify_is_error_event(fae->mask)) > > + return; > > + > > + fee = FANOTIFY_EE(fae); > > + fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR; > > + fee->error = report->error; > > + fee->err_count = 1; > > +} > > + > > /* > > * Add an event to hash table for faster merge. > > */ > > @@ -735,7 +771,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask, > > BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM); > > BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR); > > > > - BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19); > > + BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 20); > > > > mask = fanotify_group_event_mask(group, iter_info, mask, data, > > data_type, dir); > > @@ -760,6 +796,18 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask, > > return 0; > > } > > > > + if (fanotify_is_error_event(mask)) { > > + struct fanotify_sb_mark *sb_mark = > > + FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info)); > > + > > + ret = fsnotify_insert_event(group, > > + &sb_mark->fee_slot->fae.fse, > > + fanotify_merge_error_event, > > + fanotify_insert_error_event, > > + data); > > + goto finish; > > + } > > Hum, seeing this and how you had to extend fsnotify_add_event() to > accommodate this use, cannot we instead have something like: > > if (fanotify_is_error_event(mask)) { > struct fanotify_sb_mark *sb_mark = > FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info)); > struct fanotify_error_event *event = &sb_mark->fee_slot; > bool queue = false; > > spin_lock(&group->notification_lock); > /* Not yet queued? */ > if (!event->err_count) { > fee->error = report->error; > queue = true; > } > event->err_count++; > spin_unlock(&group->notification_lock); > if (queue) { > ... fill in other error info in 'event' such as fhandle > fsnotify_add_event(group, &event->fae.fse, NULL); > } > } > > It would be IMHO simpler to follow what's going on and we don't have to > touch fsnotify_add_event(). I do recognize that due to races it may happen > that some racing fsnotify(FAN_FS_ERROR) call returns before the event is > actually visible in the event queue. It don't think it really matters but > if we wanted to be more careful, we would need to preformat fhandle into a > local buffer and only copy it into the event under notification_lock when > we see the event is unused. > > > +/* > > + * Replace a mark's error event with a new structure in preparation for > > + * it to be dequeued. This is a bit annoying since we need to drop the > > + * lock, so another thread might just steal the event from us. > > + */ > > +static int fanotify_replace_fs_error_event(struct fsnotify_group *group, > > + struct fanotify_event *fae) > > +{ > > + struct fanotify_error_event *new, *fee = FANOTIFY_EE(fae); > > + struct fanotify_sb_mark *sb_mark = fee->sb_mark; > > + struct fsnotify_event *fse; > > + > > + pr_debug("%s: event=%p\n", __func__, fae); > > + > > + assert_spin_locked(&group->notification_lock); > > + > > + spin_unlock(&group->notification_lock); > > + new = fanotify_alloc_error_event(sb_mark); > > + spin_lock(&group->notification_lock); > > + > > + if (!new) > > + return -ENOMEM; > > + > > + /* > > + * Since we temporarily dropped the notification_lock, the event > > + * might have been taken from under us and reported by another > > + * reader. If that is the case, don't play games, just retry. > > + */ > > + fse = fsnotify_peek_first_event(group); > > + if (fse != &fae->fse) { > > + kfree(new); > > + return -EAGAIN; > > + } > > + > > + sb_mark->fee_slot = new; > > + > > + return 0; > > +} > > + > > /* > > * Get an fanotify notification event if one exists and is small > > * enough to fit in "count". Return an error pointer if the count > > @@ -212,9 +252,21 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group, > > goto out; > > } > > > > + if (fanotify_is_error_event(event->mask)) { > > + /* > > + * Replace the error event ahead of dequeueing so we > > + * don't need to handle a incorrectly dequeued event. > > + */ > > + ret = fanotify_replace_fs_error_event(group, event); > > + if (ret) { > > + event = ERR_PTR(ret); > > + goto out; > > + } > > + } > > + > > The replacing, retry, and all is hairy. Cannot we just keep the same event > attached to the sb mark and copy-out to on-stack buffer under > notification_lock in get_one_event()? The event is big (due to fhandle) but > fanotify_read() is not called from a deep call chain so we should have > enough space on stack for that. > For the record, this was one of the first implementations from Gabriel. When I proposed the double buffer implementation it was either that or go back to copy to stack. Given the complications, I agree that going back to copy to stack is preferred. Thanks, Amir.