On Wed 09-11-16 20:26:16, Amir Goldstein wrote: > On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@xxxxxxx> wrote: > > On Sun 06-11-16 08:45:54, Amir Goldstein wrote: > >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@xxxxxxx> wrote: > >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote: > >> >> We've got a report where a fanotify daemon that implements permission checks > >> >> screws up and doesn't send a reply. This then causes widespread hangs due to > >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu() > >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()-> > >> >> fsnotify_mark_destroy_list() to block. > >> > > >> > Yes. But if a program implementing permission checks does not reply, your > >> > system is likely hosed anyway. We can only try to somewhat limit the > >> > damage... > >> > > >> > >> That was my initial thought as well, but at least with the sample code > >> Miklos sent > >> the only thing that gets hosed is the one process watching that one file. > >> You could think of a use case of fanotify being used to watch over files > >> in a specific user directory, where the damage on the entire system > >> should/could be limited. No? > > > > Yes, the damage could at least theoretically be limited only to those files > > / dirs watched by the buggy process. > > > >> >> Below program demonstrates the issue. It should output a single line: > >> >> > >> >> close(inotify_fd): success > >> >> > >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by > >> >> the waiting permission event. > >> >> > >> >> Wouldn't making the srcu per-group fix this? Would that be too expensive? > >> > > >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm > >> > not sure srcu would scale to that. Furthermore the SRCU protects the list > >> > of groups that need to get notification so it would not even be easily > >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply > >> > to his patch. I'll try to find something to improve the situation but so > >> > far I have no good idea... > >> > > >> > >> Yes, very much buggy indeed :/ > >> Anyway, the reason I drafted it quickly was to highlight the fact that the > >> marks only need to live to the point of decision whether or not the event > >> should be sent to the group and afterwards, its sufficient to grab the > >> group reference, without having impact on the entire system. > > > > Yes, fanotify code as such does not need the marks anymore. But the core > > fsnotify code does... > > > >> Yet another possible ugly (but less buggy) solution would be > >> to iterate all marks under SRCU read protection. > >> If any group is about to block (either by suggested return value > >> EAGAIN or another > >> by using a new op should_handle_event_deferred), defer event handling to post > >> marks iteration, by keeping a few group references on stack. > > > > And this does not work as well... Fanotify must notify groups by their > > priority so you cannot arbitrarily reorder ordering in which groups get > > notified. I'm currently pondering on using mark refcount to pin it when > > processing permission event but there are still some details to check. > > > > All right, mark refcount sound like the proper solution. Except it doesn't quite work. We can pin the current marks by a refcount but they can still be removed from the list so after we regain srcu lock, we are not sure their ->next pointers still point to still allocated marks :-| Sadly I realized this only after implementing all this. > In case this approach doesn't work out for some reason, you may want > to consider fsnotify_mark_srcu (and destroy_list) per priority. > Or just 2 srcu, one for for priority 0 and one for other. > Because priority > 0 may block and priority 0 may not block. > > The priority set on an inode/vfsmount can be easily encoded into the > i_fsnotify_mask/mnt_fsnotify_mask, e.g.: > #define FS_GROUP_PRIO_1 0x00040000 /* fanotify content > based access control */ > #define FS_GROUP_PRIO_2 0x00080000 /* fanotify > pre-content access */ > > in fsnotify(), only need to take the read side srcu of relevant priority groups, > but need to take extra care to set the priority bit in the inode/mnt > mask *before* > adding the mark to the list, with a relevant memory barrier before checking > the priority bits in fsnotify(). Well but how would you like to protect the mark list hanging off the inode / mountpoint with two SRCUs? You'd need two lists hanging off the inode & mountpoint (for different priorities) and that's too big cost to pay to accomodate broken userspace... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html