On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@xxxxxxx> wrote: > On Sun 06-11-16 08:45:54, Amir Goldstein wrote: >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@xxxxxxx> wrote: >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote: >> >> We've got a report where a fanotify daemon that implements permission checks >> >> screws up and doesn't send a reply. This then causes widespread hangs due to >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu() >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()-> >> >> fsnotify_mark_destroy_list() to block. >> > >> > Yes. But if a program implementing permission checks does not reply, your >> > system is likely hosed anyway. We can only try to somewhat limit the >> > damage... >> > >> >> That was my initial thought as well, but at least with the sample code >> Miklos sent >> the only thing that gets hosed is the one process watching that one file. >> You could think of a use case of fanotify being used to watch over files >> in a specific user directory, where the damage on the entire system >> should/could be limited. No? > > Yes, the damage could at least theoretically be limited only to those files > / dirs watched by the buggy process. > >> >> Below program demonstrates the issue. It should output a single line: >> >> >> >> close(inotify_fd): success >> >> >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by >> >> the waiting permission event. >> >> >> >> Wouldn't making the srcu per-group fix this? Would that be too expensive? >> > >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm >> > not sure srcu would scale to that. Furthermore the SRCU protects the list >> > of groups that need to get notification so it would not even be easily >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply >> > to his patch. I'll try to find something to improve the situation but so >> > far I have no good idea... >> > >> >> Yes, very much buggy indeed :/ >> Anyway, the reason I drafted it quickly was to highlight the fact that the >> marks only need to live to the point of decision whether or not the event >> should be sent to the group and afterwards, its sufficient to grab the >> group reference, without having impact on the entire system. > > Yes, fanotify code as such does not need the marks anymore. But the core > fsnotify code does... > >> Yet another possible ugly (but less buggy) solution would be >> to iterate all marks under SRCU read protection. >> If any group is about to block (either by suggested return value >> EAGAIN or another >> by using a new op should_handle_event_deferred), defer event handling to post >> marks iteration, by keeping a few group references on stack. > > And this does not work as well... Fanotify must notify groups by their > priority so you cannot arbitrarily reorder ordering in which groups get > notified. I'm currently pondering on using mark refcount to pin it when > processing permission event but there are still some details to check. > All right, mark refcount sound like the proper solution. In case this approach doesn't work out for some reason, you may want to consider fsnotify_mark_srcu (and destroy_list) per priority. Or just 2 srcu, one for for priority 0 and one for other. Because priority > 0 may block and priority 0 may not block. The priority set on an inode/vfsmount can be easily encoded into the i_fsnotify_mask/mnt_fsnotify_mask, e.g.: #define FS_GROUP_PRIO_1 0x00040000 /* fanotify content based access control */ #define FS_GROUP_PRIO_2 0x00080000 /* fanotify pre-content access */ in fsnotify(), only need to take the read side srcu of relevant priority groups, but need to take extra care to set the priority bit in the inode/mnt mask *before* adding the mark to the list, with a relevant memory barrier before checking the priority bits in fsnotify(). Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html