> > So I have implemented this idea on fanotify_userns branch and the cost
> > per "filtered" sb mark is quite low - it's a pretty cheap check in
> > send_to_group()
> > But still, if an unbounded number of users can add to the sb mark list it is
> > not going to end well.
>
> Thinking out loud: So what is the cost going to be for the side generating
> events? Ideally it would be O(number of fanotify groups receiving event).
> We cannot get better than that and if the constants aren't too big I think
> this is acceptable overhead. Sure this can mean total work of (number of
> users) * (max number of subtree marks per user) for queueing notification
> event but I don't think it is practical for a DoS attack and I also don't
> think that in practice users will be watching overlapping subtrees that
> much.
>

Why overlapping?
My concern is not so much about DoS attacks.
My concern is more about innocent users adding up to an unacceptable
accumulated overhead.

Think of a filesystem mounted at /home/ with 100K directories at
/home/user$N. Every user gets their own idmapped mount from
systemd-homed and may or may not choose to run a listener to get
events generated under their own home dir (which is an idmapped mount).

Even if we limit each user to one sb mark, we can still end up with
a list of 100K marks on the sb.

For this reason I think we need to limit the number of marks per sb.
The simple way is a global config like max_queued_events, but I think
we can do better than that.

> The question is whether we can get that fast. Probably not because that
> would have to attach subtree watches to directory inodes or otherwise
> filter out unrelated fanotify groups in O(1). But the complexity of
> O(number of groups receiving events + depth of dir where event is happening)
> is probably achievable - we'd walk the tree up and have roots of watched
> subtrees marked. What do you think?
>

I am for that. I already posted a POC along those lines [1].
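To make the walk-up idea concrete, here is a rough userspace sketch.
It is not the POC code from [1]. The struct, the per-directory bitmask
of groups and both function names are made up for illustration, and it
assumes at most 64 groups so that a set of groups fits in one word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a directory; NOT the kernel's struct dentry. */
struct dir_node {
	struct dir_node *parent;	/* NULL at the filesystem root */
	uint64_t watchers;		/* bit N set: group N watches this subtree root */
};

/* Mark @root as the root of a subtree watched by @group (hypothetical helper). */
void mark_subtree(struct dir_node *root, unsigned int group)
{
	assert(root != NULL);
	assert(group < 64);
	root->watchers |= UINT64_C(1) << group;
}

/*
 * Walk from the directory where the event happened up to the root and
 * collect every group that has a marked subtree root on the way.
 * Cost is O(depth of dir); delivering to the returned groups then adds
 * O(number of groups receiving the event).
 */
uint64_t subtree_watchers(const struct dir_node *dir)
{
	uint64_t groups = 0;

	for (; dir != NULL; dir = dir->parent)
		groups |= dir->watchers;
	return groups;
}
```

The point of the sketch is only that unrelated groups are never visited:
an event under /home/foo touches the marks on /home/foo and its
ancestors, not the marks on /home/bar.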
I was just not sure how to limit the potential accumulated overhead.

[1] https://github.com/amir73il/linux/commits/fanotify_subtree_mark

> Also there is a somewhat related question what is the semantics of subtree
> watches in presence of mounts - do subtree watches "see through" mount
> points? Probably not but then with bind mounts this can be sometimes
> inconvenient / confusing - e.g. if I have /tmp bind-mounted to /var/tmp and
> I'm watching subtree of /var, I would not get events for what's in
> /var/tmp... Which is logical if you spell it out like this but applications
> often don't care how the mount hierarchy looks like, they just care about
> locally visible directory structure.

Those are hard questions.
I think that userns/mountns developers had to address them
a while ago, and I think there are some helpers that help with
checking visibility of paths.

>
> > <hand waving>
> > I think what we need here (thinking out loud) is to account the sb marks
> > to the user that mounted the filesystem or to the user mapped to admin using
> > idmapped mount, maybe to both(?), probably using a separate ucount entry
> > (e.g. max_fanotify_filesystem_marks).
>
> I'm somewhat lost here. Are these two users different? We have /home/foo
> which is a mounted filesystem. AFAIU it will be mounted in a special user
> namespace for user 'foo' - let's call it 'foo-ns'. /home/foo has idmapping
> attached so system [ug]ids are non-trivially mapped to on-disk [ug]ids. Now
> we have a user - let's call it 'foo-usr' that has enough capabilities
> (whatever they are) in 'foo-ns' to place fanotify subtree marks in
> /home/foo. So these marks are naturally accounted towards 'foo-usr'. To
> whom else you'd like to also account these marks and why?
>

I would like the system admin to be able to set a limit of, say,
100 sb marks on /home (filtered or not), because the number of sb
marks is what determines the cost of the send_to_group iteration.
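As a rough illustration of what such a per-sb limit amounts to, here is
a minimal userspace sketch that assumes a plain counter per sb. The
struct and function names are hypothetical, and returning -ENOSPC is
only an assumption, modeled on what fanotify_mark() already returns
when the per-user mark limit is exceeded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-sb accounting; NOT an existing kernel structure. */
struct sb_mark_account {
	unsigned int nr_marks;	/* sb marks currently attached */
	unsigned int max_marks;	/* admin-configured cap, e.g. 100 */
};

/* Charge one sb mark against the cap; 0 on success, -ENOSPC when full. */
int sb_mark_charge(struct sb_mark_account *acct)
{
	assert(acct != NULL);
	if (acct->nr_marks >= acct->max_marks)
		return -ENOSPC;
	acct->nr_marks++;
	return 0;
}

/* Release one previously charged sb mark. */
void sb_mark_uncharge(struct sb_mark_account *acct)
{
	assert(acct != NULL && acct->nr_marks > 0);
	acct->nr_marks--;
}
```

With such a cap in place, the length of the sb mark list is bounded by
max_marks no matter how many users have idmapped mounts of the
filesystem.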
I would also like systemd to be able to grant a smaller quota of
filtered sb marks per user when creating and mapping the idmapped
mounts at /home/foo$N.

I *think* we can achieve that by accounting the sb marks to uid 0
(who mounted /home) in a ucounts entry "fanotify_sb_marks".
If /home had been a FS_USERNS_MOUNT mounted inside some userns,
then all its sb marks would be accounted to uid 0 of that userns.

I have no idea if this all adds up.
My head explodes even from trying to express these rules :-/

Thanks,
Amir.