Re: fanotify - overall design before I start sending patches

Eric Paris <eparis@xxxxxxxxxx> · Fri, 24 Jul 2009 17:21:25 -0400

On Fri, 2009-07-24 at 15:00 -0600, Andreas Dilger wrote:
> On Jul 24, 2009  16:13 -0400, Eric Paris wrote:
> > fanotify kernel/userspace interaction is over a new socket protocol.  A
> > listener opens a new socket in the new PF_FANOTIFY family.  The socket
> > is then bound to an address.  Using the following struct:
> 
> Would it make sense to use existing netlink?

I looked at netlink, but because fd creation has to be done in the
listener context I couldn't figure out how to make it suitable.

> > struct fanotify_addr {
> >         sa_family_t family;
> >         __u32 priority;
> >         __u32 group_num;
> >         __u32 mask;
> >         __u32 f_flags;
> >         __u32 unused[16];
> > }  __attribute__((packed));
> > 
> > The mask indicates the events this group is interested in.  It is the
> > set of events of interest if FAN_GLOBAL_LISTENER is set at bind time.
> > If FAN_GLOBAL_LISTENER is not set, this field is meaningless, as the
> > registration of events on individual inodes will dictate the reception
> > of events.
> > 
> > * FAN_ACCESS: every file access.
> > * FAN_MODIFY: file modifications.
> > * FAN_CLOSE: files are closed.
> > * FAN_OPEN: open() calls.
> > * FAN_ACCESS_PERM: like FAN_ACCESS, except that the process trying to
> > access the file is put on hold while the fanotify client decides whether
> > to allow the operation.
> > * FAN_OPEN_PERM: like FAN_OPEN, but with the permission check.
> > * FAN_EVENT_ON_CHILD: receive notification of events on inodes inside
> > this subdirectory. (this is not a full recursive notification of all
> > descendants, only direct children)
> > * FAN_GLOBAL_LISTENER: notify for events on all files in the system.
> > * FAN_SURVIVE_MODIFY: special flag indicating that ignores should
> > survive inode modification.  Discussed below.
> 
> It seems like a 32-bit mask might not be enough; it wouldn't be hard
> at this stage to add a 64-bit mask.  Lustre has a similar mechanism
> (changelog) that allows tracking all different kinds of filesystem
> events (create/unlink/symlink/link/rename/mkdir/setxattr/etc), instead
> of just open/close; it is also used by HSM, enhanced rsync, etc.

I had a 64-bit mask, but Al Viro asked me to go back to a 32-bit mask
because of i386 register pressure.  The bitmask operations are on VERY
hot paths inside the kernel.
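
To make the cost concrete: the hot-path check is roughly a single AND of
the group's mask against the event's mask.  A 32-bit mask fits in one
register on i386, while a 64-bit mask needs two.  A minimal sketch, with
hypothetical bit values (not taken from the patches) and a made-up helper
name:

```c
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define FAN_ACCESS  0x00000001u
#define FAN_MODIFY  0x00000002u
#define FAN_CLOSE   0x00000004u
#define FAN_OPEN    0x00000008u

/*
 * group_wants_event() is a made-up name.  It returns nonzero if the
 * group asked for at least one of the bits set in the event.
 */
static inline int group_wants_event(uint32_t group_mask, uint32_t event_mask)
{
        return (group_mask & event_mask) != 0;
}
```

A group bound with FAN_OPEN | FAN_CLOSE would match a FAN_OPEN event here
and skip a FAN_MODIFY event.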

> > struct fanotify_event_metadata {
> >         __u32 event_len;
> >         __s32 fd;
> >         __u32 mask;
> >         __u32 f_flags;
> >         __s32 pid;
> >         __s32 tgid;
> >         __u64 cookie;
> > }  __attribute__((packed));
> 
> Getting the attributes that have changed into this message is also
> useful, as it avoids a continual stream of "stat" calls on the inodes.

Hmmm, I'll take a look.  Do you have a good example of what you would
want to see?  I don't think we know in the notification hooks what
actually is being changed  :(
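
If extra data were ever appended to the event, a reader that steps
through the buffer by event_len rather than by sizeof() would keep
working.  A minimal sketch of such a reader, assuming events arrive
packed back to back in one buffer (the transport details are not spelled
out here) and using stdint types in place of the kernel's __u32/__s32;
walk_events() is a made-up helper:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fanotify_event_metadata {
        uint32_t event_len;
        int32_t  fd;
        uint32_t mask;
        uint32_t f_flags;
        int32_t  pid;
        int32_t  tgid;
        uint64_t cookie;
} __attribute__((packed));

/*
 * Walk a buffer of back-to-back events.  Stores up to max_fds event fds
 * in fds[] and returns the number of complete events seen, or -1 if an
 * event_len is shorter than the fixed header or runs past the buffer.
 * Trailing bytes too short to hold a header are ignored.
 */
static int walk_events(const unsigned char *buf, size_t len,
                       int32_t *fds, int max_fds)
{
        size_t off = 0;
        int n = 0;

        while (len - off >= sizeof(struct fanotify_event_metadata)) {
                struct fanotify_event_metadata ev;

                /* Copy out to avoid unaligned access into the raw buffer. */
                memcpy(&ev, buf + off, sizeof(ev));
                if (ev.event_len < sizeof(ev) || ev.event_len > len - off)
                        return -1;
                if (n < max_fds)
                        fds[n] = ev.fd;
                n++;
                off += ev.event_len;
        }
        return n;
}
```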

> The other thing that is important for HSM is that this log is atomic
> and persistent, otherwise there may be files that are missed if the
> node crashes.  This involves creating atomic update records as part
> of the filesystem operation, and then userspace consumes them and
> tells the kernel that it is finished with records up to X.  Otherwise
> you risk inconsistencies between rsync/HSM/updatedb for files that
> are updated just before a crash.

Uhhh, persistent across a crash?  Nope, don't have that.  Notification
is all in memory.  Can't I just put the onus on userspace to recheck
things maybe?  Sounds like a user for i_version....

> > If a FAN_ACCESS_PERM or FAN_OPEN_PERM event is received the listener
> > must send a response before the 5 second timeout.  If no response is
> > sent before the 5 second timeout the original operation is allowed.  If
> > this happens too many times (10 in a row) the fanotify group is evicted
> > from the kernel and will not get any new events.
> 
> This should be a tunable, since if the intent is to monitor PERM checks
> it would be possible for users to DOS the machine and delay the userspace
> programs and access files they shouldn't be able to.

At the moment I cheat and say only root can bind.  I do plan to open it
up to non-root users after it's in and working, but I'm seriously
considering leaving _PERM events as root only.  It's hard to map the
security implications from the original process to the listener.  So
making sure the listener is always root is easy   :)

Userspace would never be able to access a file it shouldn't be allowed
to (the new fd is created in the context of the listener and EPERM is
possible.)
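
For illustration, the timeout policy quoted above (allow on timeout,
evict after 10 misses in a row) could be modelled as below.  This is a
sketch only, not the code from the patches: the struct, enum and function
names are made up, and it assumes that a timely response resets the miss
counter and that an evicted group no longer blocks anything.

```c
#include <stdbool.h>

#define PERM_TIMEOUT_SECS    5   /* response deadline from the description */
#define MAX_MISSED_IN_A_ROW  10  /* consecutive misses before eviction */

enum perm_decision { PERM_ALLOW, PERM_DENY };

/* Hypothetical per-group bookkeeping. */
struct perm_group_state {
        unsigned int missed_in_a_row;
        bool evicted;
};

/*
 * responded:       the listener answered within PERM_TIMEOUT_SECS
 * listener_allows: its answer (only meaningful if responded is true)
 */
static enum perm_decision perm_resolve(struct perm_group_state *g,
                                       bool responded, bool listener_allows)
{
        if (g->evicted)
                return PERM_ALLOW;      /* group gets no new events */

        if (responded) {
                g->missed_in_a_row = 0; /* assumption: streak resets */
                return listener_allows ? PERM_ALLOW : PERM_DENY;
        }

        /* Timed out: the original operation is allowed. */
        if (++g->missed_in_a_row >= MAX_MISSED_IN_A_ROW)
                g->evicted = true;
        return PERM_ALLOW;
}
```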
