On Wed 24-07-24 15:15:35, Christian Brauner wrote: > Systemd has a helper called openat_report_new() that returns whether a > file was created anew or it already existed before for cases where > O_CREAT has to be used without O_EXCL (cf. [1]). That apparently isn't > something that's specific to systemd but it's where I noticed it. > > The current logic is that it first attempts to open the file without > O_CREAT | O_EXCL and if it gets ENOENT the helper tries again with both > flags. If that succeeds all is well. If it now reports EEXIST it > retries. > > That works fairly well but some corner cases make this more involved. If > this operates on a dangling symlink the first openat() without O_CREAT | > O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL > will fail with EEXIST. The reason is that openat() without O_CREAT | > O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security > reasons. So it's not something we can really change unless we add an > explicit opt-in via O_FOLLOW which seems really ugly. > > The caller could try and use fanotify() to register to listen for > creation events in the directory before calling openat(). The caller > could then compare the returned tid to its own tid to ensure that even > in threaded environments it actually created the file. That might work > but is a lot of work for something that should be fairly simple and I'm > uncertain about it's reliability. > > The caller could use a bpf lsm hook to hook into security_file_open() to > figure out whether they created the file. That also seems a bit wild. > > So let's add F_CREATED which allows the caller to check whether they > actually did create the file. That has caveats of course but I don't > think they are problematic: > > * In multi-threaded environments a thread can only be sure that it did > create the file if it calls openat() with O_CREAT. In other words, > it's obviously not enough to just go through it's fdtable and check > these fds because another thread could've created the file. > > * If there's any codepaths where an openat() with O_CREAT would yield > the same struct file as that of another thread it would obviously > cause wrong results. I'm not aware of any such codepaths from openat() > itself. Imho, that would be a bug. > > * Related to the previous point, calling the new fcntl() on files created > and opened via special-purpose system calls or ioctl()s would cause > wrong results only if the affected subsystem a) raises FMODE_CREATED > and b) may return the same struct file for two different calls. I'm > not seeing anything outside of regular VFS code that raises > FMODE_CREATED. > > There is code for b) in e.g., the drm layer where the same struct file > is resurfaced but again FMODE_CREATED isn't used and it would be very > misleading if it did. > > Link: https://github.com/systemd/systemd/blob/11d5e2b5fbf9f6bfa5763fd45b56829ad4f0777f/src/basic/fs-util.c#L1078 [1] > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> Looks like a sensible functionality. I agree with Jeff that the behavior wrt dup(2) / fork(2) needs to be clearly stated but is hardly surprising given how everything else with file descriptors and descriptions works. So feel free to add: Reviewed-by: Jan Kara <jack@xxxxxxx> Honza > --- > fs/fcntl.c | 10 ++++++++++ > include/uapi/linux/fcntl.h | 3 +++ > 2 files changed, 13 insertions(+) > > diff --git a/fs/fcntl.c b/fs/fcntl.c > index 300e5d9ad913..55a66ad9b432 100644 > --- a/fs/fcntl.c > +++ b/fs/fcntl.c > @@ -343,6 +343,12 @@ static long f_dupfd_query(int fd, struct file *filp) > return f.file == filp; > } > > +/* Let the caller figure out whether a given file was just created. */ > +static long f_created(const struct file *filp) > +{ > + return !!(filp->f_mode & FMODE_CREATED); > +} > + > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > struct file *filp) > { > @@ -352,6 +358,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > long err = -EINVAL; > > switch (cmd) { > + case F_CREATED: > + err = f_created(filp); > + break; > case F_DUPFD: > err = f_dupfd(argi, filp, 0); > break; > @@ -463,6 +472,7 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > static int check_fcntl_cmd(unsigned cmd) > { > switch (cmd) { > + case F_CREATED: > case F_DUPFD: > case F_DUPFD_CLOEXEC: > case F_DUPFD_QUERY: > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index c0bcc185fa48..d78a6c237688 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -16,6 +16,9 @@ > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) > > +/* Was the file just created? */ > +#define F_CREATED (F_LINUX_SPECIFIC_BASE + 4) > + > /* > * Cancel a blocking posix lock; internal use only until we expose an > * asynchronous lock api to userspace: > > -- > 2.43.0 > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR