On Wed 28-08-24 20:19:42, Aleksa Sarai wrote: > Unfortunately, the way we have gone about adding new AT_* flags has > been a little messy. In the beginning, all of the AT_* flags had generic > meanings and so it made sense to share the flag bits indiscriminately. > However, we inevitably ran into syscalls that needed their own > syscall-specific flags. Due to the lack of a planned out policy, we > ended up with the following situations: > > * Existing syscalls adding new features tended to use new AT_* bits, > with some effort taken to try to re-use bits for flags that were so > obviously syscall specific that they only make sense for a single > syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). > > Given the constraints of bitflags, this works well in practice, but > ideally (to avoid future confusion) we would plan ahead and define a > set of "per-syscall bits" ahead of time so that when allocating new > bits we don't end up with a complete mish-mash of which bits are > supposed to be per-syscall and which aren't. > > * New syscalls dealt with this in several ways: > > - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and > fspick(2)) created their separate own flag spaces that have no > overlap with the AT_* flags. Most of these ended up allocating > their bits sequentually. > > In the case of move_mount(2) and fspick(2), several flags have > identical meanings to AT_* flags but were allocated in their own > flag space. > > This makes sense for syscalls that will never share AT_* flags, but > for some syscalls this leads to duplication with AT_* flags in a > way that could cause confusion (if renameat2(2) grew a > RENAME_EMPTY_PATH it seems likely that users could mistake it for > AT_EMPTY_PATH since it is an *at(2) syscall). > > - Some syscalls unfortunately ended up both creating their own flag > space while also using bits from other flag spaces. The most > obvious example is open_tree(2), where the standard usage ends up > using flags from *THREE* separate flag spaces: > > open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); > > (Note that O_CLOEXEC is also platform-specific, so several future > OPEN_TREE_* bits are also made unusable in one fell swoop.) > > It's not entirely clear to me what the "right" choice is for new > syscalls. Just saying that all future VFS syscalls should use AT_* flags > doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which > don't make much sense to burn generic AT_* flags for) and move_mount(2) > has separate AT_*-like flags for both the source and target so separate > flags are needed anyway (though it seems possible that renameat2(2) > could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame > they can't be reused). > > But at least for syscalls that _do_ choose to use AT_* flags, we should > explicitly state the policy that 0x2ff is currently intended for > per-syscall flags and that new flags should err on the side of > overlapping with existing flag bits (so we can extend the scope of > generic flags in the future if necessary). > > And add AT_* aliases for the RENAME_* flags to further cement that > renameat2(2) is an *at(2) flag, just with its own per-syscall flags. > > Suggested-by: Amir Goldstein <amir73il@xxxxxxxxx> > Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx> > Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx> > Signed-off-by: Aleksa Sarai <cyphar@xxxxxxxxxx> Looks good. Feel free to add: Reviewed-by: Jan Kara <jack@xxxxxxx> Honza > --- > include/uapi/linux/fcntl.h | 80 ++++++++++++++------- > tools/perf/trace/beauty/include/uapi/linux/fcntl.h | 83 +++++++++++++++------- > 2 files changed, 115 insertions(+), 48 deletions(-) > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index e55a3314bcb0..38a6d66d9e88 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -90,37 +90,69 @@ > #define DN_ATTRIB 0x00000020 /* File changed attibutes */ > #define DN_MULTISHOT 0x80000000 /* Don't remove notifier */ > > +#define AT_FDCWD -100 /* Special value for dirfd used to > + indicate openat should use the > + current working directory. */ > + > + > +/* Generic flags for the *at(2) family of syscalls. */ > + > +/* Reserved for per-syscall flags 0xff. */ > +#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic > + links. */ > +/* Reserved for per-syscall flags 0x200 */ > +#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ > +#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount > + traversal. */ > +#define AT_EMPTY_PATH 0x1000 /* Allow empty relative > + pathname to operate on dirfd > + directly. */ > /* > - * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is > - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to > - * unlinkat. The two functions do completely different things and therefore, > - * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to > - * faccessat would be undefined behavior and thus treating it equivalent to > - * AT_EACCESS is valid undefined behavior. > + * These flags are currently statx(2)-specific, but they could be made generic > + * in the future and so they should not be used for other per-syscall flags. > */ > -#define AT_FDCWD -100 /* Special value used to indicate > - openat should use the current > - working directory. */ > -#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ > +#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ > +#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ > +#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ > +#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ > + > +#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ > + > +/* > + * Per-syscall flags for the *at(2) family of syscalls. > + * > + * These are flags that are so syscall-specific that a user passing these flags > + * to the wrong syscall is so "clearly wrong" that we can safely call such > + * usage "undefined behaviour". > + * > + * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value. > + * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful > + * only to unlinkat. The two functions do completely different things and > + * therefore, the flags can be allowed to overlap. For example, passing > + * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it > + * equivalent to AT_EACCESS is valid undefined behavior. > + * > + * Note for implementers: When picking a new per-syscall AT_* flag, try to > + * reuse already existing flags first. This leaves us with as many unused bits > + * as possible, so we can use them for generic bits in the future if necessary. > + */ > + > +/* Flags for renameat2(2) (must match legacy RENAME_* flags). */ > +#define AT_RENAME_NOREPLACE 0x0001 > +#define AT_RENAME_EXCHANGE 0x0002 > +#define AT_RENAME_WHITEOUT 0x0004 > + > +/* Flag for faccessat(2). */ > #define AT_EACCESS 0x200 /* Test access permitted for > effective IDs, not real IDs. */ > +/* Flag for unlinkat(2). */ > #define AT_REMOVEDIR 0x200 /* Remove directory instead of > unlinking file. */ > -#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ > -#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ > -#define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ > - > -#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ > -#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ > -#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ > -#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ > - > -#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ > +/* Flags for name_to_handle_at(2). */ > +#define AT_HANDLE_FID 0x200 /* File handle is needed to compare > + object identity and may not be > + usable with open_by_handle_at(2). */ > > -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ > -#define AT_HANDLE_FID AT_REMOVEDIR /* file handle is needed to > - compare object identity and may not > - be usable to open_by_handle_at(2) */ > #if defined(__KERNEL__) > #define AT_GETATTR_NOSEC 0x80000000 > #endif > diff --git a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h > index c0bcc185fa48..38a6d66d9e88 100644 > --- a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h > +++ b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h > @@ -16,6 +16,9 @@ > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) > > +/* Was the file just created? */ > +#define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4) > + > /* > * Cancel a blocking posix lock; internal use only until we expose an > * asynchronous lock api to userspace: > @@ -87,37 +90,69 @@ > #define DN_ATTRIB 0x00000020 /* File changed attibutes */ > #define DN_MULTISHOT 0x80000000 /* Don't remove notifier */ > > +#define AT_FDCWD -100 /* Special value for dirfd used to > + indicate openat should use the > + current working directory. */ > + > + > +/* Generic flags for the *at(2) family of syscalls. */ > + > +/* Reserved for per-syscall flags 0xff. */ > +#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic > + links. */ > +/* Reserved for per-syscall flags 0x200 */ > +#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ > +#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount > + traversal. */ > +#define AT_EMPTY_PATH 0x1000 /* Allow empty relative > + pathname to operate on dirfd > + directly. */ > +/* > + * These flags are currently statx(2)-specific, but they could be made generic > + * in the future and so they should not be used for other per-syscall flags. > + */ > +#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ > +#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ > +#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ > +#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ > + > +#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ > + > /* > - * The constants AT_REMOVEDIR and AT_EACCESS have the same value. AT_EACCESS is > - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to > - * unlinkat. The two functions do completely different things and therefore, > - * the flags can be allowed to overlap. For example, passing AT_REMOVEDIR to > - * faccessat would be undefined behavior and thus treating it equivalent to > - * AT_EACCESS is valid undefined behavior. > + * Per-syscall flags for the *at(2) family of syscalls. > + * > + * These are flags that are so syscall-specific that a user passing these flags > + * to the wrong syscall is so "clearly wrong" that we can safely call such > + * usage "undefined behaviour". > + * > + * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value. > + * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful > + * only to unlinkat. The two functions do completely different things and > + * therefore, the flags can be allowed to overlap. For example, passing > + * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it > + * equivalent to AT_EACCESS is valid undefined behavior. > + * > + * Note for implementers: When picking a new per-syscall AT_* flag, try to > + * reuse already existing flags first. This leaves us with as many unused bits > + * as possible, so we can use them for generic bits in the future if necessary. > */ > -#define AT_FDCWD -100 /* Special value used to indicate > - openat should use the current > - working directory. */ > -#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */ > + > +/* Flags for renameat2(2) (must match legacy RENAME_* flags). */ > +#define AT_RENAME_NOREPLACE 0x0001 > +#define AT_RENAME_EXCHANGE 0x0002 > +#define AT_RENAME_WHITEOUT 0x0004 > + > +/* Flag for faccessat(2). */ > #define AT_EACCESS 0x200 /* Test access permitted for > effective IDs, not real IDs. */ > +/* Flag for unlinkat(2). */ > #define AT_REMOVEDIR 0x200 /* Remove directory instead of > unlinking file. */ > -#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ > -#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount traversal */ > -#define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */ > - > -#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */ > -#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */ > -#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ > -#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ > - > -#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ > +/* Flags for name_to_handle_at(2). */ > +#define AT_HANDLE_FID 0x200 /* File handle is needed to compare > + object identity and may not be > + usable with open_by_handle_at(2). */ > > -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */ > -#define AT_HANDLE_FID AT_REMOVEDIR /* file handle is needed to > - compare object identity and may not > - be usable to open_by_handle_at(2) */ > #if defined(__KERNEL__) > #define AT_GETATTR_NOSEC 0x80000000 > #endif > > -- > 2.46.0 > -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR