Hi, Ok, this is more complicated than I thought, again. On 6/29/20 9:53 AM, Johannes Thumshirn wrote: > With the recent addition of filesystem checksum types other than CRC32c, > it is not anymore hard-coded which checksum type a btrfs filesystem uses. > > Up to now there is no good way to read the filesystem checksum, apart from > reading the filesystem UUID and then query sysfs for the checksum type. > > Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl > command which usually is used to query filesystem features. Also add a > flags member indicating that the kernel responded with a set csum_type and > csum_size field. > > For compatibility reasons, only return the csum_type and csum_size if the > BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE flag was passed to the kernel. No, that's not for 'compatibility reasons'. It's to allow "fine grained data retrieval". My question for David is still relevant: "If we think that with whatever being added in the future the output will still only contain a bunch of values that are very cheap to collect when filling the response buffer, then why would you still want to have this?" (The 'for compatibility reason' is about the V1 and V2 of the patch behavior, that sets the very useful flags output field so that the caller knows which parts of the result actually have a meaning.) ---- 8< ---- The thing is that accepting input opens cans (tm) of worms (tm). First of all, there is the issue of the lack of checking for a zeroed buffer from user space. As we found out, that's actually a bigger problem for an ioctl where the flags (and garbage in them - possibly silently) modify behavior of execution and subsequent result (LOGICAL_INO case) than in the case of just a data collecting ioctl which returns a bunch of values: garbage in, garbage out, and the (old) calling code is probably ignoring both input and output fields. However, demanding unused fields (and unused input flags) to be zero (like LOGICAL_INO_V2 does now) has an interesting side effect... If I ask for flags 11110011011101111, and I get -EINVAL, then apparently I'm running some kernel that is missing support for one of the bits... ¯\_(ツ)_/¯ Good luck finding out which one of the 1s is problematic. Try all permutations? Try them one by one? Or was there any other path in the code that returned the -EINVAL? (We had a discussion about this on IRC earlier this evening.) So, to help user space a bit, you could then use a flags /* out */ field that, in combination with -EINVAL will contain the flags of all functionality that the running kernel can actually support. Then the calling code can decide if there's something acceptable in there, or to error out and tell the user to upgrade. This needs to be there from the early days of the ioctl of course, or when starting to use more previously unused fields that were properly zeroed out in the output before. Also, using the same flags field for input as well as for output seems to be clever because it saves some bytes which can be used for other stuff in the future, but an interesting remark made was that if anyone ever wants to write support for a tool like strace for the ioctl, then it would be oh so nice if that could be stateless, so it can tell you something like "ha! the user requested features WXYZ and it was reported back that only X and Z are supported", instead of having to jump through inconvenient hoops trying to save state from earlier data seen flying by. I know that in general ioctl stuff is a big mess already, but it would be fun to think of things that can be done to decrease the rate in which it's getting even worse, for example by collecting some best practices and been-there-done-that-oops examples in some guide or whatever... "Ok, I have this input buffer which was never checked, what options do I have left before having to do _V+1?". And I do not think that should be done in a hurry while we want to have this patch in now. ---- >8 ---- So, (apologies if I'm already causing you a headache right at the beginning of the week), I'd say... Let's for now park everything I just wrote and drop the whole idea of using input flags for FS_INFO, and keep it output-only and just properly document the output values that have a meaning with the flags bits. Because at least we know that the unused output fields were zeroed before (phew!!). > Also > clear any unknown flags so we don't pass false positives to user-space > newer than the kernel. ^^ about the /* in/out */ > To simplify further additions to the ioctl, also switch the padding to a > u8 array. Pahole was used to verify the result of this switch: > > pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko > struct btrfs_ioctl_fs_info_args { > __u64 max_id; /* 0 8 */ > __u64 num_devices; /* 8 8 */ > __u8 fsid[16]; /* 16 16 */ > __u32 nodesize; /* 32 4 */ > __u32 sectorsize; /* 36 4 */ > __u32 clone_alignment; /* 40 4 */ > __u32 flags; /* 44 4 */ > __u16 csum_type; /* 48 2 */ > __u16 csum_size; /* 50 2 */ > __u8 reserved[972]; /* 52 972 */ > > /* size: 1024, cachelines: 16, members: 10 */ > }; > > Fixes: 3951e7f050ac ("btrfs: add xxhash64 to checksumming algorithms") > Fixes: 3831bf0094ab ("btrfs: add sha256 to checksumming algorithm") > CC: stable@xxxxxxxxxxxxxxx # 5.5+ > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx> > --- > Changes to v3: > * make flags in/out (David) > * make csum return opt-in (Hans) > > Changes to v2: > * add additional csum_size (David) > * rename flag value to BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE to reflect > additional size > > Changes to v1: > * add 'out' comment to be consistent (Hans) > * remove le16_to_cpu() (kbuild robot) > * switch padding to be all u8 (David) > --- > fs/btrfs/ioctl.c | 16 +++++++++++++--- > include/uapi/linux/btrfs.h | 14 ++++++++++++-- > 2 files changed, 25 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index b3e4c632d80c..4d70b918f656 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -3198,11 +3198,15 @@ static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info, > struct btrfs_ioctl_fs_info_args *fi_args; > struct btrfs_device *device; > struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; > + u32 inflags; > int ret = 0; > > - fi_args = kzalloc(sizeof(*fi_args), GFP_KERNEL); > - if (!fi_args) > - return -ENOMEM; > + fi_args = memdup_user(arg, sizeof(*fi_args)); > + if (IS_ERR(fi_args)) > + return PTR_ERR(fi_args); > + > + inflags = fi_args->flags; > + fi_args->flags = 0; Now you got the flags, but the user provided contents are still present in parts of fi_args that are not going to be overwritten. I don't think that's what you would have wanted. What about zeroing the whole fi_args again after getting the input args out? At least in the past all unused output was zeroed out. Now it's garbage in, garbage out. But if we go back to out-only, that's not relevant any more. :) > rcu_read_lock(); > fi_args->num_devices = fs_devices->num_devices; > @@ -3218,6 +3222,12 @@ static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info, > fi_args->sectorsize = fs_info->sectorsize; > fi_args->clone_alignment = fs_info->sectorsize; > > + if (inflags & BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE) { > + fi_args->csum_type = btrfs_super_csum_type(fs_info->super_copy); > + fi_args->csum_size = btrfs_super_csum_size(fs_info->super_copy); > + fi_args->flags |= BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE; > + } > + > if (copy_to_user(arg, fi_args, sizeof(*fi_args))) > ret = -EFAULT; > > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h > index e6b6cb0f8bc6..c130eaea416e 100644 > --- a/include/uapi/linux/btrfs.h > +++ b/include/uapi/linux/btrfs.h > @@ -250,10 +250,20 @@ struct btrfs_ioctl_fs_info_args { > __u32 nodesize; /* out */ > __u32 sectorsize; /* out */ > __u32 clone_alignment; /* out */ > - __u32 reserved32; > - __u64 reserved[122]; /* pad to 1k */ > + __u32 flags; /* in/out */ > + __u16 csum_type; /* out */ > + __u16 csum_size; /* out */ > + __u8 reserved[972]; /* pad to 1k */ > }; > > +/* > + * fs_info ioctl flags > + * > + * Used by: > + * struct btrfs_ioctl_fs_info_args > + */ > +#define BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE (1 << 0) > + > /* > * feature flags > * > K