From: Linus Torvalds > Sent: 29 June 2020 18:03 > On Mon, Jun 29, 2020 at 8:29 AM Christoph Hellwig <hch@xxxxxx> wrote: > > > > So based on that I'd rather get away without our flag and tag the > > kernel pointer case in setsockopt explicitly. > > Yeah, I'd be ok to pass that kind of flag around for setsockopt, in > ways I _don't_ want to do for some very core vfs thing like 'read()'. > > That said, is there no practical limit on how big "optlen" can be? > Sure, I realize that a lot of setsockopt users may not use all of the > data, but let's say that "optlen" is 128, but the actual low-level > setsockopt operation only uses the first 16 bytes, maybe we could > always just copy the 128 bytes from user space into kernel space, and > just say "setsockopt() always gets a kernel pointer". > > Then the bpf use is even simpler. It would just pass the kernel > pointer natively. > > Because that seems to be what the BPF code really wants to do: it > takes the user optval, and munges it into a kernel optval, and then > (if that has been done) runs the low-level sock_setsockopt() under > KERNEL_DS. > > Couldn't we switch things around instead, and just *always* copy > things from user space, and sock_setsockopt (and > sock->ops->setsockopt) _always_ get a kernel buffer? > > And avoid the set_fs(KERNEL_DS) games entirely that way? I did a patch for SCTP to do the copies in the protocol wrapper. Apart from the issue of bad applications providing overlarge buffers and effecting a local DoS attack there were some odd issues: 1) SCTP completely abuses both setsockopt and getsockopt to perform additional socket operations. I suspect the original implementation didn't want to add new system calls. 2) SCTP treats getsockopt as RMW on the user buffer. Mostly it only needs 4 bytes, but it can in include a sockaddr_storage. 3) SCTP has one getsockopt that is really a setsockopt (ie changes things) but is implemented using getsockopt so that it can return a value. 4) One of the SCTP getsockopt calls has to return the 'wrong' value to userspace (ie not the length of the transferred data) for compatibility with the orginal broken code. I'm wondering if the [sg]etsockopt wrapper should actually pass through a structure containing: Kernel buffer address (on stack if short) User buffer address (may be NULL) Length of buffer copy_to_user length (normally zero) flag: embedded pointers are user/kernel Most code will just use the kernel buffer and return length/error. Code that knows the supplied length is invalid can use the user pointer - but only support direct user requests. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)