On Sat, Jun 27, 2020 at 3:49 AM David Laight <David.Laight@xxxxxxxxxx> wrote: > > > Just keep the existing "set_fs()". It's not harmful if it's only used > > occasionally. We should rename it once it's rare enough, though. > > Am I right in thinking that it just sets a flag in 'current' ? Basically, yes. That's what it has always done. Well "always" is not true - it used to set the %fs segment register originally (thus the name), but _conceptually_ it sets a flag for "should user accesses be kernel accesses instead". On x86 - and most other architectures where user space and kernel space are in the same address space and accessed with the same instructions, that has then been implemented as just a "what is the limit for an access". On other architectures - architectures that need different access methods (or different flags to the load/store instruction) - it's an actual flag that changes which access method you use. > Although I don't remember access_ok() doing a suitable check > (would need to be (address - base) < limit). So again, on the architectures with a unified address space, access_ok() is exactly that "address + access_size <= limit", although often done with some inline asm just to get the overflow case done efficiently. On other architectures, there's no limit check, because _all_ addresses are either user space or kernel space addresses, and what changes isn't the address limit, but the access itself. So what I was suggesting is literally - keep this flag around as a flag - but make all _normal_ user accesses ignore it, and always do user accesses (so on a unified address space architecture like x86 it always checks the _fixed_ limit, and on something like sparc32 which has separate kernel and user address spaces, it just always does a user access with no conditionals at all) - then make the really odd and hopefully very rare cases check that flag explicitly and manually, and do if (current->legacy_uptr_is_kernel) memcpy(...); else copy_to/from_user(...); and my hope is that we'd have only a handful of cases (like the setsockopt thing: one for each protocol or whatever) that actually want this. Note that the legacy behavior would still remain in architectures that haven't been modified to remove the use of set_fs(), so I would further suggest that the two approaches live side-by-side for at least a while. But _generic_ code (and with Christoph's patches at least x86) would make set_fs() cause a build error. So we'd have a new set_force_kernel_pointers(); .... clear_force_kernel_pointers(); that would set/clear that 'current->legacy_uptr_is_kernel' variable, and we'd have a handful of places that would check it. The naming above is all random, and I'm not claiming that any of this is particularly _clean_. I'm also not claiming that it's really any better than our current "set_fs()" mess conceptually. The only thing that makes it better than our current "set_fs()" is - there would hopefully be very few cases of this - it would *not* affect random incidental user accesses that just happen to be in the shadow of this thing. That second point is the important one, I feel. The real problem with "set_fs()" has been that we've occasionally had bugs where we ended up running odd paths that we really didn't _intend_ to run with kernel pointers. The classic example is the SCSI "write as ioctl" example, where a write to a SCSI generic device would do various odd things and follow pointers and what-not. Then you get into real trouble when "splice()" ends up truiong to write a kernel buffer, and because of "set_fs()" suddenly the sg code started accessing kernel memory willy-nilly. So my suggestion was basically a new version of set_fs(), but one that is just much more targeted, and doesn't affect all random user accesses, only those very special ones that are then very *explicitly* aware of the fact that "hey, I might be called in this situation where I'm going to get a kernel address instead". > If that is needed (I presume it was added for a purpose) then all > the socket option code needs to be able to handle kernel buffers. So that's very much what I'd like to avoid. The plan would be that all the *normal* stuff would be handled by either (a) always having the data come from user space, or (b) the data has a known size (either fixed, or "optlen" or whatever) and then being copied to a kernel buffer and then always handled as a kernel field that bpf can then call with kernel data. I thought there was just one very specific case of "oh, in certain cases of setsockopt we don't know what size this address is and optlen is ignored", so we have to just pass the pointer down to the protocol, which is the point that knows how much of an address it wants.. Was that a misunderstanding on my part? Because if there are tons and tons of places that want this "either kernel or user" then we could still have a helper function for it, but it means that the whole "limit the cases" advantage to some degree goes away. It would still fix the 99% of normal "copy/from/to_user()" cases, though. They'd be fixed and "safe" and coule never ever touch kernel memory even if there was some confusion about things. So it would be an improvement, but I was really hoping that the cases where there can be confusion would be pretty rare. Linus