On Wed, Apr 3, 2024 at 8:02 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Tue, Apr 02, 2024 at 09:39:57AM +0000, Alice Ryhl wrote: > > Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote: > > >> Christian Brauner <brauner@xxxxxxxxxx> wrote: > > >>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote: > > >>>> Christian Brauner <brauner@xxxxxxxxxx> wrote: > > >>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote: > > >>>>>> +/// Wraps the kernel's `struct file`. > > >>>>>> +/// > > >>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference > > >>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file > > >>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced > > >>>>>> +/// by multiple file descriptors. > > >>>>>> +/// > > >>>>>> +/// # Refcounting > > >>>>>> +/// > > >>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the > > >>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a > > >>>>>> +/// pointer that owns a reference count on the file. > > >>>>>> +/// > > >>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct > > >>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't > > >>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in > > >>>>>> +/// `struct files_struct` are `ARef<File>` pointers. > > >>>>>> +/// > > >>>>>> +/// ## Light refcounts > > >>>>>> +/// > > >>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a > > >>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with > > >>>>>> +/// `fdput`. The idea behind light refcounts is that if the fd is not closed between the calls to > > >>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct > > >>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the > > >>>>>> +/// file even if `fdget` does not increment the refcount. > > >>>>>> +/// > > >>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all > > >>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are > > >>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures > > >>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore, > > >>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light > > >>>>>> +/// refcount. > > >>>>> > > >>>>> When the fdget() calling task doesn't have a shared file descriptor > > >>>>> table fdget() will not increment the reference count, yes. This > > >>>>> also implies that you cannot have task A use fdget() and then pass > > >>>>> f.file to task B that holds on to it while A returns to userspace. It's > > >>>>> irrelevant that task A won't drop the reference count or that task B > > >>>>> won't drop the reference count. Because task A could return back to > > >>>>> userspace and immediately close the fd via a regular close() system call > > >>>>> at which point task B has a UAF. In other words a file that has been > > >>>>> gotten via fdget() can't be Send to another task without the Send > > >>>>> implying taking a reference to it. > > >>>> > > >>>> That matches my understanding. > > >>>> > > >>>> I suppose that technically you can still send it to another thread *if* you > > >>>> ensure that the current thread waits until that other thread stops using the > > >>>> file before returning to userspace. > > >>> > > >>> _Technically_ yes, but it would be brittle as hell. The problem is that > > >>> fdget() _relies_ on being single-threaded for the time that fd is used > > >>> until fdput(). There's locking assumptions that build on that e.g., for > > >>> concurrent read/write. So no, that shouldn't be allowed. > > >>> > > >>> Look at how this broke our back when we introduced pidfd_getfd() where > > >>> we steal an fd from another task. I have a lengthy explanation how that > > >>> can be used to violate our elided-locking which is based on assuming > > >>> that we're always single-threaded and the file can't be suddenly shared > > >>> with another task. So maybe doable but it would make the semantics even > > >>> more intricate. > > >> > > >> Hmm, the part about elided locking is surprising to me, and may be an > > >> issue. Can you give more details on that? Because the current > > > > > > So what I referred to was that we do have fdget_pos(). Roughly, if > > > there's more than one reference on the file then we need to acquire a > > > mutex but if it's only a single reference then we can avoid taking the > > > mutex because we know that we're the only one that has a reference to > > > that file and no one else can acquire one. Whether or not that mutex was > > > taken is taken track of in struct fd. > > > > > > So you can't share a file after fdget_pos() has been called on it and > > > you haven't taken the position mutex. So let's say you had: > > > > > > * Tread A that calls fdget_pos() on file1 and the reference count is > > > one. So Thread A doesn't acquire the file position mutex for file1. > > > * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on > > > it and now Thread B does some operation that requires the file > > > position mutex. > > > => Thread A and Thread B race on the file position. > > > > > > So just because you have a reference to a file from somewhere it doesn't > > > mean you can just share it with another thread. > > > > > > So if yo have an arbitrary reference to a file in Rust and that somehow > > > can be shared with another thread you risk races here. > > > > > >> abstractions here *do* actually allow what I described, since we > > >> implement Sync for File. > > >> > > >> I'm not familiar with the pidfd_getfd discussion you are referring to. > > >> Do you have a link? > > > > > > https://lore.kernel.org/linux-fsdevel/20230724-vfs-fdget_pos-v1-1-a4abfd7103f3@xxxxxxxxxx > > > > > > pidfd_getfd() can be used to steal a file descriptor from another task. > > > It's like a non-cooperative SCM_RIGHTS. That means you can have exactly > > > the scenario described above where a file assumed to be non-shared is > > > suddenly shared and you have racing reads/writes. > > > > > > For readdir we nowadays always take the file position mutex because of > > > the pidfd_getfd() business because that might corrupt internal state. > > > > > >> > > >> I'm thinking that we may have to provide two different `struct file` > > >> wrappers to accurately model this API in Rust. Perhaps they could be > > >> called File and LocalFile, where one is marked as thread safe and the > > >> other isn't. I can make all LocalFile methods available on File to avoid > > >> having to duplicate methods that are available on both. > > > > > > But isn't that just struct file and struct fd? Ideally we'd stay close > > > to something like this. > > > > Right, that kind of naming seems sensible. But I still need to > > understand the details a bit better. See below on fdget_pos. > > > > >> But it's not clear to me that this is even enough. Even if we give you a > > >> &LocalFile to prevent you from moving it across threads, you can just > > >> call File::fget to get an ARef<File> to the same file and then move > > >> *that* across threads. > > > > > > Yes, absolutely. > > > > One of my challenges is that Binder wants to call File::fget, > > immediately move it to another thread, and then call fd_install. And > > it would be pretty unfortunate if that requires unsafe. But like I argue > > below, it seems hard to design a safe API for this in the face of > > fdget_pos. > > > > >> This kind of global requirement is not so easy to model. Maybe klint [1] > > >> could do it ... atomic context violations are a similar kind of global > > >> check. But having klint do it would be far out. > > >> > > >> Or maybe File::fget should also return a LocalFile? > > >> > > >> But this raises a different question to me. Let's say process A uses > > >> Binder to send its own fd to process B, and the following things happen: > > >> > > >> 1. Process A enters the ioctl and takes fdget on the fd. > > >> 2. Process A calls fget on the same fd to send it to another process. > > >> 3. Process A goes to sleep, waiting for process B to respond. > > >> 4. Process B receives the message, installs the fd, and returns to userspace. > > >> 5. Process B responds to the transaction, but does not close the fd. > > > > > > The fd just installed in 4. and the fd you're referring to in 5. are > > > identical, right? IOW, we're not talking about two different fd (dup) > > > for the same file, right? > > > > I'm referring to whatever fd_install does given the `struct file` I got > > from fget in step 2. > > > > >> 6a. Process A finishes sleeping, and returns to userspace from the ioctl. > > >> 6b. Process B tries to do an operation (e.g. read) on the fd. > > >> > > >> Let's say that 6a and 6b run in parallel. > > >> > > >> Could this potentially result in a data race between step 6a and 6b? I'm > > >> guessing that step 6a probably doesn't touch any of the code that has > > >> elided locking assumptions, so in practice I guess there's not a problem > > >> ... but if you make any sort of elided locking assumption as you exit > > >> from the ioctl (before reaching the fdput), then it seems to me that you > > >> have a problem. > > > > > > Yes, 6a doesn't touch any code that has elided locking assumptions. > > > > > > 1'. Process A enters the ioctl and takes fdget() on the fd. Process A > > > holds the only reference to that file and the file descriptor table > > > isn't shared. Therefore, f_count is left untouched and remains at 1. > > > 2'. Process A calls fget() which unconditionally bumps f_count bringing > > > it to 2 and sending it another process (Presumably you intend to > > > imply that this reference is now owned by the second process.). > > > 3'. [as 3.] > > > 4'. Process B installs the file into it's file descriptor table > > > consuming that reference from 2'. The f_count remains at 2 with the > > > reference from 2' now being owned by Process B. > > > 5'. Since Process B isn't closing the fd and has just called > > > fd_install() it returns to userspace with f_count untouched and > > > still at 2. > > > 6'a. Process A finishes sleeping and returns to userspace calling > > > fdput(). Since the original fdget() was done without bumping the > > > reference count the fdput() of Process A will not decrement the > > > reference count. So f_count remains at 2. > > > 6'b. Process B performs a read/write syscall and calls fdget_pos(). > > > fdget_pos() sees that this file has f_count > 1 and takes the > > > file position mutex. > > > > > > So this isn't a problem. The problem is when a file becomes shared > > > implicitly without the original owner of the file knowing. > > > > Hmm. Yes, but the ioctl code that called fdget doesn't really know that > > the ioctl shared the file? So why is it okay? > > Why does it matter to the ioctl() code itself? The ioctl() code never > calls fdget_pos(). > > > > > It really seems like there are two different things going on here. When > > it comes to fdget, we only really care about operations that could > > remove it from the local file descriptor table, since fdget relies on > > the refcount in that table remaining valid until fdput. > > Yes. > > > > > On the other hand, for fdget_pos it also matters whether it gets > > installed in other file descriptor tables. Threads that reference it > > through a different fd table will still access the same position. > > Yes, they operate on the same f_pos. > > > > > And so this means that between fdget/fdput, there's never any problem > > with installing the `struct file` into another file descriptor table. > > Nothing you can do from that other fd table could cause the local fd > > table to drop its refcount on the file. Whereas such an install can be > > a problem between fdget_pos/fdput_pos, since that could introduce a race > > on the position. > > > > Is this correct? > > Yes, but that would imply you're sharing and installing a file into a > file descriptor table from a read/write/seek codepath. I don't see how > this can happen without something like e.g., pidfd_getfd(). And the > fd_install()ing task would then have to go back to userspace and issue a > concurrent read/write/seek system call while the other thread is still > reading/writing. > > Overall, we really only care about f_pos consistency because posix > requires atomicity between reads/writes/seeks. For pidfd_getfd() where > such sharing can happen non-cooperatively we just don't care as we've > just declared this to be an instance where we're outside of posix > guarantees. And for readdir() we unconditionally acquire the mutex. > > I think io_uring is racing on f_pos as well under certain circumstances > (REQ_F_CUR_POS?) as they don't use fdget_pos() at all. But iirc Jens > dislikes that they ever allowed that. > > > > > I was thinking that if we have some sort of File/LocalFile distinction > > (or File/Fd), then we may be able to get it to work by limiting what a > > File can do. For example, let's say that the only thing you can do with > > a File is install it into fd tables, then by the previous logic, there's > > no problem with it being safe to move across threads even if there's an > > active fdget. > > > > But the fdget_pos kind of throws a wrench into that, because now I can > > no longer say "it's always safe to do File::fget, move it to another > > thread, and install it into the remote fd table", since that could cause > > races on the position if there's an active fdget_pos when we call > > File::fget. > > I think I understand why that's a problem for you but let me try and > spell it out so I can understand where you're coming from: > > You want the Rust compiler to reject a file becoming shared implicitly > from any codepath that is beneath fdget_pos() even though no such code > yet exists (ignoring pidfd_getfd()). So it's a correctness issue to you. Yes, exactly. One of the design principles behind Rust is that if an API allows you to trigger memory unsafety, then it *must* be the case that you called an unsafe function somewhere, and that this call violated the documented safety requirements for that unsafe function. So if the File API provides a safe interface that you can use to trigger memory unsafety, then that's a correctness issue for me. Thanks for asking this. I should have clarified this previously. > > >>>>>> +/// > > >>>>>> +/// Light reference counts must be released with `fdput` before the system call returns to > > >>>>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then > > >>>>>> +/// all light refcounts that existed at the time have gone away. > > >>>>>> +/// > > >>>>>> +/// ## Rust references > > >>>>>> +/// > > >>>>>> +/// The reference type `&File` is similar to light refcounts: > > >>>>>> +/// > > >>>>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference > > >>>>>> +/// count stays positive, and can only be created when there is some mechanism in place to ensure > > >>>>>> +/// this. > > >>>>>> +/// > > >>>>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which > > >>>>>> +/// a `&File` is created outlives the `&File`. > > >>>>> > > >>>>> The section confuses me a little: Does the borrow-checker always ensure > > >>>>> that a &File stays valid or are there circumstances where it doesn't or > > >>>>> are you saying it doesn't enforce it? > > >>>> > > >>>> The borrow-checker always ensures it. > > >>> > > >>> Ok, thanks. > > >>> > > >>>> > > >>>> A &File is actually short-hand for &'a File, where 'a is some > > >>>> unspecified lifetime. We say that &'a File is annotated with 'a. The > > >>>> borrow-checker rejects any code that tries to use a reference after the > > >>>> end of the lifetime annotated on it. > > >>> > > >>> Thanks for the explanation. > > >>> > > >>>> > > >>>> So as long as you annotate the reference with a sufficiently short > > >>>> lifetime, the borrow checker will prevent UAF. And outside of cases like > > >>> > > >>> Sorry, but can you explain "sufficiently short lifetime"? > > >> > > >> By "sufficiently short lifetime" I mean "lifetime that ends before the > > >> object is destroyed". > > > > > > Ah, ok. It sounded like it was a specific concept that Rust is > > > implementing in contrast to long-term lifetime or sm. Thanks! > > > > > >> > > >> Idea being that if the lifetime ends before the object is freed, and the > > >> borrow-checker rejects attempts to use it after the lifetime ends, then > > >> it follows that the borrow-checker prevents use-after-frees. > > >> > > >>>> from_ptr, the borrow-checker also takes care of ensuring that the > > >>>> lifetimes are sufficiently short. > > >>>> > > >>>> (Technically &'a File and &'b File are two completely different types, > > >>>> so &File is technically a class of types and not a single type. Rust > > >>>> will automatically convert &'long File to &'short File.) > > >>>> > > >>>>>> +/// > > >>>>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the > > >>>>>> +/// `&File` only exists while the reference count is positive. > > >>>>> > > >>>>> What is this used for in binder? If it's not used don't add it. > > >>>> > > >>>> This is used on the boundary between the Rust part of Binder and the > > >>>> binderfs component that is implemented in C. For example: > > >>> > > >>> I see, I'm being foiled by my own code... > > >>> > > >>>> > > >>>> unsafe extern "C" fn rust_binder_open( > > >>>> inode: *mut bindings::inode, > > >>>> file_ptr: *mut bindings::file, > > >>>> ) -> core::ffi::c_int { > > >>>> // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a > > >>>> // successful call to `rust_binder_new_device`. > > >>>> let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) }; > > >>>> > > >>>> // SAFETY: The caller provides a valid file pointer to a new `struct file`. > > >>>> let file = unsafe { File::from_ptr(file_ptr) }; > > >>> > > >>> We need a better name for this helper than from_ptr() imho. I think > > >>> from_ptr() and as_ptr() is odd for C. How weird would it be to call > > >>> that from_raw_file() and as_raw_file()? > > >> > > >> That's a reasonable name. I would be happy to rename to that, and I > > >> don't think it is unidiomatic. > > > > > > Thanks! > > > > > >> > > >>> But bigger picture I somewhat struggle with the semantics of this > > >>> because this is not an interface that we have in C and this is really > > >>> about a low-level contract between C and Rust. Specifically this states > > >>> that this pointer is _somehow_ guaranteed valid. And really, this seems > > >>> a bit of a hack. > > >> > > >> Indeed ... but I think it's a quite common hack. After all, any time you > > >> dereference a raw pointer in Rust, you are making the same assumption. > > >> > > >>> Naively, I think this should probably not be necessary if > > >>> file_operations are properly wrapped. Or it should at least be demotable > > >>> to a purely internal method that can't be called directly or something. > > >> > > >> Yes, the usage here of File::from_ptr could probably be hidden inside a > > >> suitably designed file_operations wrapper. The thing is, Rust Binder > > >> doesn't currently use such a wrapper at all. It just exports a global of > > >> type file_operations and the C code in binderfs then references that > > >> global. > > > > > > Yeah. > > > > > >> > > >> Rust Binder used to use such an abstraction, but I ripped it out before > > >> sending the Rust Binder RFC because it didn't actually help. It was > > >> designed for cases where the file system is also implemented in Rust, so > > >> to get it to expose a file_operations global to the C code in binderfs, > > >> I had to reach inside its internal implementation. It did not save me > > >> from doing stuff such as using File::from_ptr from Binder. > > >> > > >> Now, you could most likely modify those file_operations abstractions to > > >> support my use-case better. But calling into C is already unsafe, so > > >> unless we get multiple drivers that have a similar C/Rust split, it's > > >> not clear that it's useful to extract the logic from Binder. I would > > >> prefer to wait for the file_operations abstraction to get upstreamed by > > >> the people working on VFS bindings, and then we can decide whether we > > >> should rewrite binderfs into Rust and get rid of the logic, or whether > > >> it's worth to expand the file_operations abstraction to support split > > >> C/Rust drivers like the current binderfs. > > >> > > >>> So what I mean is. fdget() may or may not take a reference. The C > > >>> interface internally knows whether a reference is owned or not by > > >>> abusing the lower two bits in a pointer to keep track of that. Naively, > > >>> I would expect the same information to be available to rust so that it's > > >>> clear to Rust wheter it's dealing with an explicitly referenced file or > > >>> an elided-reference file. Maybe that's not possible and I'm not > > >>> well-versed enough to see that yet. > > >> > > >> I'm sure Rust can access the same information, but I don't think I'm > > >> currently doing anything that cares about the distinction? > > > > > > Ok. My main goal is that we end up with an almost 1:1 correspondence > > > between the Rust and C interface so it's easy for current maintainers > > > and developers that don't want to care about Rust to continue to do so > > > and also just somewhat verify that changes they do are sane. > > > > Sure, that goal makes total sense to me. > > > > Alice