On Tue, Apr 02, 2024 at 09:39:57AM +0000, Alice Ryhl wrote:
> Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > On Mon, Apr 01, 2024 at 12:09:08PM +0000, Alice Ryhl wrote:
> >> Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >>> On Wed, Mar 20, 2024 at 06:09:05PM +0000, Alice Ryhl wrote:
> >>>> Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >>>>> On Fri, Feb 09, 2024 at 11:18:16AM +0000, Alice Ryhl wrote:
> >>>>>> +/// Wraps the kernel's `struct file`.
> >>>>>> +///
> >>>>>> +/// This represents an open file rather than a file on a filesystem. Processes generally reference
> >>>>>> +/// open files using file descriptors. However, file descriptors are not the same as files. A file
> >>>>>> +/// descriptor is just an integer that corresponds to a file, and a single file may be referenced
> >>>>>> +/// by multiple file descriptors.
> >>>>>> +///
> >>>>>> +/// # Refcounting
> >>>>>> +///
> >>>>>> +/// Instances of this type are reference-counted. The reference count is incremented by the
> >>>>>> +/// `fget`/`get_file` functions and decremented by `fput`. The Rust type `ARef<File>` represents a
> >>>>>> +/// pointer that owns a reference count on the file.
> >>>>>> +///
> >>>>>> +/// Whenever a process opens a file descriptor (fd), it stores a pointer to the file in its `struct
> >>>>>> +/// files_struct`. This pointer owns a reference count to the file, ensuring the file isn't
> >>>>>> +/// prematurely deleted while the file descriptor is open. In Rust terminology, the pointers in
> >>>>>> +/// `struct files_struct` are `ARef<File>` pointers.
> >>>>>> +///
> >>>>>> +/// ## Light refcounts
> >>>>>> +///
> >>>>>> +/// Whenever a process has an fd to a file, it may use something called a "light refcount" as a
> >>>>>> +/// performance optimization. Light refcounts are acquired by calling `fdget` and released with
> >>>>>> +/// `fdput`.
> >>>>>> +/// The idea behind light refcounts is that if the fd is not closed between the calls to
> >>>>>> +/// `fdget` and `fdput`, then the refcount cannot hit zero during that time, as the `struct
> >>>>>> +/// files_struct` holds a reference until the fd is closed. This means that it's safe to access the
> >>>>>> +/// file even if `fdget` does not increment the refcount.
> >>>>>> +///
> >>>>>> +/// The requirement that the fd is not closed during a light refcount applies globally across all
> >>>>>> +/// threads - not just on the thread using the light refcount. For this reason, light refcounts are
> >>>>>> +/// only used when the `struct files_struct` is not shared with other threads, since this ensures
> >>>>>> +/// that other unrelated threads cannot suddenly start using the fd and close it. Therefore,
> >>>>>> +/// calling `fdget` on a shared `struct files_struct` creates a normal refcount instead of a light
> >>>>>> +/// refcount.
> >>>>>
> >>>>> When the fdget() calling task doesn't have a shared file descriptor
> >>>>> table fdget() will not increment the reference count, yes. This
> >>>>> also implies that you cannot have task A use fdget() and then pass
> >>>>> f.file to task B that holds on to it while A returns to userspace. It's
> >>>>> irrelevant that task A won't drop the reference count or that task B
> >>>>> won't drop the reference count. Because task A could return back to
> >>>>> userspace and immediately close the fd via a regular close() system call
> >>>>> at which point task B has a UAF. In other words a file that has been
> >>>>> gotten via fdget() can't be Send to another task without the Send
> >>>>> implying taking a reference to it.
> >>>>
> >>>> That matches my understanding.
> >>>>
> >>>> I suppose that technically you can still send it to another thread *if* you
> >>>> ensure that the current thread waits until that other thread stops using the
> >>>> file before returning to userspace.
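The light-refcount rule quoted above can be modelled in plain userspace Rust, with `Arc` standing in for `f_count` and a `Vec` standing in for the fd table. This is an illustrative sketch only, not kernel code; the names `fget` and `fd_table` here are stand-ins for the C helpers being discussed:

```rust
use std::sync::Arc;

// Stand-in for `struct file`; `Arc`'s count plays the role of `f_count`.
struct File {}

// "fget": unconditionally bumps the reference count, like the C helper.
fn fget(table: &[Arc<File>], fd: usize) -> Arc<File> {
    Arc::clone(&table[fd])
}

fn main() {
    // The fd table owns one reference, like `struct files_struct`.
    let fd_table: Vec<Arc<File>> = vec![Arc::new(File {})];

    // "fdget" (light refcount): since this table is not shared with other
    // threads, the file can be borrowed without bumping the count. The
    // borrow must end ("fdput") before the entry can be removed ("close").
    let light: &File = &fd_table[0];
    assert_eq!(Arc::strong_count(&fd_table[0]), 1); // no refcount bump
    let _ = light;

    // "fget": an owned reference that stays valid even after a close.
    let owned = fget(&fd_table, 0);
    assert_eq!(Arc::strong_count(&owned), 2);
}
```

In this model, sending `light` to another thread while the table owner returns to userspace is exactly the UAF scenario described: the borrow has no count of its own, so it is only sound while the table entry is guaranteed to stay put.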
> >>>
> >>> _Technically_ yes, but it would be brittle as hell. The problem is that
> >>> fdget() _relies_ on being single-threaded for the time that fd is used
> >>> until fdput(). There are locking assumptions that build on that e.g., for
> >>> concurrent read/write. So no, that shouldn't be allowed.
> >>>
> >>> Look at how this broke our back when we introduced pidfd_getfd() where
> >>> we steal an fd from another task. I have a lengthy explanation how that
> >>> can be used to violate our elided-locking which is based on assuming
> >>> that we're always single-threaded and the file can't be suddenly shared
> >>> with another task. So maybe doable but it would make the semantics even
> >>> more intricate.
> >>
> >> Hmm, the part about elided locking is surprising to me, and may be an
> >> issue. Can you give more details on that? Because the current
> >
> > So what I referred to was that we do have fdget_pos(). Roughly, if
> > there's more than one reference on the file then we need to acquire a
> > mutex but if it's only a single reference then we can avoid taking the
> > mutex because we know that we're the only one that has a reference to
> > that file and no one else can acquire one. Whether or not that mutex was
> > taken is kept track of in struct fd.
> >
> > So you can't share a file after fdget_pos() has been called on it and
> > you haven't taken the position mutex. So let's say you had:
> >
> > * Thread A that calls fdget_pos() on file1 and the reference count is
> >   one. So Thread A doesn't acquire the file position mutex for file1.
> > * Now somehow that file1 becomes shared, e.g., Thread B calls fget() on
> >   it and now Thread B does some operation that requires the file
> >   position mutex.
> > => Thread A and Thread B race on the file position.
> >
> > So just because you have a reference to a file from somewhere it doesn't
> > mean you can just share it with another thread.
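The fdget_pos() elision described above can be sketched in userspace Rust. This is an illustrative model, not the kernel implementation: `Arc::strong_count == 1` stands in for "f_count is 1 and the fd table is unshared", and the `FdPos` enum plays the role of the flag bits in `struct fd` that record whether the mutex was taken:

```rust
use std::sync::{Arc, Mutex, MutexGuard};

// Stand-in for `struct file` with an f_pos protected by a mutex.
struct File {
    f_pos: Mutex<u64>,
}

// Like `struct fd`, this remembers whether the position mutex was taken.
enum FdPos<'a> {
    Unlocked(&'a File),
    Locked(&'a File, MutexGuard<'a, u64>),
}

// Sketch of the fdget_pos() decision: with only one reference to the file
// (and, in the kernel, an unshared fd table) the position mutex can be
// elided; otherwise it must be taken.
fn fdget_pos(file: &Arc<File>) -> FdPos<'_> {
    if Arc::strong_count(file) == 1 {
        // Elided locking: nobody else can race on f_pos.
        FdPos::Unlocked(file)
    } else {
        FdPos::Locked(file, file.f_pos.lock().unwrap())
    }
}

fn main() {
    let f = Arc::new(File { f_pos: Mutex::new(0) });
    // Sole reference: the mutex is skipped.
    assert!(matches!(fdget_pos(&f), FdPos::Unlocked(_)));

    // The file becomes shared, e.g. via fget(): the mutex is now taken.
    let shared = Arc::clone(&f);
    assert!(matches!(fdget_pos(&f), FdPos::Locked(..)));
    drop(shared);
}
```

The Thread A/Thread B race falls out of this model directly: if the count is bumped *after* Thread A's unlocked `fdget_pos`, Thread B's later `fdget_pos` takes the mutex, but Thread A never did, so the two access `f_pos` without mutual exclusion.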
> >
> > So if you have an arbitrary reference to a file in Rust and that somehow
> > can be shared with another thread you risk races here.
> >
> >> abstractions here *do* actually allow what I described, since we
> >> implement Sync for File.
> >>
> >> I'm not familiar with the pidfd_getfd discussion you are referring to.
> >> Do you have a link?
> >
> > https://lore.kernel.org/linux-fsdevel/20230724-vfs-fdget_pos-v1-1-a4abfd7103f3@xxxxxxxxxx
> >
> > pidfd_getfd() can be used to steal a file descriptor from another task.
> > It's like a non-cooperative SCM_RIGHTS. That means you can have exactly
> > the scenario described above where a file assumed to be non-shared is
> > suddenly shared and you have racing reads/writes.
> >
> > For readdir we nowadays always take the file position mutex because of
> > the pidfd_getfd() business because that might corrupt internal state.
> >
> >>
> >> I'm thinking that we may have to provide two different `struct file`
> >> wrappers to accurately model this API in Rust. Perhaps they could be
> >> called File and LocalFile, where one is marked as thread safe and the
> >> other isn't. I can make all LocalFile methods available on File to avoid
> >> having to duplicate methods that are available on both.
> >
> > But isn't that just struct file and struct fd? Ideally we'd stay close
> > to something like this.
>
> Right, that kind of naming seems sensible. But I still need to
> understand the details a bit better. See below on fdget_pos.
>
> >> But it's not clear to me that this is even enough. Even if we give you a
> >> &LocalFile to prevent you from moving it across threads, you can just
> >> call File::fget to get an ARef<File> to the same file and then move
> >> *that* across threads.
> >
> > Yes, absolutely.
>
> One of my challenges is that Binder wants to call File::fget,
> immediately move it to another thread, and then call fd_install. And
> it would be pretty unfortunate if that requires unsafe.
> But like I argue below, it seems hard to design a safe API for this in
> the face of fdget_pos.
>
> >> This kind of global requirement is not so easy to model. Maybe klint [1]
> >> could do it ... atomic context violations are a similar kind of global
> >> check. But having klint do it would be far out.
> >>
> >> Or maybe File::fget should also return a LocalFile?
> >>
> >> But this raises a different question to me. Let's say process A uses
> >> Binder to send its own fd to process B, and the following things happen:
> >>
> >> 1. Process A enters the ioctl and takes fdget on the fd.
> >> 2. Process A calls fget on the same fd to send it to another process.
> >> 3. Process A goes to sleep, waiting for process B to respond.
> >> 4. Process B receives the message, installs the fd, and returns to userspace.
> >> 5. Process B responds to the transaction, but does not close the fd.
> >
> > The fd just installed in 4. and the fd you're referring to in 5. are
> > identical, right? IOW, we're not talking about two different fd (dup)
> > for the same file, right?
>
> I'm referring to whatever fd_install does given the `struct file` I got
> from fget in step 2.
>
> >> 6a. Process A finishes sleeping, and returns to userspace from the ioctl.
> >> 6b. Process B tries to do an operation (e.g. read) on the fd.
> >>
> >> Let's say that 6a and 6b run in parallel.
> >>
> >> Could this potentially result in a data race between step 6a and 6b? I'm
> >> guessing that step 6a probably doesn't touch any of the code that has
> >> elided locking assumptions, so in practice I guess there's not a problem
> >> ... but if you make any sort of elided locking assumption as you exit
> >> from the ioctl (before reaching the fdput), then it seems to me that you
> >> have a problem.
> >
> > Yes, 6a doesn't touch any code that has elided locking assumptions.
> >
> > 1'. Process A enters the ioctl and takes fdget() on the fd.
> >      Process A holds the only reference to that file and the file
> >      descriptor table isn't shared. Therefore, f_count is left untouched
> >      and remains at 1.
> > 2'. Process A calls fget() which unconditionally bumps f_count bringing
> >      it to 2, and sends it to another process (presumably you intend to
> >      imply that this reference is now owned by the second process).
> > 3'. [as 3.]
> > 4'. Process B installs the file into its file descriptor table
> >      consuming that reference from 2'. The f_count remains at 2 with the
> >      reference from 2' now being owned by Process B.
> > 5'. Since Process B isn't closing the fd and has just called
> >      fd_install() it returns to userspace with f_count untouched and
> >      still at 2.
> > 6'a. Process A finishes sleeping and returns to userspace calling
> >      fdput(). Since the original fdget() was done without bumping the
> >      reference count the fdput() of Process A will not decrement the
> >      reference count. So f_count remains at 2.
> > 6'b. Process B performs a read/write syscall and calls fdget_pos().
> >      fdget_pos() sees that this file has f_count > 1 and takes the
> >      file position mutex.
> >
> > So this isn't a problem. The problem is when a file becomes shared
> > implicitly without the original owner of the file knowing.
>
> Hmm. Yes, but the ioctl code that called fdget doesn't really know that
> the ioctl shared the file? So why is it okay?

Why does it matter to the ioctl() code itself? The ioctl() code never
calls fdget_pos().

> It really seems like there are two different things going on here. When
> it comes to fdget, we only really care about operations that could
> remove it from the local file descriptor table, since fdget relies on
> the refcount in that table remaining valid until fdput.

Yes.

> On the other hand, for fdget_pos it also matters whether it gets
> installed in other file descriptor tables. Threads that reference it
> through a different fd table will still access the same position.
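The f_count bookkeeping in the 1'.-6'. walkthrough above can be traced in userspace with `Arc` as a stand-in for the file's reference count (a sketch of the accounting only; the two `Vec`s are hypothetical stand-ins for the two processes' fd tables):

```rust
use std::sync::Arc;

struct File;

fn main() {
    // 1'. Process A opens the file; its fd table owns the only reference.
    //     fdget() is light here, so the count stays at 1.
    let table_a = vec![Arc::new(File)];
    assert_eq!(Arc::strong_count(&table_a[0]), 1);

    // 2'. fget() unconditionally bumps f_count to 2; this reference is
    //     destined for Process B.
    let for_b = Arc::clone(&table_a[0]);
    assert_eq!(Arc::strong_count(&for_b), 2);

    // 4'. Process B's fd_install() consumes that reference; still 2.
    let table_b = vec![for_b];
    assert_eq!(Arc::strong_count(&table_b[0]), 2);

    // 6'a. Process A's fdput() doesn't decrement (its fdget() was light):
    //      the count remains 2.
    // 6'b. Process B's fdget_pos() sees f_count > 1 and therefore takes
    //      the file position mutex.
    assert!(Arc::strong_count(&table_b[0]) > 1);
}
```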
Yes, they operate on the same f_pos.

> And so this means that between fdget/fdput, there's never any problem
> with installing the `struct file` into another file descriptor table.
> Nothing you can do from that other fd table could cause the local fd
> table to drop its refcount on the file. Whereas such an install can be
> a problem between fdget_pos/fdput_pos, since that could introduce a race
> on the position.
>
> Is this correct?

Yes, but that would imply you're sharing and installing a file into a
file descriptor table from a read/write/seek codepath. I don't see how
this can happen without something like e.g., pidfd_getfd(). And the
fd_install()ing task would then have to go back to userspace and issue
a concurrent read/write/seek system call while the other thread is
still reading/writing.

Overall, we really only care about f_pos consistency because POSIX
requires atomicity between reads/writes/seeks. For pidfd_getfd() where
such sharing can happen non-cooperatively we just don't care as we've
just declared this to be an instance where we're outside of POSIX
guarantees. And for readdir() we unconditionally acquire the mutex.

I think io_uring is racing on f_pos as well under certain circumstances
(REQ_F_CUR_POS?) as they don't use fdget_pos() at all. But iirc Jens
dislikes that they ever allowed that.

> I was thinking that if we have some sort of File/LocalFile distinction
> (or File/Fd), then we may be able to get it to work by limiting what a
> File can do. For example, let's say that the only thing you can do with
> a File is install it into fd tables, then by the previous logic, there's
> no problem with it being safe to move across threads even if there's an
> active fdget.
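The File/LocalFile (or File/Fd) split being discussed could look roughly like this in Rust. This is a hypothetical sketch of the idea, not the actual kernel crate API: `LocalFile` models a file obtained via fdget() and is made non-Send with a raw-pointer marker, while `File` models an owned, counted reference that may cross threads, with `LocalFile`'s methods reachable through `Deref` so they need not be duplicated:

```rust
use std::marker::PhantomData;
use std::ops::Deref;

// Models a file reachable via fdget(): must not leave the current thread.
// The raw-pointer marker makes this type !Send and !Sync.
struct LocalFile {
    _not_send: PhantomData<*mut ()>,
}

// Models an owned, counted reference (ARef<File>-like) that may cross
// threads, e.g. to be installed into another fd table.
struct File {
    inner: LocalFile,
}

// Methods defined on LocalFile are reachable through File as well.
impl Deref for File {
    type Target = LocalFile;
    fn deref(&self) -> &LocalFile {
        &self.inner
    }
}

// File is explicitly declared thread-safe; LocalFile deliberately is not.
unsafe impl Send for File {}
unsafe impl Sync for File {}

fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<File>();
    // assert_send_sync::<LocalFile>(); // rejected: LocalFile is !Send
    let f = File { inner: LocalFile { _not_send: PhantomData } };
    let _local: &LocalFile = &f; // LocalFile methods usable via Deref
}
```

As the thread notes, this alone doesn't close the fdget_pos hole: a `File::fget`-style constructor would still let a counted, Send reference be created while an unlocked fdget_pos is active.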
>
> But the fdget_pos kind of throws a wrench into that, because now I can
> no longer say "it's always safe to do File::fget, move it to another
> thread, and install it into the remote fd table", since that could cause
> races on the position if there's an active fdget_pos when we call
> File::fget.

I think I understand why that's a problem for you but let me try and
spell it out so I can understand where you're coming from: You want the
Rust compiler to reject a file becoming shared implicitly from any
codepath that is beneath fdget_pos() even though no such code yet
exists (ignoring pidfd_getfd()). So it's a correctness issue to you.

> >>>>>> +///
> >>>>>> +/// Light reference counts must be released with `fdput` before the system call returns to
> >>>>>> +/// userspace. This means that if you wait until the current system call returns to userspace, then
> >>>>>> +/// all light refcounts that existed at the time have gone away.
> >>>>>> +///
> >>>>>> +/// ## Rust references
> >>>>>> +///
> >>>>>> +/// The reference type `&File` is similar to light refcounts:
> >>>>>> +///
> >>>>>> +/// * `&File` references don't own a reference count. They can only exist as long as the reference
> >>>>>> +///   count stays positive, and can only be created when there is some mechanism in place to ensure
> >>>>>> +///   this.
> >>>>>> +///
> >>>>>> +/// * The Rust borrow-checker normally ensures this by enforcing that the `ARef<File>` from which
> >>>>>> +///   a `&File` is created outlives the `&File`.
> >>>>>
> >>>>> The section confuses me a little: Does the borrow-checker always ensure
> >>>>> that a &File stays valid or are there circumstances where it doesn't or
> >>>>> are you saying it doesn't enforce it?
> >>>>
> >>>> The borrow-checker always ensures it.
> >>>
> >>> Ok, thanks.
> >>>
> >>>>
> >>>> A &File is actually short-hand for &'a File, where 'a is some
> >>>> unspecified lifetime. We say that &'a File is annotated with 'a.
> >>>> The borrow-checker rejects any code that tries to use a reference after
> >>>> the end of the lifetime annotated on it.
> >>>
> >>> Thanks for the explanation.
> >>>
> >>>>
> >>>> So as long as you annotate the reference with a sufficiently short
> >>>> lifetime, the borrow checker will prevent UAF. And outside of cases like
> >>>
> >>> Sorry, but can you explain "sufficiently short lifetime"?
> >>
> >> By "sufficiently short lifetime" I mean "lifetime that ends before the
> >> object is destroyed".
> >
> > Ah, ok. It sounded like it was a specific concept that Rust is
> > implementing in contrast to long-term lifetime or something. Thanks!
> >
> >>
> >> Idea being that if the lifetime ends before the object is freed, and the
> >> borrow-checker rejects attempts to use it after the lifetime ends, then
> >> it follows that the borrow-checker prevents use-after-frees.
> >>
> >>>> from_ptr, the borrow-checker also takes care of ensuring that the
> >>>> lifetimes are sufficiently short.
> >>>>
> >>>> (Technically &'a File and &'b File are two completely different types,
> >>>> so &File is technically a class of types and not a single type. Rust
> >>>> will automatically convert &'long File to &'short File.)
> >>>>
> >>>>>> +///
> >>>>>> +/// * Using the unsafe [`File::from_ptr`] means that it is up to the caller to ensure that the
> >>>>>> +///   `&File` only exists while the reference count is positive.
> >>>>>
> >>>>> What is this used for in binder? If it's not used don't add it.
> >>>>
> >>>> This is used on the boundary between the Rust part of Binder and the
> >>>> binderfs component that is implemented in C. For example:
> >>>
> >>> I see, I'm being foiled by my own code...
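The "sufficiently short lifetime" point can be seen in a small userspace example, with `Arc<File>` standing in for `ARef<File>` (an illustrative sketch, not kernel code):

```rust
use std::sync::Arc;

struct File;

fn main() {
    // `Arc<File>` plays the role of `ARef<File>`: it owns a reference count.
    let owned: Arc<File> = Arc::new(File);

    // `&'a File`: the borrow checker requires 'a to end before `owned` is
    // dropped, i.e. the lifetime must be "sufficiently short".
    let borrowed: &File = &owned;
    assert_eq!(Arc::strong_count(&owned), 1); // a borrow is not a refcount

    // Using `borrowed` after `owned` is gone is rejected at compile time.
    // Swapping the two lines below fails with "cannot move out of `owned`
    // because it is borrowed" (E0505).
    let _ = borrowed;
    drop(owned);
}
```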
> >>>
> >>>> unsafe extern "C" fn rust_binder_open(
> >>>>     inode: *mut bindings::inode,
> >>>>     file_ptr: *mut bindings::file,
> >>>> ) -> core::ffi::c_int {
> >>>>     // SAFETY: The `rust_binderfs.c` file ensures that `i_private` is set to the return value of a
> >>>>     // successful call to `rust_binder_new_device`.
> >>>>     let ctx = unsafe { Arc::<Context>::borrow((*inode).i_private) };
> >>>>
> >>>>     // SAFETY: The caller provides a valid file pointer to a new `struct file`.
> >>>>     let file = unsafe { File::from_ptr(file_ptr) };
> >>>
> >>> We need a better name for this helper than from_ptr() imho. I think
> >>> from_ptr() and as_ptr() are odd for C. How weird would it be to call
> >>> them from_raw_file() and as_raw_file()?
> >>
> >> That's a reasonable name. I would be happy to rename to that, and I
> >> don't think it is unidiomatic.
> >
> > Thanks!
> >
> >>
> >>> But bigger picture I somewhat struggle with the semantics of this
> >>> because this is not an interface that we have in C and this is really
> >>> about a low-level contract between C and Rust. Specifically this states
> >>> that this pointer is _somehow_ guaranteed valid. And really, this seems
> >>> a bit of a hack.
> >>
> >> Indeed ... but I think it's a quite common hack. After all, any time you
> >> dereference a raw pointer in Rust, you are making the same assumption.
> >>
> >>> Naively, I think this should probably not be necessary if
> >>> file_operations are properly wrapped. Or it should at least be demotable
> >>> to a purely internal method that can't be called directly or something.
> >>
> >> Yes, the usage here of File::from_ptr could probably be hidden inside a
> >> suitably designed file_operations wrapper. The thing is, Rust Binder
> >> doesn't currently use such a wrapper at all. It just exports a global of
> >> type file_operations and the C code in binderfs then references that
> >> global.
> >
> > Yeah.
> >
> >> Rust Binder used to use such an abstraction, but I ripped it out before
> >> sending the Rust Binder RFC because it didn't actually help. It was
> >> designed for cases where the file system is also implemented in Rust, so
> >> to get it to expose a file_operations global to the C code in binderfs,
> >> I had to reach inside its internal implementation. It did not save me
> >> from doing stuff such as using File::from_ptr from Binder.
> >>
> >> Now, you could most likely modify those file_operations abstractions to
> >> support my use-case better. But calling into C is already unsafe, so
> >> unless we get multiple drivers that have a similar C/Rust split, it's
> >> not clear that it's useful to extract the logic from Binder. I would
> >> prefer to wait for the file_operations abstraction to get upstreamed by
> >> the people working on VFS bindings, and then we can decide whether we
> >> should rewrite binderfs into Rust and get rid of the logic, or whether
> >> it's worth expanding the file_operations abstraction to support split
> >> C/Rust drivers like the current binderfs.
> >>
> >>> So what I mean is: fdget() may or may not take a reference. The C
> >>> interface internally knows whether a reference is owned or not by
> >>> abusing the lower two bits in a pointer to keep track of that. Naively,
> >>> I would expect the same information to be available to Rust so that it's
> >>> clear to Rust whether it's dealing with an explicitly referenced file or
> >>> an elided-reference file. Maybe that's not possible and I'm not
> >>> well-versed enough to see that yet.
> >>
> >> I'm sure Rust can access the same information, but I don't think I'm
> >> currently doing anything that cares about the distinction?
> >
> > Ok.
> > My main goal is that we end up with an almost 1:1 correspondence
> > between the Rust and C interface so it's easy for current maintainers
> > and developers who don't want to care about Rust to continue to do so
> > and also just somewhat verify that changes they make are sane.
>
> Sure, that goal makes total sense to me.
>
> Alice