On Tue, 7 Nov 2023 at 21:56, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > On Tue, Nov 07, 2023 at 09:35:44PM -0300, Wedson Almeida Filho wrote: > > > > While we are at it, lookup is also very much not a per-filesystem operation. > > > Take a look at e.g. procfs, for an obvious example... > > > > The C api offers the greatest freedom: one could write a file system > > where each file has its own set of mapping_ops, inode_ops and > > file_ops; and while we could choose to replicate this freedom in Rust > > but we haven't. > > Too bad. > > > Mostly because we don't need it, and we've been repeatedly told (by > > Greg KH and others) not to introduce abstractions/bindings for > > anything for which there isn't a user. Besides being a longstanding > > rule in the kernel, they also say that they can't reasonably decide if > > the interfaces are good if they can't see the users. > > The interfaces are *already* there. If it's going to be a separate > set of operations for Rust and for the rest of the filesystems, we > have a major problem right there. The interfaces will be different but compatible -- they boil down to calls to/from C and follow all rules and requirements imposed by C. We just use Rust's type system to encode more of the rules into the interfaces so that the compiler will catch more bugs for us at compile time (and avoid memory safety issues if developers stay away from unsafe blocks). For example, if you want to attach non-zero-sized data to inodes of a given filesystem, in Rust we have a generic type: struct INodeWithData<T> { data: MaybeUninit<T>, inode: bindings::inode, } And we automatically implement alloc_inode() and destroy_inode() in super_operations. And all inodes in callbacks are typed so that developers never need to call container_of directly themselves. The compiler will catch, at compile time, any type mismatches without runtime cost. Another example: instead of implementing functions then declaring structs containing pointers to these functions (and potentially other fields), in Rust we expose "traits" that developers need to implement. Then we can control which functions are required/optional and allow developers to logically group them, as well as declare constants and related types (e.g., the additional struct, if any, to be allocated along with an inode in INodeWithData above). But in the end, these get translated (at compile time for const ops) into file_operations/address_space_operations/inode_operations. > > The existing Rust users (tarfs and puzzlefs) only need a single > > lookup. And a quick grep (git grep \\\.lookup\\\> -- fs/) appears to > > show that the vast majority of C filesystems only have a single lookup > > as well. So we choose simplicity, knowing well that we may have to > > revisit it in the future if the needs change. > > > > > Wait a minute... what in name of everything unholy is that thing doing tied > > > to inodes in the first place? > > > > For the same reason as above, we don't need it in our current > > filesystems. A bunch of C ones (e.g., xfs, ext2, romfs, erofs) only > > use the dentry to get the name and later call d_splice_alias(), so we > > hide the name extraction and call to d_splice_alias() in the > > "trampoline" function. > > What controls the lifecycle of that stuff from the Rust point of view? The same rules as C. inodes, for example, are ref-counted so while a callback that has an inode as argument is inflight, we know it (the inode) is referenced and we can just use it. If/when Rust code wants to hold on to a pointer to it beyond a callback, it needs to increment the refcount and release it when it's done. Here the type system also helps us: it guarantees that pointers to ref-counted objects are never dangling, if we ever try to hold on to a pointer without incrementing the refcount, we get a compile-time error (no additional runtime cost). We also have a common interface for _all_ C ref-counted objects, so instead of having to memorise that I should call ihold/iget for inodes, folio_get/folio_put for folios, get_task_struct/put_task_struct for tasks, etc., in Rust they're simply ARef<INode>, ARef<Folio>, ARef<Task> with automatic increment via clone() and decrement on destruction. There are a couple of issues that I alluded to in the cover letter but never actually wrote down, so I will describe them here to get your thoughts: First issue: VFS conflates filesystem unregistration with module unload. The description of unregister_filesystem() states: * Once this function has returned the &struct file_system_type structure * may be freed or reused. When a filesystem is mounted, VFS calls get_filesystem() to presumably prevent the filesystem from unregistering, and it calls put_filesystem() when deactivating a super-block. But get/put_filesystem() are implemented as module_get/put(). So this works well if unregister_filesystem is only ever called when modules are unloaded. It doesn't seem to help if it's called anywhere else (e.g., on failure paths of module load). Here's an example: init_f2fs_fs() calls init_inodecache() to allocate an inode cache, then eventually calls register_filesystem(). Let's suppose at this point, another CPU actually mounts an instance of an f2fs fs. After register_filesystem() in init_f2fs_fs(), there is a bunch of extra failure paths; let's suppose f2fs_init_compress_mempool() fails. The exit path will call unregister_filesystem() (which prevents new superblocks from being created, but the existing superblocks continue to exist), then eventually destroy_inodecache(), which frees the cache from which all inodes of the existing superblock have been allocated and we have a bunch of potential user-after-frees. Granted that the module will not be unloaded immediately (it will wait for all references, including the ones by get_filesystem() to go away), so we won't have an issue with the callbacks being called to unloaded memory. But if we recycle f2fs_fs_type, which unregister_filesystem claims to be safe, we'll also have user-after-frees there. (Note that the example above doesn't require unload at all.) I think we could fix this by having a different implementation of get/put_filesystem() that keeps track of a count for filesystem usage (in addition to avoiding module unload), and only completing unregister_filesystem when it goes to zero. Would you be interested in a patch for this? Second issue: Leaked inodes: if a filesystem leaks inodes, then after unregistration most implementations will just free the kmemcache from which they came, so future attempts to use these leaked inodes (it's possible they've been stored in some list somewhere) will lead to user-after-frees. Is there anything we can do improve this? Should we prevent unregister_filesystem() from completing in such cases? Thanks, -Wedson