On 3/3/20 12:02 PM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 09:55 -0700, Jens Axboe wrote:
>> On 3/3/20 9:51 AM, Jeff Layton wrote:
>>> On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
>>>> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
>>>>> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>>>>>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>>>>>> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote:
>>>>>>>> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
>>>>>>>>> On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
>>>>>>>>> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>>> Unlimited beers for a 21-line kernel patch? Sign me up!
>>>>>>>>>>>
>>>>>>>>>>> Totally untested, barely compiled patch below.
>>>>>>>>>>
>>>>>>>>>> Ok, that didn't even build, let me try this for real now...
>>>>>>>>>
>>>>>>>>> Some comments on the interface:
>>>>>>>>
>>>>>>>> Ok, hey, let's do this proper :)
>>>>>>>
>>>>>>> Alright, how about this patch.
>>>>>>>
>>>>>>> Actually tested with some simple sysfs files.
>>>>>>>
>>>>>>> If people don't strongly object, I'll add "real" tests to it, hook it
>>>>>>> up to all arches, write a manpage, and all the fun fluff a new syscall
>>>>>>> deserves, and submit it "for real".
>>>>>>
>>>>>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>>>>>> you can already use it to batch a bunch of open() calls, then batch a
>>>>>> bunch of read() calls on all the new fds and close them at the same
>>>>>> time. And I think they're planning to add support for doing
>>>>>> open()+read()+close() all in one go, too, except that passing forward
>>>>>> the file descriptor in a generic way is a bit complicated.
>>>>>
>>>>> It is complicated, and I wouldn't recommend using io_uring for reading
>>>>> a bunch of procfs or sysfs files; that feels like a ton of overkill,
>>>>> with too much setup/teardown to make it worthwhile.
>>>>>
>>>>> But maybe not, will have to watch and see how it goes.
>>>>
>>>> It really isn't, and I too think it makes more sense than having a
>>>> system call just for the explicit purpose of open/read/close. As Jann
>>>> said, you can't currently do a linked sequence of open/read/close,
>>>> because the fd passing between them isn't done. But that will come in
>>>> the future. If the use case is "a bunch of files", then you could
>>>> trivially do "open bunch", "read bunch", "close bunch" in three
>>>> separate steps.
>>>>
>>>> Curious what the use case is for this that warrants a special system
>>>> call?
>>>>
>>>
>>> Agreed. I'd really rather see something more general-purpose than the
>>> proposed readfile(). At least with NFS and SMB, you can compound
>>> together fairly arbitrary sorts of operations, and it'd be nice to be
>>> able to pattern calls into the kernel for those sorts of uses.
>>>
>>> NFSv4 has the concept of a current_stateid that is maintained by the
>>> server, so you can basically do all of this (e.g.) in a single compound:
>>>
>>> open <some filehandle, get a stateid>
>>> write <using that stateid>
>>> close <same stateid>
>>>
>>> It'd be nice to be able to do something similar with io_uring: make it
>>> so that when you do an open, you set the "current fd" inside the
>>> kernel's context, and then be able to issue io_uring requests that
>>> specify a magic "fd" value that uses it.
>>>
>>> That would be a really useful pattern.
>>
>> For io_uring, you can link requests that you submit into a chain. Each
>> link in the chain is done in sequence. Which means that you could do:
>>
>> <open some file><read from that file><close that file>
>>
>> in a single sequence. The only thing that is missing right now is a way
>> to have the return of that open propagated to the 'fd' of the read and
>> close, and it's actually one of the topics to discuss at LSFMM next
>> month.
>>
>> One approach would be to use BPF to handle this passing; another
>> suggestion has been to have the read/close specify some magic 'fd' value
>> that just means "inherit fd from result of previous". The latter sounds
>> very close to the stateid you mention above, and the upside here is that
>> it wouldn't force BPF into the required toolchain.
>>
>> In other words, this is really close to being reality and practically
>> feasible.
>>
>
> Excellent.
>
> Yes, the latter is exactly what I had in mind for this. I suspect that
> it would cover a large fraction of the potential use cases.
>
> Basically, all you'd need to do is keep a pointer to struct file in the
> internal state for the chain. Then, allow userland to specify some magic
> fd value for subsequent chained operations that says to use that instead
> of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)?

Yeah, I think that'd be a suitable way to signal it.

> That would cover the smb or nfs server sort of use cases, I think. For
> the sysfs cases, I guess you'd need to dispatch several chains, but that
> doesn't sound _too_ onerous.

The magic fd would be per-chain, so doing multiple chains wouldn't really
matter at all.

Let me try and hack this up; it should be pretty trivial.

> In fact, with that you should even be able to emulate the proposed
> readfile() syscall in a userland library.

Exactly.

--
Jens Axboe
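As a rough illustration of the "open bunch", "read bunch", "close bunch"
pattern described above, here is a minimal, untested sketch using liburing.
It assumes a kernel and liburing recent enough to support IORING_OP_OPENAT
and IORING_OP_CLOSE (only just landing around the time of this thread); the
/proc paths, buffer sizes, and ring size are arbitrary placeholders.

/*
 * Sketch only: read a handful of /proc files by batching the opens,
 * then the reads, then the closes, as three separate submit/reap
 * rounds.  Build (roughly): gcc -o bunch bunch.c -luring
 */
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

#define NFILES 3

static const char *paths[NFILES] = {
	"/proc/version", "/proc/uptime", "/proc/loadavg",
};

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char bufs[NFILES][256];
	int fds[NFILES];
	int step, i;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* step 0: open bunch, step 1: read bunch, step 2: close bunch */
	for (step = 0; step < 3; step++) {
		for (i = 0; i < NFILES; i++) {
			sqe = io_uring_get_sqe(&ring);
			if (!sqe)
				return 1;
			if (step == 0)
				io_uring_prep_openat(sqe, AT_FDCWD, paths[i],
						     O_RDONLY, 0);
			else if (step == 1)
				io_uring_prep_read(sqe, fds[i], bufs[i],
						   sizeof(bufs[i]) - 1, 0);
			else
				io_uring_prep_close(sqe, fds[i]);
			sqe->user_data = i;	/* remember which file */
		}
		io_uring_submit(&ring);

		for (i = 0; i < NFILES; i++) {
			if (io_uring_wait_cqe(&ring, &cqe) < 0)
				return 1;
			if (step == 0)
				fds[cqe->user_data] = cqe->res;
			else if (step == 1 && cqe->res >= 0)
				bufs[cqe->user_data][cqe->res] = '\0';
			io_uring_cqe_seen(&ring, cqe);
		}
	}

	for (i = 0; i < NFILES; i++)
		printf("%s: %s", paths[i], bufs[i]);

	io_uring_queue_exit(&ring);
	return 0;
}

Error handling is deliberately thin; a real version would check cqe->res
for each open and skip the read/close of any file that failed to open.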
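And this is roughly what the linked open/read/close chain discussed above
might look like from userspace once the fd passing exists. The IOSQE_IO_LINK
flag and the prep helpers are real, but the FD_FROM_PREV sentinel below
(using the -4096 / -MAX_ERRNO - 1 value suggested in the thread) is purely
hypothetical; no kernel interprets it at the time of this thread, so treat
this as a sketch of the proposed shape rather than runnable code.

/*
 * Speculative sketch: a linked open -> read -> close chain where the
 * read and close inherit the fd produced by the open.  FD_FROM_PREV is
 * a made-up sentinel for "use the result of the previous link"; it is
 * not an existing kernel interface.
 */
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

#define FD_FROM_PREV	(-4096)		/* hypothetical: -MAX_ERRNO - 1 */

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[4096];
	int i;

	if (io_uring_queue_init(4, &ring, 0) < 0)
		return 1;

	/* link 1: open the file, producing the fd for the rest of the chain */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_openat(sqe, AT_FDCWD, "/proc/version", O_RDONLY, 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = 1;

	/* link 2: read, naming the fd via the hypothetical sentinel */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, FD_FROM_PREV, buf, sizeof(buf) - 1, 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = 2;

	/* link 3: close the same fd, ending the chain */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_close(sqe, FD_FROM_PREV);
	sqe->user_data = 3;

	io_uring_submit(&ring);

	/* reap the three completions; the read's res is the byte count */
	for (i = 0; i < 3; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		if (cqe->user_data == 2 && cqe->res >= 0) {
			buf[cqe->res] = '\0';
			printf("%s", buf);
		}
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}

This single-submit shape is also essentially how a userland library could
emulate the proposed readfile(), as noted at the end of the thread.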