On Tue, Feb 19, 2019 at 12:51 PM Boaz harrosh <boaz@xxxxxxxxxxxxx> wrote:
>
> +4. An FS operation like create or WRITE/READ and so on arrives from application
> +   via VFS. Eventually an Operation is dispatched to zus:
> +   ▪ A special per-operation descriptor is filled up with all parameters.
> +   ▪ A current CPU channel is grabbed. the operation descriptor is put on
> +     that channel (ZT). Including get_user_pages or Kernel-pages associated
> +     with this OPT.
> +   ▪ The ZT is awaken, app thread put to sleep.
> +   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
> +     the map is only on a single core. And no other core's TLB is affected.
> +     (This here is the all performance secret)

I still don't get it. You say mapping the page into the server address
space is the performance secret. I say it's a security issue as well as
being perfectly useless, except for special cases.

So let's see. There's the pmem case, which seems to be what this is
mostly about. So we take a pmem filesystem and e.g. a read(2) syscall.
There's no zero copy to talk about in that case, since data needs to be
copied from pmem to the application buffer. If we want to minimize
memory copies, then it will be a single memory copy from pmem to the
app buffer.

Your implementation chooses to do this copy in the userland server, via
the application buffer mapped into its address space. But the memory
copy could just as well be done by the kernel, from one virtual address
to another; the kernel just has to juggle with physical page lookups,
but it does not have to establish a new mapping, which makes it all the
more secure and performant.

Sure, we've heard arguments about why the above doesn't work if the
data comes from e.g. a network, where the network driver could be
writing data directly to the application buffer. But I've shown how
that could also be done with a new rsplice() interface, again without
having to insert the application page into the server address space.
And no, this doesn't have to involve syscall overhead, if that turns
out to be the limiting factor: mapped pipes might be an interesting new
concept as well.

The only way I see this implementation actually saving a memory copy is
if a transformation is required (which, again, already implies at least
a single memory copy), such as compression or encryption. But even that
could be delegated to the kernel, since it has plenty of such
transformations implemented; only an interface is required to tell the
kernel which one to apply to the buffer.

Thanks,
Miklos