Arnd, >> From: Xin Xiaohui <xiaohui.xin@xxxxxxxxx> >> >> Add a device to utilize the vhost-net backend driver for >> copy-less data transfer between guest FE and host NIC. >> It pins the guest user space to the host memory and >> provides proto_ops as sendmsg/recvmsg to vhost-net. >Sorry for taking so long before finding the time to look >at your code in more detail. >It seems that you are duplicating a lot of functionality that >is already in macvtap. I've asked about this before but then >didn't look at your newer versions. Can you explain the value >of introducing another interface to user land? >I'm still planning to add zero-copy support to macvtap, >hopefully reusing parts of your code, but do you think there >is value in having both? I have not looked into your macvtap code in detail before. Does the two interface exactly the same? We just want to create a simple way to do zero-copy. Now it can only support vhost, but in future we also want it to support directly read/write operations from user space too. Basically, compared to the interface, I'm more worried about the modification to net core we have made to implement zero-copy now. If this hardest part can be done, then any user space interface modifications or integrations are more easily to be done after that. >> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c >> new file mode 100644 >> index 0000000..86d2525 >> --- /dev/null >> +++ b/drivers/vhost/mpassthru.c > >@@ -0,0 +1,1264 @@ > >+ > >+#ifdef MPASSTHRU_DEBUG >> +static int debug; >> + >> +#define DBG if (mp->debug) printk > >+#define DBG1 if (debug == 2) printk > >+#else > >+#define DBG(a...) > >+#define DBG1(a...) > >+#endif >This should probably just use the existing dev_dbg/pr_debug infrastructure. Thanks. Will try that. > [... skipping buffer management code for now] > >+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock, > >+ struct msghdr *m, size_t total_len) > >+{ > >[...] >This function looks like we should be able to easily include it into >macvtap and get zero-copy transmits without introducing the new >user-level interface. >> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock, >> + struct msghdr *m, size_t total_len, >> + int flags) >> +{ >> + struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp; >> + struct page_ctor *ctor; >> + struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private); >It smells like a layering violation to look at the iocb->private field >from a lower-level driver. I would have hoped that it's possible to implement >this without having this driver know about the higher-level vhost driver >internals. Can you explain why this is needed? I don't like this too, but since the kiocb is maintained by vhost with a list_head. And mp device is responsible to collect the kiocb into the list_head, We need something known by vhost/mp both. >> + spin_lock_irqsave(&ctor->read_lock, flag); >> + list_add_tail(&info->list, &ctor->readq); > >+ spin_unlock_irqrestore(&ctor->read_lock, flag); > >+ > >+ if (!vq->receiver) { > >+ vq->receiver = mp_recvmsg_notify; > >+ set_memlock_rlimit(ctor, RLIMIT_MEMLOCK, > >+ vq->num * 4096, > >+ vq->num * 4096); > >+ } > >+ > >+ return 0; > >+} >Not sure what I'm missing, but who calls the vq->receiver? This seems >to be neither in the upstream version of vhost nor introduced by your >patch. See Patch v3 2/3 I have sent out, it is called by handle_rx() in vhost. >> +static void __mp_detach(struct mp_struct *mp) >> +{ > >+ mp->mfile = NULL; > >+ > >+ mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP); > >+ page_ctor_detach(mp); > >+ mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP); > >+ > >+ /* Drop the extra count on the net device */ > >+ dev_put(mp->dev); > >+} > >+ > >+static DEFINE_MUTEX(mp_mutex); > >+ > >+static void mp_detach(struct mp_struct *mp) > >+{ > >+ mutex_lock(&mp_mutex); > >+ __mp_detach(mp); > >+ mutex_unlock(&mp_mutex); > >+} > >+ > >+static void mp_put(struct mp_file *mfile) > >+{ > >+ if (atomic_dec_and_test(&mfile->count)) > >+ mp_detach(mfile->mp); > >+} > >+ > >+static int mp_release(struct socket *sock) > >+{ > >+ struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp; > >+ struct mp_file *mfile = mp->mfile; > >+ > >+ mp_put(mfile); > >+ sock_put(mp->socket.sk); > >+ put_net(mfile->net); > >+ > >+ return 0; > >+} >Doesn't this prevent the underlying interface from going away while the chardev >is open? You also have logic to handle that case, so why do you keep the extra >reference on the netdev? Let me think. >> +/* Ops structure to mimic raw sockets with mp device */ >> +static const struct proto_ops mp_socket_ops = { >> + .sendmsg = mp_sendmsg, >> + .recvmsg = mp_recvmsg, >> + .release = mp_release, >> +}; >> +static int mp_chr_open(struct inode *inode, struct file * file) >> +{ >> + struct mp_file *mfile; >>+ cycle_kernel_lock(); >I don't think you really want to use the BKL here, just kill that line. >> +static long mp_chr_ioctl(struct file *file, unsigned int cmd, >> + unsigned long arg) >> +{ >> + struct mp_file *mfile = file->private_data; >> + struct mp_struct *mp; >> + struct net_device *dev; >> + void __user* argp = (void __user *)arg; >> + struct ifreq ifr; >> + struct sock *sk; >> + int ret; >> + >> + ret = -EINVAL; >> + >> + switch (cmd) { >> + case MPASSTHRU_BINDDEV: >> + ret = -EFAULT; >> + if (copy_from_user(&ifr, argp, sizeof ifr)) >> + break; >This is broken for 32 bit compat mode ioctls, because struct ifreq >is different between 32 and 64 bit systems. Since you are only >using the device name anyway, a fixed length string or just the >interface index would be simpler and work better. Thanks, will look into this. >> + ifr.ifr_name[IFNAMSIZ-1] = '\0'; >> + >> + ret = -EBUSY; >> + >> + if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL) >> + break; >Your current use of the IFF_MPASSTHRU* flags does not seem to make >any sense whatsoever. You check that this flag is never set, but set >it later yourself and then ignore all flags. Using that flag is tried to prevent if another one wants to bind the same device Again. But I will see if it really ignore all other flags. >> + ret = -ENODEV; >> + dev = dev_get_by_name(mfile->net, ifr.ifr_name); >> + if (!dev) >> + break; >There is no permission checking on who can access what device, which >seems a bit simplistic. Any user that has access to the mpassthru device >seems to be able to bind to any network interface in the namespace. >This is one point where the macvtap model seems more appropriate, it >separates the permissions for creating logical interfaces and using them. Yes, that's a problem I have not addressed yet. >> +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov, >> + unsigned long count, loff_t pos) >> +{ >> + struct file *file = iocb->ki_filp; >> + struct mp_struct *mp = mp_get(file->private_data); >> + struct sock *sk = mp->socket.sk; >> + struct sk_buff *skb; >> + int len, err; >> + ssize_t result; >Can you explain what this function is even there for? AFAICT, vhost-net >doesn't call it, the interface is incompatible with the existing >tap interface, and you don't provide a read function. I saw Michael have given the answer already. >> diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h >> new file mode 100644 >> index 0000000..2be21c5 >> --- /dev/null >> +++ b/include/linux/mpassthru.h >> @@ -0,0 +1,29 @@ >> +#ifndef __MPASSTHRU_H >> +#define __MPASSTHRU_H >> + > >+#include <linux/types.h> >> +#include <linux/if_ether.h> >> + >> +/* ioctl defines */ >> +#define MPASSTHRU_BINDDEV _IOW('M', 213, int) > >+#define MPASSTHRU_UNBINDDEV _IOW('M', 214, int) >These definitions are slightly wrong, because you pass more than just an 'int'. Thanks. I wrote them too quickly. :-( >> +/* MPASSTHRU ifc flags */ >> +#define IFF_MPASSTHRU 0x0001 >> +#define IFF_MPASSTHRU_EXCL 0x0002 >As mentioned above, these flags don't make any sense with your current code. I used them try to prevent the one who want to bind the same device again. Arnd -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html