On Fri, 8 Feb 2013 09:11:17 +0100 Martin Sustrik <sustrik@xxxxxxxxxx> wrote: > When implementing network protocols in user space, one has to implement > fake user-space file descriptors to represent the sockets for the protocol. > > While all the BSD socket API functionality for such descriptors may be faked as > well (myproto_send(), myproto_recv() etc.) this approach doesn't work for > polling (select, poll, epoll). And unfortunately, sockets that can't be polled > on allow only for building the simplest possible applications. Basically, you > can build a simple client, but once you want to implement a server handling > many sockets in parallel, you are stuck. > > However, to do polling, real system-level file descriptor is needed, > not a fake one. > > In theory, eventfd may be used for this purpose, except that it is well suited > only for signaling POLLIN. With some hacking it can be also used to signal > POLLOUT and POLLERR, but: > > I. There's no way to signal POLLPRI, POLLHUP etc. > II. There's no way to signal arbitraty combination of POLL* flags. Most notably, > !POLLIN & !POLLOUT, which is a perfectly valid combination for a network > protocol (rx buffer is empty and tx buffer is full), cannot be signaled > using current implementation of eventfd. > > This patch implements new EFD_MASK flag which attempts to solve this problem. > > Additionally, when implementing network protocols in user space, there's a > need to associate user-space state with the each "socket". If eventfd object is > used as a reference to the socket, it should be possible to associate an opaque > pointer to user-space data with it. > > The semantics of EFD_MASK are as follows: > > eventfd(2): > > If eventfd is created with EFD_MASK flag set, it is initialised in such a way > as to signal no events on the file descriptor when it is polled on. 'initval' > argument is ignored. > > write(2): > > User is allowed to write only buffers containing the following structure: > > struct efd_mask { > short events; > union { > void *ptr; > uint32_t u32; > uint64_t u64; > }; > }; > > The value of 'events' should be any combination of event flags as defined by > poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will > be signaled when polling (select, poll, epoll) on the eventfd is done later on. > ptr, u32 and u63 are opaque data that are not interpreted by eventfd object. > > read(2): > > User is allowed to read an efd_mask structure from the eventfd marked by > EFD_MASK. Returned value shall be the last one written to the eventfd. > > select(2), poll(2) and similar: > > When polling on the eventfd marked by EFD_MASK flag, all the events specified > in last written 'events' field shall be signaled. > > ... This patch adds userspace interfaces which will require manpage updates. Please Cc Michael and work with him on getting those changes completed. > > +/* On x86-64 keep the same binary layout as on i386. */ > +#ifdef __x86_64__ > +#define EVENTFD_MASK_PACKED __packed > +#else > +#define EVENTFD_MASK_PACKED > +#endif > + > +struct eventfd_mask { > + __u32 events; > + __u64 data; > +} EVENTFD_MASK_PACKED; The x86-64 specific thing is ugly. I can find no explanation of why it was done, but it should go away. You could make `events' a u64, or swap the order of the two fields and make the struct __packed on all architectures. Given that the size of the types is fixed, I see no compat issues here. As this struct is known by userspace, this definition should appear in a header which is available to usersapce: include/uapi/... > struct eventfd_ctx { > struct kref kref; > wait_queue_head_t wqh; > - /* > - * Every time that a write(2) is performed on an eventfd, the > - * value of the __u64 being written is added to "count" and a > - * wakeup is performed on "wqh". A read(2) will return the "count" > - * value to userspace, and will reset "count" to zero. The kernel > - * side eventfd_signal() also, adds to the "count" counter and > - * issue a wakeup. > - */ > - __u64 count; > + union { > + /* > + * Every time that a write(2) is performed on an eventfd, the > + * value of the __u64 being written is added to "count" and a > + * wakeup is performed on "wqh". A read(2) will return the > + * "count" value to userspace, and will reset "count" to zero. > + * The kernel side eventfd_signal() also, adds to the "count" > + * counter and issue a wakeup. > + */ > + __u64 count; > + struct eventfd_mask mask; The nice explanation for `count' was retained, but is it appropriate that `mask' have no explanation? > + }; > unsigned int flags; > }; > > ... > > @@ -230,13 +261,23 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, > ssize_t res; > __u64 cnt; > > - if (count < sizeof(cnt)) > - return -EINVAL; > - res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt); > - if (res < 0) > + if (ctx->flags & EFD_MASK) { > + spin_lock_irq(&ctx->wqh.lock); > + if (count < sizeof(ctx->mask)) > + return -EINVAL; > + res = copy_to_user(buf, &ctx->mask, sizeof(ctx->mask)) ? > + -EFAULT : sizeof(ctx->mask); > + spin_unlock_irq(&ctx->wqh.lock); > return res; This code is crawling with bugs. - can return with wqh.lock held -> dead kernel - performs copy_to_user() under spinlock -> warning spew, kernel deadlocks. It should go via a local temporary, as was done in eventfd_write(). This should have filled your screen with warnings when testing. Either it wasn't tested or its author forgot to read Documentation/SubmitChecklist section 12. Please do so ;) (otoh maybe might_sleep and lockdep fail to detect copy_*_user under spinlock when the copy doesn't fault. If so, that's a big fail) > - > - return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt); > + } else { > + if (count < sizeof(cnt)) > + return -EINVAL; > + res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt); > + if (res < 0) > + return res; > + return put_user(cnt, (__u64 __user *) buf) ? > + -EFAULT : sizeof(cnt); > + } > } > > static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count, > @@ -246,6 +287,23 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c > ssize_t res; > __u64 ucnt; > DECLARE_WAITQUEUE(wait, current); > + struct eventfd_mask mask; > + > + if (ctx->flags & EFD_MASK) { > + if (count < sizeof(mask)) > + return -EINVAL; > + if (copy_from_user(&mask, buf, sizeof(mask))) > + return -EFAULT; > + if (mask.events & ~EFD_MASK_VALID_EVENTS) > + return -EINVAL; > + spin_lock_irq(&ctx->wqh.lock); > + memcpy(&ctx->mask, &mask, sizeof(ctx->mask)); > + if (waitqueue_active(&ctx->wqh)) > + wake_up_locked_poll(&ctx->wqh, > + (unsigned long)ctx->mask.events); > + spin_unlock_irq(&ctx->wqh.lock); > + return sizeof(ctx->mask); > + } `mask' can be made local to this `if' block, which is nicer. > if (count < sizeof(ucnt)) > return -EINVAL; > @@ -293,8 +351,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f) > int ret; > > spin_lock_irq(&ctx->wqh.lock); > - ret = seq_printf(m, "eventfd-count: %16llx\n", > - (unsigned long long)ctx->count); > + if (ctx->flags & EFD_MASK) { > + ret = seq_printf(m, "eventfd-mask: %x\n", > + (unsigned)ctx->mask.events); > + } else { > + ret = seq_printf(m, "eventfd-count: %16llx\n", > + (unsigned long long)ctx->count); > + } > spin_unlock_irq(&ctx->wqh.lock); This is a non-back-compatible userspace interface change. A procfs file which previously displayed eventfd-count: nnnn can now also display eventfd-mask: nnnn So existing userspace could misbehave. Please fully describe the proposed interface change in the changelog. That description should include the full pathname of the procfs file and example before-and-after output and a discussion of whether and why the risk to existing userspace is acceptable. > @@ -412,7 +475,12 @@ struct file *eventfd_file_create(unsigned int count, int flags) > > kref_init(&ctx->kref); > init_waitqueue_head(&ctx->wqh); > - ctx->count = count; > + if (flags & EFD_MASK) { > + ctx->mask.events = 0; > + ctx->mask.data = 0; > + } else { > + ctx->count = count; > + } > ctx->flags = flags; > > file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx, > diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h > index 3c3ef19..b806d2b 100644 > --- a/include/linux/eventfd.h > +++ b/include/linux/eventfd.h > @@ -20,11 +20,12 @@ > * shared O_* flags. > */ > #define EFD_SEMAPHORE (1 << 0) > +#define EFD_MASK (1 << 1) Does this addition comply with the "CAREFUL:" immediately above it? It would be best to add code comemntary describing what this constant does. > #define EFD_CLOEXEC O_CLOEXEC > #define EFD_NONBLOCK O_NONBLOCK > > #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) > -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE) > +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK) > > #ifdef CONFIG_EVENTFD -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html