Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri,  8 Feb 2013 09:11:17 +0100
Martin Sustrik <sustrik@xxxxxxxxxx> wrote:

> When implementing network protocols in user space, one has to implement
> fake user-space file descriptors to represent the sockets for the protocol.
> 
> While all the BSD socket API functionality for such descriptors may be faked as
> well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
> polling  (select, poll, epoll). And unfortunately, sockets that can't be polled
> on allow only for building the simplest possible applications. Basically, you
> can build a simple client, but once you want to implement a server handling
> many sockets in parallel, you are stuck.
> 
> However, to do polling, real system-level file descriptor is needed,
> not a fake one.
> 
> In theory, eventfd may be used for this purpose, except that it is well suited
> only for signaling POLLIN. With some hacking it can be also used to signal
> POLLOUT and POLLERR, but:
> 
> I.  There's no way to signal POLLPRI, POLLHUP etc.
> II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
>     !POLLIN & !POLLOUT, which is a perfectly valid combination for a network
>     protocol (rx buffer is empty and tx buffer is full), cannot be signaled
>     using current implementation of eventfd.
> 
> This patch implements new EFD_MASK flag which attempts to solve this problem.
> 
> Additionally, when implementing network protocols in user space, there's a
> need to associate user-space state with the each "socket". If eventfd object is
> used as a reference to the socket, it should be possible to associate an opaque
> pointer to user-space data with it.
> 
> The semantics of EFD_MASK are as follows:
> 
> eventfd(2):
> 
> If eventfd is created with EFD_MASK flag set, it is initialised in such a way
> as to signal no events on the file descriptor when it is polled on. 'initval'
> argument is ignored.
> 
> write(2):
> 
> User is allowed to write only buffers containing the following structure:
> 
> struct efd_mask {
>   short events;
>   union {
>     void *ptr;
>     uint32_t u32;
>     uint64_t u64;
>   };
> };
> 
> The value of 'events' should be any combination of event flags as defined by
> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
> be signaled when polling (select, poll, epoll) on the eventfd is done later on.
> ptr, u32 and u63 are opaque data that are not interpreted by eventfd object.
> 
> read(2):
> 
> User is allowed to read an efd_mask structure from the eventfd marked by
> EFD_MASK. Returned value shall be the last one written to the eventfd.
> 
> select(2), poll(2) and similar:
> 
> When polling on the eventfd marked by EFD_MASK flag, all the events specified
> in last written 'events' field shall be signaled.
> 
> ...

This patch adds userspace interfaces which will require manpage
updates.  Please Cc Michael and work with him on getting those changes
completed.

>
> +/*  On x86-64 keep the same binary layout as on i386. */
> +#ifdef __x86_64__
> +#define EVENTFD_MASK_PACKED __packed
> +#else
> +#define EVENTFD_MASK_PACKED
> +#endif
> +
> +struct eventfd_mask {
> +	__u32 events;
> +	__u64 data;
> +} EVENTFD_MASK_PACKED;

The x86-64 specific thing is ugly.  I can find no explanation of why it
was done, but it should go away.  You could make `events' a u64, or
swap the order of the two fields and make the struct __packed on all
architectures.

Given that the size of the types is fixed, I see no compat issues here.

As this struct is known by userspace, this definition should appear in
a header which is available to usersapce: include/uapi/...

>  struct eventfd_ctx {
>  	struct kref kref;
>  	wait_queue_head_t wqh;
> -	/*
> -	 * Every time that a write(2) is performed on an eventfd, the
> -	 * value of the __u64 being written is added to "count" and a
> -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> -	 * value to userspace, and will reset "count" to zero. The kernel
> -	 * side eventfd_signal() also, adds to the "count" counter and
> -	 * issue a wakeup.
> -	 */
> -	__u64 count;
> +	union {
> +		/*
> +		 * Every time that a write(2) is performed on an eventfd, the
> +		 * value of the __u64 being written is added to "count" and a
> +		 * wakeup is performed on "wqh". A read(2) will return the
> +		 * "count" value to userspace, and will reset "count" to zero.
> +		 * The kernel side eventfd_signal() also, adds to the "count"
> +		 * counter and issue a wakeup.
> +		 */
> +		__u64 count;
> +		struct eventfd_mask mask;

The nice explanation for `count' was retained, but is it appropriate
that `mask' have no explanation?

> +	};
>  	unsigned int flags;
>  };
>  
> ...
>
> @@ -230,13 +261,23 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
>  	ssize_t res;
>  	__u64 cnt;
>  
> -	if (count < sizeof(cnt))
> -		return -EINVAL;
> -	res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
> -	if (res < 0)
> +	if (ctx->flags & EFD_MASK) {
> +		spin_lock_irq(&ctx->wqh.lock);
> +		if (count < sizeof(ctx->mask))
> +			return -EINVAL;
> +		res = copy_to_user(buf, &ctx->mask, sizeof(ctx->mask)) ?
> +			-EFAULT : sizeof(ctx->mask);
> +		spin_unlock_irq(&ctx->wqh.lock);
>  		return res;

This code is crawling with bugs.

- can return with wqh.lock held -> dead kernel

- performs copy_to_user() under spinlock -> warning spew, kernel
  deadlocks.  It should go via a local temporary, as was done in
  eventfd_write().

This should have filled your screen with warnings when testing.  Either
it wasn't tested or its author forgot to read
Documentation/SubmitChecklist section 12.  Please do so ;)

(otoh maybe might_sleep and lockdep fail to detect copy_*_user under
spinlock when the copy doesn't fault.  If so, that's a big fail)

> -
> -	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
> +	} else {
> +		if (count < sizeof(cnt))
> +			return -EINVAL;
> +		res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
> +		if (res < 0)
> +			return res;
> +		return put_user(cnt, (__u64 __user *) buf) ?
> +			-EFAULT : sizeof(cnt);
> +	}
>  }
>  
>  static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> @@ -246,6 +287,23 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
>  	ssize_t res;
>  	__u64 ucnt;
>  	DECLARE_WAITQUEUE(wait, current);
> +	struct eventfd_mask mask;
> +
> +	if (ctx->flags & EFD_MASK) {
> +		if (count < sizeof(mask))
> +			return -EINVAL;
> +		if (copy_from_user(&mask, buf, sizeof(mask)))
> +			return -EFAULT;
> +		if (mask.events & ~EFD_MASK_VALID_EVENTS)
> +			return -EINVAL;
> +		spin_lock_irq(&ctx->wqh.lock);
> +		memcpy(&ctx->mask, &mask, sizeof(ctx->mask));
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked_poll(&ctx->wqh,
> +				(unsigned long)ctx->mask.events);
> +		spin_unlock_irq(&ctx->wqh.lock);
> +		return sizeof(ctx->mask);
> +	}

`mask' can be made local to this `if' block, which is nicer.

>  	if (count < sizeof(ucnt))
>  		return -EINVAL;
> @@ -293,8 +351,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)
>  	int ret;
>  
>  	spin_lock_irq(&ctx->wqh.lock);
> -	ret = seq_printf(m, "eventfd-count: %16llx\n",
> -			 (unsigned long long)ctx->count);
> +	if (ctx->flags & EFD_MASK) {
> +		ret = seq_printf(m, "eventfd-mask: %x\n",
> +				 (unsigned)ctx->mask.events);
> +	} else {
> +		ret = seq_printf(m, "eventfd-count: %16llx\n",
> +				 (unsigned long long)ctx->count);
> +	}
>  	spin_unlock_irq(&ctx->wqh.lock);

This is a non-back-compatible userspace interface change.  A procfs
file which previously displayed

	eventfd-count: nnnn

can now also display

	eventfd-mask: nnnn

So existing userspace could misbehave.

Please fully describe the proposed interface change in the changelog. 
That description should include the full pathname of the procfs file
and example before-and-after output and a discussion of whether and why
the risk to existing userspace is acceptable.
 
> @@ -412,7 +475,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>  
>  	kref_init(&ctx->kref);
>  	init_waitqueue_head(&ctx->wqh);
> -	ctx->count = count;
> +	if (flags & EFD_MASK) {
> +		ctx->mask.events = 0;
> +		ctx->mask.data = 0;
> +	} else {
> +		ctx->count = count;
> +	}
>  	ctx->flags = flags;
>  
>  	file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index 3c3ef19..b806d2b 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -20,11 +20,12 @@
>   * shared O_* flags.
>   */
>  #define EFD_SEMAPHORE (1 << 0)
> +#define EFD_MASK (1 << 1)

Does this addition comply with the "CAREFUL:" immediately above it?

It would be best to add code comemntary describing what this constant does.

>  #define EFD_CLOEXEC O_CLOEXEC
>  #define EFD_NONBLOCK O_NONBLOCK
>  
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
>  
>  #ifdef CONFIG_EVENTFD

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux