Re: [PATCH] inotify: Extend ioctl to allow to request id of new watch descriptor

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Added CC to linux-api]

On Thu 08-02-18 18:07:37, Kirill Tkhai wrote:
> Watch descriptor is id of the watch created by inotify_add_watch().
> It is allocated in inotify_add_to_idr(), and takes the numbers
> starting from 1. Every new inotify watch obtains next available
> number (usually, old + 1), as served by idr_alloc_cyclic().
> 
> CRIU (Checkpoint/Restore In Userspace) project supports inotify
> files, and restores watched descriptors with the same numbers,
> they had before dump. Since there was no kernel support, we
> had to use cycle to add a watch with specific descriptor id:
> 
> 	while (1) {
> 		int wd;
> 
> 		wd = inotify_add_watch(inotify_fd, path, mask);
> 		if (wd < 0) {
> 			break;
> 		} else if (wd == desired_wd_id) {
> 			ret = 0;
> 			break;
> 		}
> 
> 		inotify_rm_watch(inotify_fd, wd);
> 	}
> 
> (You may find the actual code at the below link:
>  https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)
> 
> The cycle is suboptiomal and very expensive, but since there is no better
> kernel support, it was the only way to restore that. Happily, we had met
> mostly descriptors with small id, and this approach had worked somehow.
> 
> But recent time containers with inotify with big watch descriptors
> begun to come, and this way stopped to work at all. When descriptor id
> is something about 0x34d71d6, the restoring process spins in busy loop
> for a long time, and the restore hungs and delay of migration from node
> to node could easily be watched.
> 
> This patch aims to solve this problem. It introduces new ioctl
> INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
> watch descriptor from userspace. It simply calls idr_set_cursor() primitive
> to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
> will return this id, if it is not occupied. This is the way which is
> used to restore some other resources from userspace. For example,
> /proc/sys/kernel/ns_last_pid works the same for task pids.
> 
> The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
> may exclude it. The only change in generic inotify part is idr_alloc_cyclic()
> end argument. We had 0 there, and idr subsystem replaced it with INT_MAX
> in idr_get_free(). So, the max possible id was INT_MAX (see idr_get_free()
> again).
> 
> Since I need INOTIFY_IDR_END to check ioctl's third argument, it's better
> it's defined as positive number. But when not-zero value is passed
> to idr_get_free(), this function decrements it. Also, idr_alloc_cyclic()
> defined @end as int argument. So, it's impossible to pass positive @end
> argument to idr_alloc_cyclic() to get INT_MAX id. And after this patch
> inotify watch descriptors ids will take numbers [1, INT_MAX-1], INT_MAX
> will be unavailable.
> 
> Signed-off-by: Kirill Tkhai <ktkhai@xxxxxxxxxxxxx>

I'm not thrilled by this but OTOH I also don't know about a better option.
The patch as such looks OK to me so unless someone objects I'll take it to
my tree in a few days.

								Honza


> ---
>  fs/notify/inotify/inotify_user.c |   19 ++++++++++++++++++-
>  include/uapi/linux/inotify.h     |    8 ++++++++
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index 5c29bf16814f..3c824e252c84 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -44,6 +44,9 @@
>  
>  #include <asm/ioctls.h>
>  
> +#define INOTIFY_IDR_START	1
> +#define INOTIFY_IDR_END		S32_MAX
> +
>  /* configurable via /proc/sys/fs/inotify/ */
>  static int inotify_max_queued_events __read_mostly;
>  
> @@ -285,6 +288,7 @@ static int inotify_release(struct inode *ignored, struct file *file)
>  static long inotify_ioctl(struct file *file, unsigned int cmd,
>  			  unsigned long arg)
>  {
> +	struct inotify_group_private_data *data __maybe_unused;
>  	struct fsnotify_group *group;
>  	struct fsnotify_event *fsn_event;
>  	void __user *p;
> @@ -293,6 +297,7 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
>  
>  	group = file->private_data;
>  	p = (void __user *) arg;
> +	data = &group->inotify_data;
>  
>  	pr_debug("%s: group=%p cmd=%u\n", __func__, group, cmd);
>  
> @@ -307,6 +312,17 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
>  		spin_unlock(&group->notification_lock);
>  		ret = put_user(send_len, (int __user *) p);
>  		break;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +	case INOTIFY_IOC_SETNEXTWD:
> +		ret = -EINVAL;
> +		if (arg >= INOTIFY_IDR_START && arg < INOTIFY_IDR_END) {
> +			spin_lock(&data->idr_lock);
> +			idr_set_cursor(&data->idr, (unsigned int)arg);
> +			spin_unlock(&data->idr_lock);
> +			ret = 0;
> +		}
> +		break;
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
>  	}
>  
>  	return ret;
> @@ -349,7 +365,8 @@ static int inotify_add_to_idr(struct idr *idr, spinlock_t *idr_lock,
>  	idr_preload(GFP_KERNEL);
>  	spin_lock(idr_lock);
>  
> -	ret = idr_alloc_cyclic(idr, i_mark, 1, 0, GFP_NOWAIT);
> +	ret = idr_alloc_cyclic(idr, i_mark, INOTIFY_IDR_START,
> +			       INOTIFY_IDR_END, GFP_NOWAIT);
>  	if (ret >= 0) {
>  		/* we added the mark to the idr, take a reference */
>  		i_mark->wd = ret;
> diff --git a/include/uapi/linux/inotify.h b/include/uapi/linux/inotify.h
> index 5474461683db..245489342c14 100644
> --- a/include/uapi/linux/inotify.h
> +++ b/include/uapi/linux/inotify.h
> @@ -71,5 +71,13 @@ struct inotify_event {
>  #define IN_CLOEXEC O_CLOEXEC
>  #define IN_NONBLOCK O_NONBLOCK
>  
> +/*
> + * ioctl numbers: inotify uses 'I' prefix for all ioctls,
> + * except historical FIONREAD, which based on 'T'.
> + *
> + * INOTIFY_IOC_SETNEXTWD: set desired number of next created
> + * watch descriptor.
> + */
> +#define INOTIFY_IOC_SETNEXTWD	_IOW('I', 0, __s32)
>  
>  #endif /* _UAPI_LINUX_INOTIFY_H */
> 
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux