Re: [RFC][PATCH] Make io_submit non-blocking

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Ankit,

On Tue, Jul 24, 2012 at 5:11 PM, Ankit Jain <jankit@xxxxxxx> wrote:
>
>
> Currently, io_submit tries to execute the io requests on the
> same thread, which could block because of various reaons (eg.
> allocation of disk blocks). So, essentially, io_submit ends
> up being a blocking call.
>

Ideally filesystem should take care of it e.g. by deferring such time
consuming allocations and return -EIOCBQUEUED immediately. But have
you seen such cases?

> With this patch, io_submit prepares all the kiocbs and then
> adds (kicks) them to ctx->run_list (kicked) in one go and then
> schedules the workqueue. The actual operations are not executed
> on io_submit's process context, so it can return very quickly.
>

With lots of application threads firing continuous IOs, workqueue
threads might become bottleneck and you might have to eventually
develop a priority scheduling. This workqueue was originally designed
for IO retries which is an error path, now consumers of workqueue
might easily increase by 100x.

> This run_list is processed either on a workqueue or in response to
> an io_getevents call. This utilizes the existing retry infrastructure.
>
> It uses override_creds/revert_creds to use the submitting process'
> credentials when processing the iocb request from the workqueue. This
> is required for proper support of quota and reserved block access.
>
> Currently, we use block plugging in io_submit, since most of the IO
> was being done there itself. This patch moves it to aio_kick_handler
> and aio_run_all_iocbs, where the IO gets submitted.
>
> All the tests were run with ext4.
>
> I tested the patch with fio
>  (fio rand-rw-disk.fio --max-jobs=2 --latency-log
>  --bandwidth-log)
>
> **Unpatched**
> read : io=102120KB, bw=618740 B/s, iops=151 , runt=169006msec
> slat (usec): min=275 , max=87560 , avg=6571.88, stdev=2799.57
>
> write: io=102680KB, bw=622133 B/s, iops=151 , runt=169006msec
> slat (usec): min=2 , max=196 , avg=24.66, stdev=20.35
>
> **Patched**
> read : io=102864KB, bw=504885 B/s, iops=123 , runt=208627msec
> slat (usec): min=0 , max=120 , avg= 1.65, stdev= 3.46
>
> write: io=101936KB, bw=500330 B/s, iops=122 , runt=208627msec
> slat (usec): min=0 , max=131 , avg= 1.85, stdev= 3.27
>
> From above, it can be seen that submit latencies improve a lot with the
> patch. The full fio results for the "old"(unpatched) and "new"(patched)
> cases are attached. Results with both ramdisk (*rd*) and disk attached,
> and also the corresponding fio files.
>
> Some variations I tried:
>
> 1. I tried to submit one iocb at a time (lock/unlock ctx->lock), and
> that had good performance for a regular disk, but when I tested with a
> ramdisk (to simulate very fast disk), performance was extremely bad.
> Submitting all the iocbs from an io_submit in one go, restored the
> performance (latencies+bandwidth).
>
> 2. I was earlier trying to use queue_delayed_work with 0 timeout, but
> that worsened the submit latencies a bit but improved bandwidth.
>
> 3. Also, I tried not using aio_queue_work from io_submit call, and instead
> depending on an already scheduled one or the iocbs being run when
> io_getevents gets called. This seemed to give improved perfomance. But
> does this constitute as change of api semantics?
>

I once have observed latency issues with aio_queue_work with lesser
number of threads when I was trying to resubmit IOs on a ramdisk, as
this function introduces a mandatory delay if nobody is waiting on
this iocb. The latencies were high but with large number of threads,
effect was not prominent.

> Signed-off-by: Ankit Jain <jankit@xxxxxxx>
>
> --
> diff --git a/fs/aio.c b/fs/aio.c
> index 71f613c..79801096b 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -563,6 +563,11 @@ static int __aio_put_req(struct kioctx *ctx, struct
> kiocb *req)
>         req->ki_cancel = NULL;
>         req->ki_retry = NULL;
>
> +       if (likely(req->submitter_cred)) {
> +               put_cred(req->submitter_cred);
> +               req->submitter_cred = NULL;
> +       }
> +
>         fput(req->ki_filp);
>         req->ki_filp = NULL;
>         really_put_req(ctx, req);
> @@ -659,6 +664,7 @@ static ssize_t aio_run_iocb(struct kiocb *iocb)
>         struct kioctx   *ctx = iocb->ki_ctx;
>         ssize_t (*retry)(struct kiocb *);
>         ssize_t ret;
> +       const struct cred *old_cred = NULL;
>
>         if (!(retry = iocb->ki_retry)) {
>                 printk("aio_run_iocb: iocb->ki_retry = NULL\n");
> @@ -703,12 +709,19 @@ static ssize_t aio_run_iocb(struct kiocb *iocb)
>                 goto out;
>         }
>
> +       if (iocb->submitter_cred)
> +               /* setup creds */
> +               old_cred = override_creds(iocb->submitter_cred);
> +
>         /*
>          * Now we are all set to call the retry method in async
>          * context.
>          */
>         ret = retry(iocb);
>
> +       if (old_cred)
> +               revert_creds(old_cred);
> +
>         if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
>                 /*
>                  * There's no easy way to restart the syscall since other
> AIO's
> @@ -804,10 +817,14 @@ static void aio_queue_work(struct kioctx * ctx)
>   */
>  static inline void aio_run_all_iocbs(struct kioctx *ctx)
>  {
> +       struct blk_plug plug;
> +
> +       blk_start_plug(&plug);
>         spin_lock_irq(&ctx->ctx_lock);
>         while (__aio_run_iocbs(ctx))
>                 ;
>         spin_unlock_irq(&ctx->ctx_lock);
> +       blk_finish_plug(&plug);
>  }
>
>  /*
> @@ -825,13 +842,16 @@ static void aio_kick_handler(struct work_struct
> *work)
>         mm_segment_t oldfs = get_fs();
>         struct mm_struct *mm;
>         int requeue;
> +       struct blk_plug plug;
>
>         set_fs(USER_DS);
>         use_mm(ctx->mm);
> +       blk_start_plug(&plug);
>         spin_lock_irq(&ctx->ctx_lock);
>         requeue =__aio_run_iocbs(ctx);
>         mm = ctx->mm;
>         spin_unlock_irq(&ctx->ctx_lock);
> +       blk_finish_plug(&plug);
>         unuse_mm(mm);
>         set_fs(oldfs);
>         /*
> @@ -1506,12 +1526,14 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb,
> bool compat)
>
>  static int io_submit_one(struct kioctx *ctx, struct iocb __user
> *user_iocb,
>                          struct iocb *iocb, struct kiocb_batch *batch,
> -                        bool compat)
> +                        bool compat, struct kiocb **req_entry)
>  {
>         struct kiocb *req;
>         struct file *file;
>         ssize_t ret;
>
> +       *req_entry = NULL;
> +
>         /* enforce forwards compatibility on users */
>         if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
>                 pr_debug("EINVAL: io_submit: reserve field set\n");
> @@ -1537,6 +1559,7 @@ static int io_submit_one(struct kioctx *ctx, struct
> iocb __user *user_iocb,
>                 fput(file);
>                 return -EAGAIN;
>         }
> +
>         req->ki_filp = file;
>         if (iocb->aio_flags & IOCB_FLAG_RESFD) {
>                 /*
> @@ -1567,38 +1590,16 @@ static int io_submit_one(struct kioctx *ctx,
> struct iocb __user *user_iocb,
>         req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
>         req->ki_opcode = iocb->aio_lio_opcode;
>
> +       req->submitter_cred = get_current_cred();
> +
>         ret = aio_setup_iocb(req, compat);
>
>         if (ret)
>                 goto out_put_req;
>
> -       spin_lock_irq(&ctx->ctx_lock);
> -       /*
> -        * We could have raced with io_destroy() and are currently holding
> a
> -        * reference to ctx which should be destroyed. We cannot submit IO
> -        * since ctx gets freed as soon as io_submit() puts its reference.
> The
> -        * check here is reliable: io_destroy() sets ctx->dead before
> waiting
> -        * for outstanding IO and the barrier between these two is
> realized by
> -        * unlock of mm->ioctx_lock and lock of ctx->ctx_lock.
> Analogously we
> -        * increment ctx->reqs_active before checking for ctx->dead and
> the
> -        * barrier is realized by unlock and lock of ctx->ctx_lock. Thus
> if we
> -        * don't see ctx->dead set here, io_destroy() waits for our IO to
> -        * finish.
> -        */
> -       if (ctx->dead) {
> -               spin_unlock_irq(&ctx->ctx_lock);
> -               ret = -EINVAL;
> -               goto out_put_req;
> -       }
> -       aio_run_iocb(req);
> -       if (!list_empty(&ctx->run_list)) {
> -               /* drain the run list */
> -               while (__aio_run_iocbs(ctx))
> -                       ;
> -       }
> -       spin_unlock_irq(&ctx->ctx_lock);
> -
>         aio_put_req(req);       /* drop extra ref to req */
> +
> +       *req_entry = req;
>         return 0;
>
>  out_put_req:
> @@ -1613,8 +1614,10 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>         struct kioctx *ctx;
>         long ret = 0;
>         int i = 0;
> -       struct blk_plug plug;
>         struct kiocb_batch batch;
> +       struct kiocb **req_arr = NULL;
> +       int nr_submitted = 0;
> +       int req_arr_cnt = 0;
>
>         if (unlikely(nr < 0))
>                 return -EINVAL;
> @@ -1632,8 +1635,8 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>         }
>
>         kiocb_batch_init(&batch, nr);
> -
> -       blk_start_plug(&plug);
> +       req_arr = kmalloc(sizeof(struct kiocb *) * nr, GFP_KERNEL);
> +       memset(req_arr, 0, sizeof(req_arr));
>
>         /*
>          * AKPM: should this return a partial result if some of the IOs
> were
> @@ -1653,15 +1656,51 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>                         break;
>                 }
>
> -               ret = io_submit_one(ctx, user_iocb, &tmp, &batch, compat);
> +               ret = io_submit_one(ctx, user_iocb, &tmp, &batch, compat,
> +                               &req_arr[i]);
>                 if (ret)
>                         break;
> +               req_arr_cnt++;
>         }
> -       blk_finish_plug(&plug);
>
> +       spin_lock_irq(&ctx->ctx_lock);
> +       /*
> +        * We could have raced with io_destroy() and are currently holding
> a
> +        * reference to ctx which should be destroyed. We cannot submit IO
> +        * since ctx gets freed as soon as io_submit() puts its reference.
> The
> +        * check here is reliable: io_destroy() sets ctx->dead before
> waiting
> +        * for outstanding IO and the barrier between these two is
> realized by
> +        * unlock of mm->ioctx_lock and lock of ctx->ctx_lock.
> Analogously we
> +        * increment ctx->reqs_active before checking for ctx->dead and
> the
> +        * barrier is realized by unlock and lock of ctx->ctx_lock. Thus
> if we
> +        * don't see ctx->dead set here, io_destroy() waits for our IO to
> +        * finish.
> +        */
> +       if (ctx->dead) {
> +               spin_unlock_irq(&ctx->ctx_lock);
> +               for (i = 0; i < req_arr_cnt; i++)
> +                       /* drop i/o ref to the req */
> +                       __aio_put_req(ctx, req_arr[i]);
> +
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       for (i = 0; i < req_arr_cnt; i++) {
> +               struct kiocb *req = req_arr[i];
> +               if (likely(!kiocbTryKick(req)))
> +                       __queue_kicked_iocb(req);
> +               nr_submitted++;
> +       }
> +       if (likely(nr_submitted > 0))
> +               aio_queue_work(ctx);
> +       spin_unlock_irq(&ctx->ctx_lock);
> +
> +out:
>         kiocb_batch_free(ctx, &batch);
> +       kfree(req_arr);
>         put_ioctx(ctx);
> -       return i ? i : ret;
> +       return nr_submitted ? nr_submitted : ret;
>  }
>
>  /* sys_io_submit:
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index b1a520e..bcd6a5e 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -124,6 +124,8 @@ struct kiocb {
>          * this is the underlying eventfd context to deliver events to.
>          */
>         struct eventfd_ctx      *ki_eventfd;
> +
> +       const struct cred       *submitter_cred;
>  };
>
>  #define is_sync_kiocb(iocb)    ((iocb)->ki_key == KIOCB_SYNC_KEY)
>
> --
> Ankit Jain
> SUSE Labs

--
Rajat Sharma
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux