On Wed, Feb 5, 2020 at 12:00 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>
> On Tue, Feb 04, 2020 at 07:06:36PM +0100, Ilya Dryomov wrote:
> > On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
> > >
> > > On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
> > > ...
> > > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > >                                  CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > > >                                  dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > > >                                  CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > > > -               if (err) {
> > > > > -                       if (err == -EOPNOTSUPP) {
> > > > > -                               src_fsc->have_copy_from2 = false;
> > > > > -                               pr_notice("OSDs don't support 'copy-from2'; "
> > > > > -                                         "disabling copy_file_range\n");
> > > > > -                       }
> > > > > +               if (IS_ERR(req)) {
> > > > > +                       err = PTR_ERR(req);
> > > > >                         dout("ceph_osdc_copy_from returned %d\n", err);
> > > > > +
> > > > > +                       /* wait for all queued requests */
> > > > > +                       ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > > +                       ret += reqs_complete * object_size; /* Update copied bytes */
> > > >
> > > > Hi Luis,
> > > >
> > > > Looks like ret is still incremented unconditionally?  What happens
> > > > if there are three OSD requests on the list and the first fails but
> > > > the second and the third succeed?  As is, ceph_osdc_wait_requests()
> > > > will return an error with reqs_complete set to 2...
> > > > >                         if (!ret)
> > > > >                                 ret = err;
> > > >
> > > > ... and we will return 8M instead of an error.
> > >
> > > Right, my assumption was that if a request fails, all subsequent requests
> > > would also fail.  This would allow ret to be updated with the number of
> > > successful requests (x object size), even if the OSDs replies were being
> > > delivered in a different order.  But from your comment I see that my
> > > assumption is incorrect.
> > >
> > > In that case, when shall ret be updated with the number of bytes already
> > > written?  Only after a successful call to ceph_osdc_wait_requests()?
> >
> > I mentioned this in the previous email: you probably want to change
> > ceph_osdc_wait_requests() so that the counter isn't incremented after
> > an error is encountered.
>
> Sure, I've seen that comment.  But it doesn't help either because it's not
> guaranteed that we'll receive the replies from the OSDs in the same order
> we've sent them.  Stopping the counter when we get an error doesn't
> provide us any reliable information (which means I can simply drop that
> counter).

The list is FIFO so even though replies may indeed arrive out of order,
ceph_osdc_wait_requests() will process them in order.  If you stop
counting as soon as an error is encountered, you know for sure that
requests 1 through $COUNTER were successful and can safely multiply it
by object size.
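
Something along these lines, perhaps.  This is just an untested sketch to
illustrate the idea; it assumes the requests get list_add_tail()'ed onto
osd_reqs via r_private_item as they are sent -- substitute whatever linkage
your patch actually uses:

static int ceph_osdc_wait_requests(struct list_head *osd_reqs,
                                   int *completed)
{
        struct ceph_osd_request *req;
        int ret = 0, err;

        *completed = 0;
        while (!list_empty(osd_reqs)) {
                /* head of the list == oldest submitted request */
                req = list_first_entry(osd_reqs, struct ceph_osd_request,
                                       r_private_item);
                list_del_init(&req->r_private_item);
                err = ceph_osdc_wait_request(req->r_osdc, req);
                ceph_osdc_put_request(req);
                if (err && !ret)
                        ret = err;        /* remember the first error */
                if (!ret)
                        (*completed)++;   /* count only while error-free */
        }
        return ret;
}

Then the caller can always do ret += reqs_complete * object_size, error or
not, because everything that was counted is known to have made it to the
OSDs.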
> > > I.e. only after each throttling cycle, when we don't have any requests
> > > pending completion?  In this case, I can simply drop the extra
> > > reqs_complete parameter to the ceph_osdc_wait_requests.
> > >
> > > In your example the right thing to do would be to simply return an error,
> > > I guess.  But then we're assuming that we're losing space in the storage,
> > > as we may have created objects that won't be reachable anymore.
> >
> > Well, that is what I'm getting at -- this needs a lot more
> > consideration.  How errors are dealt with, how file metadata is
> > updated, when do we fall back to plain copy, etc.  Generating stray
> > objects is bad but way better than reporting that e.g. 0..12M were
> > copied when only 0..4M and 8M..12M were actually copied, leaving
> > the user one step away from data loss.  One option is to revert to
> > issuing copy-from requests serially when an error is encountered.
> > Another option is to fall back to plain copy on any error.  Or perhaps
> > we just don't bother with the complexity of parallel copy-from requests
> > at all...
>
> To be honest, I'm starting to lean towards this option.  Reverting to
> serializing requests or to plain copy on error will not necessarily
> prevent the stray objects:
>
> - send a bunch of copy requests
> - wait for them to complete
>   * 1 failed, the other 63 succeeded
> - revert to serialized copies, repeating the previous 64 requests
>   * after a few copies, we get another failure (maybe on the same OSDs)
>     and abort, leaving behind some stray objects from the previous bulk
>     request

Yeah, doing it serially makes the accounting a lot easier.  If you
issue any OSD requests before updating the size, stray objects are
bound to happen -- that's why "how file metadata is updated" is one of
the important considerations.

Thanks,

                Ilya
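
P.S.  Just to illustrate why the serial approach makes the accounting
trivial, a rough sketch.  copy_one_object() is a made-up placeholder for
"build, send and wait for a single copy-from2 request", not a real helper:

static ssize_t copy_objects_serially(struct ceph_inode_info *src_ci,
                                     struct ceph_inode_info *dst_ci,
                                     u64 src_objnum, u64 dst_objnum,
                                     u64 object_size, u64 len)
{
        ssize_t ret = 0;
        int err = 0;

        while (len >= object_size) {
                /* one object at a time: send the request and wait for it */
                err = copy_one_object(src_ci, dst_ci, src_objnum++,
                                      dst_objnum++);
                if (err)
                        break;          /* ret still matches what was copied */
                ret += object_size;
                len -= object_size;
        }

        /* the caller updates the destination i_size from ret, then returns */
        return ret ? ret : err;
}

At any point ret is exactly the number of bytes known to be on the OSDs,
so there is nothing to reconcile when an error shows up.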