On Wed, Feb 5, 2020 at 12:00 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>
> On Tue, Feb 04, 2020 at 07:06:36PM +0100, Ilya Dryomov wrote:
> > On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
> > >
> > > On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
> > > ...
> > > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > >                                  CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > > >                                  dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > > >                                  CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > > > -               if (err) {
> > > > > -                       if (err == -EOPNOTSUPP) {
> > > > > -                               src_fsc->have_copy_from2 = false;
> > > > > -                               pr_notice("OSDs don't support 'copy-from2'; "
> > > > > -                                         "disabling copy_file_range\n");
> > > > > -                       }
> > > > > +               if (IS_ERR(req)) {
> > > > > +                       err = PTR_ERR(req);
> > > > >                         dout("ceph_osdc_copy_from returned %d\n", err);
> > > > > +
> > > > > +                       /* wait for all queued requests */
> > > > > +                       ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > > +                       ret += reqs_complete * object_size; /* Update copied bytes */
> > > >
> > > > Hi Luis,
> > > >
> > > > Looks like ret is still incremented unconditionally?  What happens
> > > > if there are three OSD requests on the list and the first fails but
> > > > the second and the third succeed?  As is, ceph_osdc_wait_requests()
> > > > will return an error with reqs_complete set to 2...
> > > > >                         if (!ret)
> > > > >                                 ret = err;
> > > >
> > > > ... and we will return 8M instead of an error.
> > >
> > > Right, my assumption was that if a request fails, all subsequent requests
> > > would also fail.  This would allow ret to be updated with the number of
> > > successful requests (x object size), even if the OSDs replies were being
> > > delivered in a different order.  But from your comment I see that my
> > > assumption is incorrect.
> > >
> > > In that case, when shall ret be updated with the number of bytes already
> > > written?  Only after a successful call to ceph_osdc_wait_requests()?
> >
> > I mentioned this in the previous email: you probably want to change
> > ceph_osdc_wait_requests() so that the counter isn't incremented after
> > an error is encountered.
>
> Sure, I've seen that comment.  But it doesn't help either because it's not
> guaranteed that we'll receive the replies from the OSDs in the same order
> we've sent them.  Stopping the counter when we get an error doesn't
> provide us any reliable information (which means I can simply drop that
> counter).

The list is FIFO so even though replies may indeed arrive out of order,
ceph_osdc_wait_requests() will process them in order.  If you stop
counting as soon as an error is encountered, you know for sure that
requests 1 through $COUNTER were successful and can safely multiply it
by object size.
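
Something along these lines, perhaps.  This is just an untested sketch to
illustrate the idea; it assumes the requests get list_add_tail()'ed onto
osd_reqs via r_private_item as they are sent -- substitute whatever linkage
your patch actually uses:

static int ceph_osdc_wait_requests(struct list_head *osd_reqs,
                                   int *completed)
{
        struct ceph_osd_request *req;
        int ret = 0, err;

        *completed = 0;
        while (!list_empty(osd_reqs)) {
                /* head of the list == oldest submitted request */
                req = list_first_entry(osd_reqs, struct ceph_osd_request,
                                       r_private_item);
                list_del_init(&req->r_private_item);
                err = ceph_osdc_wait_request(req->r_osdc, req);
                ceph_osdc_put_request(req);
                if (err && !ret)
                        ret = err;        /* remember the first error */
                if (!ret)
                        (*completed)++;   /* count only while error-free */
        }
        return ret;
}

Then the caller can always do ret += reqs_complete * object_size, error or
not, because everything that was counted is known to have made it to the
OSDs.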
> > > I.e. only after each throttling cycle, when we don't have any requests
> > > pending completion?  In this case, I can simply drop the extra
> > > reqs_complete parameter to the ceph_osdc_wait_requests.
> > >
> > > In your example the right thing to do would be to simply return an error,
> > > I guess.  But then we're assuming that we're losing space in the storage,
> > > as we may have created objects that won't be reachable anymore.
> >
> > Well, that is what I'm getting at -- this needs a lot more
> > consideration.  How errors are dealt with, how file metadata is
> > updated, when do we fall back to plain copy, etc.  Generating stray
> > objects is bad but way better than reporting that e.g. 0..12M were
> > copied when only 0..4M and 8M..12M were actually copied, leaving
> > the user one step away from data loss.  One option is to revert to
> > issuing copy-from requests serially when an error is encountered.
> > Another option is to fall back to plain copy on any error.  Or perhaps
> > we just don't bother with the complexity of parallel copy-from requests
> > at all...
>
> To be honest, I'm starting to lean towards this option.  Reverting to
> serializing requests or to plain copy on error will not necessarily
> prevent the stray objects:
>
> - send a bunch of copy requests
> - wait for them to complete
>   * 1 failed, the other 63 succeeded
> - revert to serialized copies, repeating the previous 64 requests
>   * after a few copies, we get another failure (maybe on the same OSDs)
>     and abort, leaving behind some stray objects from the previous bulk
>     request

Yeah, doing it serially makes the accounting a lot easier.  If you
issue any OSD requests before updating the size, stray objects are
bound to happen -- that's why "how file metadata is updated" is one of
the important considerations.

Thanks,

                Ilya
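
P.S.  Just to illustrate why the serial approach makes the accounting
trivial, a rough sketch.  copy_one_object() is a made-up placeholder for
"build, send and wait for a single copy-from2 request", not a real helper:

static ssize_t copy_objects_serially(struct ceph_inode_info *src_ci,
                                     struct ceph_inode_info *dst_ci,
                                     u64 src_objnum, u64 dst_objnum,
                                     u64 object_size, u64 len)
{
        ssize_t ret = 0;
        int err = 0;

        while (len >= object_size) {
                /* one object at a time: send the request and wait for it */
                err = copy_one_object(src_ci, dst_ci, src_objnum++,
                                      dst_objnum++);
                if (err)
                        break;          /* ret still matches what was copied */
                ret += object_size;
                len -= object_size;
        }

        /* the caller updates the destination i_size from ret, then returns */
        return ret ? ret : err;
}

At any point ret is exactly the number of bytes known to be on the OSDs,
so there is nothing to reconcile when an error shows up.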