On Mon, Jan 7, 2019 at 7:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>
> Hi,
>
> While working on implementing copy_file_range(2) for the kernel CephFS
> client, I found an issue with truncated files that is described in [1].
> The TL;DR is that, when executing a 'copy-from' OSD operation, both
> truncate_seq and truncate_size are copied from the base object into
> the target object. This, at least in the context of copy_file_range,
> doesn't make sense and will cause problems if, for example, the target
> file had previously been truncated, i.e. target.truncate_seq >
> base.truncate_seq (see test case in [1]).
>
> I've proposed a fix [2], but after discussing it with Gregory it
> sounds more like a hack than a real solution. Basically, my patch
> simply adds a new flag to the 'copy-from' operation which a client can
> use so that truncate_{seq,size} aren't copied from the base object
> (and are *not* changed by the copy operation).
>
> Having my PR [2] tagged as 'pending-discussion', I decided to try to
> kick off this discussion here in the mailing list, maybe grabbing the
> attention of other people with a deeper understanding of the OSD
> internals.
>
> Gregory's preferred solution would be to have the copy-from Op allow
> setting the truncate_seq and truncate_size values directly.
> Unfortunately, there seems to be no easy way of changing the
> interfaces to allow this to happen, as the ceph_osd_op union (in
> rados.h) doesn't seem to be able to accommodate these 2 extra fields
> in copy_from. So, my initial questions would be:
>
> - What would be the options for extending copy-from to include these
>   2 extra fields?

It just occurred to me (and let's be clear, this is a TERRIBLE hack)
that we might be able to handle this by generating compound operations:
the first op in the sequence uses copy-from, either as it stands or in
a changed form that doesn't copy the truncate values, and the second
writes the correct new truncate values (see the sketch in the P.S.
below). I'm not sure we've ever used compound ops with copy-from,
though, so it would take some testing to make sure that works correctly
and that I've not forgotten some new rule.

> - I believe I understand the usage of truncate_{seq,size}, but it's
>   really not clear to me whether there are any scenarios where we *do*
>   want to modify truncate_{seq,size} while doing a copy-from. In the
>   case of CephFS I don't think a copy-from will ever truncate a file,
>   so the values could be left unchanged. But would the obvious
>   solution of simply *never* copying these fields be a valid solution?
>   (I suspect the answer is 'no' :-)

This is all from memory and should be checked by someone, but
truncate_seq and truncate_size are used approximately as follows (I'm
definitely leaving out some of the manipulation for dealing with
multiple objects):

1) The client sends a truncate op to the MDS.
2) The MDS updates the metadata and sends the new truncate_seq and
   truncate_size back to the client, which caches them.
3) On a data op, the client sends the truncate_seq and truncate_size
   to the OSD along with the read or write request.
4) If the client has a newer truncate_seq than the OSD knows about,
   the OSD will respect the provided truncate_size, which may mean
   returning less data than it has, or performing a truncate on the
   object itself.
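In kernel-client terms, step 3 ends up looking roughly like the sketch
below. As far as I remember, osd_req_op_extent_init() and the
ceph_inode_info fields are real mainline code, but the wrapper around
them is purely illustrative, so double-check it against the actual
fs/ceph code:

#include <linux/ceph/osd_client.h>
#include "super.h"	/* fs/ceph/super.h: struct ceph_inode_info */

/*
 * Illustrative sketch only: a data op carries the truncate metadata
 * the client cached from the MDS (steps 2-3 above).  If this seq is
 * newer than what the OSD has stored, the OSD honors truncate_size:
 * clamping a read, or truncating the object before applying a write
 * (step 4).
 */
static void init_data_op(struct ceph_osd_request *req,
			 struct ceph_inode_info *ci, u64 off, u64 len)
{
	osd_req_op_extent_init(req, 0, CEPH_OSD_OP_WRITE, off, len,
			       ci->i_truncate_size, ci->i_truncate_seq);
}

The copy-from problem is exactly that this per-op plumbing doesn't
exist for copy-from: the op has nowhere to put those two values, so
the OSD takes them from the source object instead.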
But now I know I'm forgetting something important here, because I
can't remember how we deal with an existing truncate_seq = 1,
truncate_size = 10, followed by a truncate to 0 and then a second
truncate to 15 (which results in the client providing truncate_seq = 3,
truncate_size = 15 to an OSD holding a size-10 object whose data should
be ignored).

But leaving that issue aside: if we issue the OSD a copy-from op while
we have a newer truncate_seq, the OSD needs to update to our
truncate_seq, because we just gave it new data which it should return
to any follow-on readers!

I am now wondering if maybe we're okay just having a second copy-from
op that DOES fill in the values the client is already sending. It looks
like the existing copy-from op was bug-fixed in v0.94.4 to copy the
truncate_seq, because that's necessary for a CephFS cache pool to
function, but obviously in this case we need different behavior. I no
longer know the details of the implementation around this well enough
to say with certainty what's going on; hopefully you can investigate,
or one of the more full-time CephFS people knows the answer. :)
-Greg

> Another problem is that the client will also need to figure out
> whether the OSDs have this issue fixed, so that it can decide whether
> to use the Op or not. My PR adds a new OSD_COPY_FILE_RANGE feature,
> overlapping SERVER_NAUTILUS, but Greg's preference is to use the
> osdmap. The kernel client does have _some_ support for the osdmap,
> but I believe some extra bits would be required to use this map in
> this case (although I would need to look closer to figure out what
> and how to do that).
>
> Anyway, I'm looking for ideas on how to sort this out so that we can
> have copy_file_range fixed in CephFS.
>
> [1] https://tracker.ceph.com/issues/37378
> [2] https://github.com/ceph/ceph/pull/25374
>
> Cheers,
> --
> Luis
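P.S. To make the compound-op idea above slightly more concrete, here
is the shape of what I'm picturing on the kernel-client side. Treat
this strictly as a sketch: osd_req_op_copy_from_init() stands in for
whatever copy-from initializer the kernel client ends up with (its
name, signature, and return value here are hypothetical), and whether
a zero-length write is really enough to deliver the new truncate
values is exactly the part that would need testing:

#include <linux/ceph/osd_client.h>
#include <linux/ceph/rados.h>

/*
 * Sketch only: one compound OSD request carrying two ops (the request
 * must have been allocated with room for both).  Op 0 is the copy-from
 * as the client would issue it today; the initializer's signature is
 * hypothetical.  Op 1 is a zero-length write whose extent carries the
 * truncate_seq/truncate_size the target object should actually end up
 * with, overwriting whatever op 0 copied over from the source.
 */
static int copy_from_then_fix_trunc(struct ceph_osd_request *req,
				    u64 src_snapid, u64 src_version,
				    struct ceph_object_id *src_oid,
				    struct ceph_object_locator *src_oloc,
				    u32 truncate_seq, u64 truncate_size)
{
	int ret;

	/* Op 0: copy the object data (hypothetical initializer). */
	ret = osd_req_op_copy_from_init(req, src_snapid, src_version,
					src_oid, src_oloc, 0, 0);
	if (ret)
		return ret;

	/* Op 1: deliver the truncate values the target should keep. */
	osd_req_op_extent_init(req, 1, CEPH_OSD_OP_WRITE, 0, 0,
			       truncate_size, truncate_seq);
	return 0;
}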