Gregory Farnum <gfarnum@xxxxxxxxxx> writes:

> On Fri, Jan 11, 2019 at 3:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>>
>> Gregory Farnum <gfarnum@xxxxxxxxxx> writes:
>>
>> > On Mon, Jan 7, 2019 at 7:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >>
>> >> While working on implementing copy_file_range(2) for the kernel CephFS
>> >> client, I found an issue with truncated files that is described in [1].
>> >> The TL;DR is that, when executing a 'copy-from' OSD operation, both
>> >> truncate_seq and truncate_size are copied from the base object into the
>> >> target object. This, at least in the context of copy_file_range, doesn't
>> >> make sense and will cause problems if, for example, the target file had
>> >> previously been truncated, i.e. target.truncate_seq > base.truncate_seq
>> >> (see test case in [1]).
>> >>
>> >> I've proposed a fix [2] but after discussing it with Gregory it sounds
>> >> more like a hack than a real solution. Basically my patch simply adds a
>> >> new flag to the 'copy-from' operation which a client can use so that
>> >> truncate_{seq,size} aren't copied from the base object (and are *not*
>> >> changed with the copy operation).
>> >>
>> >> Having my PR [2] tagged as 'pending-discussion', I decided to try to
>> >> kick off this discussion here in the mailing-list, maybe grabbing
>> >> attention from other people with a deeper understanding of the OSD
>> >> internals.
>> >>
>> >> Gregory's preferred solution would be to have the copy-from Op allow
>> >> setting truncate_seq and truncate_size values directly. Unfortunately,
>> >> there seems to be no easy way of changing the interfaces to allow this
>> >> to happen, as the ceph_osd_op union (in rados.h) doesn't seem to be able
>> >> to accommodate these 2 extra fields in copy_from. So, my initial
>> >> questions would be:
>> >>
>> >> - What would be the options for extending copy-from to include these 2
>> >>   extra fields?
>> >
>> > It just occurred to me — and let's be clear, this is a TERRIBLE hack —
>> > that we might be able to handle this by generating compound
>> > operations: the first op in the sequence uses copy-from either as it
>> > stands or in a changed one which doesn't copy the truncate values, and
>> > the second writes the correct new truncate values. I'm not sure if
>> > we've used compound ops with copy-from though, so it'd take some
>> > testing to make sure that works correctly and I've not forgotten some
>> > new rule.
>> >
>> >> - I believe I understand the usage of truncate_{seq,size}, but it's
>> >>   really not clear to me whether there are any scenarios where we *do*
>> >>   want to modify truncate_{seq,size} while doing a copy-from. In the
>> >>   case of CephFS I don't think a copy-from will ever truncate a file,
>> >>   so the values could be left unchanged. But would the obvious solution
>> >>   of simply *never* copying these fields be a valid solution? (I
>> >>   suspect the answer is 'no' :-)
>> >
>> > This is all from memory and should be checked by someone, but the
>> > truncate_seq and truncate_size are used approximately as follows (I'm
>> > definitely leaving out some of the manipulation for dealing with
>> > multiple objects):
>> >
>> > 1) client sends truncate op to MDS
>> > 2) MDS updates metadata and sends back new truncate_seq and
>> >    truncate_size to client, which saves them
>> > 3) on a data op, the client sends the truncate_seq and truncate_size
>> >    to the OSD along with the read or write request.
>> > 4) if the client has a newer truncate_seq than the OSD knows about,
>> >    the OSD will respect the provided truncate_size, which may mean
>> >    returning less data than it has, or performing a truncate on the
>> >    object itself.
>> >
>> > But now I know I'm forgetting something important here, because I
>> > can't remember how we deal with existing truncate_seq = 1,
>> > truncate_size = 10, then a truncate to 0 and a second truncate to 15
>> > (which results in the client providing truncate_seq 3, truncate_size 15
>> > to an OSD with a size-10 object that should be ignored).
>> >
>> > But leaving aside that issue, if we issue the OSD a copy-from op while
>> > we have a newer truncate_seq, it needs to update to our truncate_seq.
>> > Because we just gave it new data which it should return to any
>> > follow-on readers!
>>
>> Aha! I was missing this obvious scenario. /me facepalms
>> Of course we _need_ to update truncate_seq in this Op!
>>
>> > I am now wondering if maybe we're okay just having a second copy-from
>> > op that DOES fill in values the client is already sending — it looks
>> > like the existing copy-from op got bug-fixed to copy the truncate_seq
>> > in 0.94.4 because that's necessary for a CephFS cache pool to function,
>> > but obviously in this case we need different behavior.
>>
>> If I understand you correctly, you're saying that a new struct should be
>> added to the union in ceph_osd_op, something like:
>>
>>         struct {
>>                 __le64 snapid;
>>                 __le64 src_version;
>>                 __u8 flags;
>>                 __le32 src_fadvise_flags;
>>                 __le64 truncate_size;
>>                 __le32 truncate_seq;
>>         } __attribute__ ((packed)) copy_from2;
>>
>> But this will cause problems, because it's increasing the union size,
>> right?
>
> That is one thing I was discussing.
>
> However, the client already updates the truncate_size and truncate_seq
> on some (most? all?) write ops. (In userspace, Client::_write() calls
> filer->write_trunc() and passes in the truncate_seq and
> truncate_size.) So that may actually not be an issue if we just create
> a copy_from2 that looks like the existing copy_from except it doesn't
> copy the truncate_seq. See
> https://github.com/ceph/ceph/commit/6f9ee7961eee9ee3a61a07cbe0d8d289ee98fa9a.

But in that case I don't see the advantage of creating a new copy_from
operation -- why not simply add an extra flag to the current operation,
as I proposed in my PR?
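
To make sure we're comparing the same things, below is how I picture the
behaviours we've been discussing, as a rough (and untested) C model.  To
be clear, this is *not* the actual OSD code -- the flag and type names
are all invented -- it's only meant to show where the truncate metadata
would come from in each case:

/*
 * Toy model of the truncate metadata handling in 'copy-from'.  This is
 * NOT the OSD implementation; just a sketch of the three behaviours
 * discussed in this thread.  All names are made up.
 */
#include <stdint.h>
#include <stdio.h>

struct obj_meta {
	uint32_t truncate_seq;
	uint64_t truncate_size;
};

enum copy_mode {
	COPY_FROM_CURRENT,	/* today: clone truncate_{seq,size} from src */
	COPY_FROM_SKIP,		/* my PR [2]: leave the target's values as-is */
	COPY_FROM2_EXPLICIT,	/* copy_from2: client supplies the values,
				 * just like it already does for plain writes */
};

static void do_copy_from(struct obj_meta *dst, const struct obj_meta *src,
			 enum copy_mode mode,
			 uint32_t client_seq, uint64_t client_size)
{
	switch (mode) {
	case COPY_FROM_CURRENT:
		dst->truncate_seq = src->truncate_seq;
		dst->truncate_size = src->truncate_size;
		break;
	case COPY_FROM_SKIP:
		/* truncate metadata on the target object is untouched */
		break;
	case COPY_FROM2_EXPLICIT:
		dst->truncate_seq = client_seq;
		dst->truncate_size = client_size;
		break;
	}
}

int main(void)
{
	/* scenario from [1]: the target was truncated more recently than
	 * the source, so target.truncate_seq > base.truncate_seq */
	struct obj_meta src = { .truncate_seq = 1, .truncate_size = 0 };
	struct obj_meta dst = { .truncate_seq = 2, .truncate_size = 0 };

	do_copy_from(&dst, &src, COPY_FROM_CURRENT, dst.truncate_seq,
		     dst.truncate_size);
	printf("current copy-from: dst truncate_seq went from 2 to %u\n",
	       dst.truncate_seq);	/* it goes backwards -- the bug */
	return 0;
}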

Cheers,
--
Luis

>
> Zheng, is this something you can advise on more carefully than I? :)
> -Greg
>
>>
>> > I really don't know the details of the implementation around this
>> > enough any more to say with certainty what's going on. Hopefully you
>> > can investigate or one of the more full-time CephFS people knows the
>> > answer? :)
>>
>> Having this issue sorted out is definitely something I would love to see
>> in nautilus. But I'm also not comfortable with the OSD internals, and
>> figuring out all the details of dealing with a new Op seems like a
>> *huge* task to me. I guess I could try, but it will take me a while :-/
>>
>> Cheers,
>> --
>> Luis
>>
>> > -Greg
>> >
>> >>
>> >> Another problem is that the client will also need to figure out if the
>> >> OSDs have this issue fixed so that it can decide whether to use the Op
>> >> or not. My PR adds a new OSD_COPY_FILE_RANGE feature, overlapping
>> >> SERVER_NAUTILUS, but Greg's preference is to use the osdmap. The kernel
>> >> client does have _some_ support for osdmap but I believe some extra bits
>> >> would be required to use this map in this case (although I would need to
>> >> look closer to figure out what and how to do that).
>> >>
>> >> Anyway, I'm looking for ideas on how to sort this out so that we can
>> >> have copy_file_range fixed in CephFS.
>> >>
>> >> [1] https://tracker.ceph.com/issues/37378
>> >> [2] https://github.com/ceph/ceph/pull/25374
>> >>
>> >> Cheers,
>> >> --
>> >> Luis
>> >
>
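
P.S. In case it helps anyone who wants to poke at this without digging
out the full test case from [1], the shape of the problem is roughly the
following.  This is a purely illustrative, untested sketch: the paths
are placeholders, both files need to live on a CephFS mount, and the
copy size/alignment may need tweaking for your layout so that the
OSD-side 'copy-from' path is actually exercised.

/*
 * Sketch of the problematic copy_file_range(2) scenario -- see [1] for
 * the real test case.  Build with: cc -o repro repro.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define OBJ_SIZE (4u << 20)	/* assume the default 4M object size */

static char buf[OBJ_SIZE];

int main(void)
{
	memset(buf, 'x', sizeof(buf));

	int src = open("/mnt/cephfs/src", O_RDWR | O_CREAT | O_TRUNC, 0644);
	int dst = open("/mnt/cephfs/dst", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* populate both files, then truncate the *target* so that its
	 * truncate_seq moves ahead of the source's */
	pwrite(src, buf, sizeof(buf), 0);
	pwrite(dst, buf, sizeof(buf), 0);
	ftruncate(dst, 0);

	/* copy a whole object's worth of data; with the current 'copy-from'
	 * behaviour the target object inherits the source's (older)
	 * truncate_{seq,size}, which is the situation that misbehaves */
	loff_t off_in = 0, off_out = 0;
	ssize_t ret = copy_file_range(src, &off_in, dst, &off_out,
				      sizeof(buf), 0);
	if (ret < 0) {
		perror("copy_file_range");
		return 1;
	}

	printf("copied %zd bytes; now compare src and dst contents\n", ret);
	close(src);
	close(dst);
	return 0;
}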