Re: OSD 'copy-from' operation and truncate_seq value

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



(Sorry for the delay here, but let's bring it back!)

On Tue, Jan 15, 2019 at 4:27 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>
> Gregory Farnum <gfarnum@xxxxxxxxxx> writes:
>
> > On Fri, Jan 11, 2019 at 3:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
> >>
> >> Gregory Farnum <gfarnum@xxxxxxxxxx> writes:
> >>
> >> > On Mon, Jan 7, 2019 at 7:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> While working on implementing copy_file_range(2) for the kernel CephFS
> >> >> client, I found an issue with truncated files that is described in [1].
> >> >> The TL;DR is that, when executing a 'copy-from' OSD operation, both
> >> >> truncate_seq and truncate_size are copied from the base object into the
> >> >> target object. This, at least in the context of copy_file_range, doesn't
> >> >> make sense and will cause problems if, for example, the target file had
> >> >> previously been truncated, i.e. target.truncate_seq > base.truncate_seq
> >> >> (see test case in [1]).
> >> >>
> >> >> I've proposed a fix [2] but after discussing it with Gregory it sounds
> >> >> more like a hack than a real solution.  Basically my patch simply adds a
> >> >> new flag to the 'copy-from' operation which a client can use so that
> >> >> truncate_{seq,size} aren't copied from the base object (and are *not*
> >> >> changed with the copy operation).
> >> >>
> >> >> Having my PR [2] tagged as 'pending-discussion', I decided to try to
> >> >> kick-off this discussion here in the mailing-list, maybe grabbing
> >> >> attention from other people with a deeper understanding of the OSD
> >> >> internals.
> >> >>
> >> >> Gregory's preferred solution would be to have the copy-from Op to allow
> >> >> to set truncate_seq and truncate_size values directly.  Unfortunately,
> >> >> there seems to be no easy way of changing the interfaces to allow this
> >> >> to happen as the ceph_osd_op union (in rados.h) doesn't seem to be able
> >> >> to accommodate these 2 extra fields in copy_from.  So, my initial
> >> >> questions would be:
> >> >>
> >> >> - What would be the options for extending copy-from to include this 2
> >> >>   extra fields?
> >> >
> >> > It just occurred to me — and let's be clear, this is a TERRIBLE hack —
> >> > that we might be able to handle this by generating compound
> >> > operations: the first op in sequence uses copy-from either as it
> >> > stands or in a changed one which doesn't copy the truncate values, and
> >> > the second writes the correct new truncate values. I'm not sure if
> >> > we've used compound ops with copy-from though so it'd take some
> >> > testing to make sure that works correctly and I've not forgotten some
> >> > new rule.
> >> >
> >> >> - I believe I understand the usage of truncate_{seq,size}, but it's
> >> >>   really not clear to me whether there are any scenarios where we *do*
> >> >>   want to modify truncate_{seq,size} while doing a copy-from.  In the
> >> >>   case of CephFS I don't think a copy-from will ever truncate a file,
> >> >>   so the values could be left unchanged.  But would the obvious solution
> >> >>   of simply *never* copying these fields be a valid solution?  (I
> >> >>   suspect the answer is 'no' :-)
> >> >
> >> > This is all from memory and should be checked by someone, but the
> >> > truncate_seq and truncate_size are used approximately as follows: (I'm
> >> > definitely leaving out some of the manipulation for dealing with
> >> > multiple objects)
> >> >
> >> > 1) client sends truncate op to MDS
> >> > 2) MDS updates metadata and sends back new truncate_seq and
> >> > truncate_size to client, which saves them
> >> > 3) on a data op, the client sends the truncate_seq and truncate_size
> >> > to the OSD along with the read or write request.
> >> > 4) if the client has a newer truncate_seq than the OSD knows about,
> >> > the OSD will respect the provided truncate_size, which may mean
> >> > returning less data than it has, or performing a truncate on the
> >> > object itself.
> >> >
> >> > But now I know I'm forgetting something important here, because I
> >> > can't remember how we deal with existing truncate_seq = 1
> >> > truncate_size = 10, then a truncate to 0 and a second truncate to 15
> >> > (which results in the client providing truncate_seq 3 truncate_size 15
> >> > to an OSD with a size 10 object that should be ignored).
> >> >
> >> > But leaving aside that issue, if we issue the OSD a copy-from op while
> >> > we have a newer truncate-seq, it needs to update to our truncate-seq.
> >> > Because we just gave it new data which it should return to any
> >> > follow-on readers!
> >>
> >> Aha!  I was missing this obvious scenario.  /me facepalms
> >> Of course we _need_ to update truncate_seq in this Op!
> >>
> >> > I am now wondering if maybe we're okay just having a second copy-from
> >> > op that DOES fill in values the client is already sending — it looks
> >> > like the existing copy-from op got bug-fixed to copy the truncate_seq
> >> > in .94.4 because that's necessary for a CephFS cache pool to function,
> >> > but obviously in this case we need different behavior.
> >>
> >> If I understand you correctly, you're saying that a new struct should be
> >> added to the union in ceph_osd_op, something like:
> >>
> >>         struct {
> >>                 __le64 snapid;
> >>                 __le64 src_version;
> >>                 __u8 flags;
> >>                 __le32 src_fadvise_flags;
> >>                 __le64 truncate_size;
> >>                 __le32 truncate_seq;
> >>         } __attribute__ ((packed)) copy_from2;
> >>
> >> But this will cause problems, because it's increasing the union size,
> >> right?
> >
> > That is one thing I was discussing.
> >
> > However, the client already updates the truncate_size and truncate_seq
> > on some (most? all?) write ops. (In userspace, Client::_write() calls
> > filer->write_trunc() and passes in the truncate_seq and
> > truncate_size). So that may actually not be an issue if we just create
> > a copy_from2 that looks like the existing copy_from except it doesn't
> > copy the truncate_seq. See
> > https://github.com/ceph/ceph/commit/6f9ee7961eee9ee3a61a07cbe0d8d289ee98fa9a.
>
> But in that case I don't see the advantage of creating a new copy_from
> operation -- why not simply add an extra flag to the current operation,
> as I proposed in my PR?

I think it would function, yes. My main concern would just be that
it's a lot easier to accidentally sneak through an "invalid" op
request if it's the same OP ID and a flag that the OSD needs to look
for, versus a new op type that specifies the different behavior.

Mostly though I'm concerned about any potential differences in what
the client needs that I'm missing from this level of analysis... :/
-Greg



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux