On Tue, Feb 12, 2019 at 11:00:18AM +0800, Yan, Zheng wrote: > On Fri, Jan 11, 2019 at 7:17 PM Luis Henriques <lhenriques@xxxxxxxx> wrote: > > > > Gregory Farnum <gfarnum@xxxxxxxxxx> writes: > > > > > On Mon, Jan 7, 2019 at 7:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote: > > >> > > >> Hi, > > >> > > >> While working on implementing copy_file_range(2) for the kernel CephFS > > >> client, I found an issue with truncated files that is described in [1]. > > >> The TL;DR is that, when executing a 'copy-from' OSD operation, both > > >> truncate_seq and truncate_size are copied from the base object into the > > >> target object. This, at least in the context of copy_file_range, doesn't > > >> make sense and will cause problems if, for example, the target file had > > >> previously been truncated, i.e. target.truncate_seq > base.truncate_seq > > >> (see test case in [1]). > > >> > > >> I've proposed a fix [2] but after discussing it with Gregory it sounds > > >> more like a hack than a real solution. Basically my patch simply adds a > > >> new flag to the 'copy-from' operation which a client can use so that > > >> truncate_{seq,size} aren't copied from the base object (and are *not* > > >> changed with the copy operation). > > >> > > >> Having my PR [2] tagged as 'pending-discussion', I decided to try to > > >> kick-off this discussion here in the mailing-list, maybe grabbing > > >> attention from other people with a deeper understanding of the OSD > > >> internals. > > >> > > >> Gregory's preferred solution would be to have the copy-from Op to allow > > >> to set truncate_seq and truncate_size values directly. Unfortunately, > > >> there seems to be no easy way of changing the interfaces to allow this > > >> to happen as the ceph_osd_op union (in rados.h) doesn't seem to be able > > >> to accommodate these 2 extra fields in copy_from. So, my initial > > >> questions would be: > > >> > > >> - What would be the options for extending copy-from to include this 2 > > >> extra fields? > > > > > > It just occurred to me — and let's be clear, this is a TERRIBLE hack — > > > that we might be able to handle this by generating compound > > > operations: the first op in sequence uses copy-from either as it > > > stands or in a changed one which doesn't copy the truncate values, and > > > the second writes the correct new truncate values. I'm not sure if > > > we've used compound ops with copy-from though so it'd take some > > > testing to make sure that works correctly and I've not forgotten some > > > new rule. > > > > > >> - I believe I understand the usage of truncate_{seq,size}, but it's > > >> really not clear to me whether there are any scenarios where we *do* > > >> want to modify truncate_{seq,size} while doing a copy-from. In the > > >> case of CephFS I don't think a copy-from will ever truncate a file, > > >> so the values could be left unchanged. But would the obvious solution > > >> of simply *never* copying these fields be a valid solution? (I > > >> suspect the answer is 'no' :-) > > > > > > This is all from memory and should be checked by someone, but the > > > truncate_seq and truncate_size are used approximately as follows: (I'm > > > definitely leaving out some of the manipulation for dealing with > > > multiple objects) > > > > > > 1) client sends truncate op to MDS > > > 2) MDS updates metadata and sends back new truncate_seq and > > > truncate_size to client, which saves them > > > 3) on a data op, the client sends the truncate_seq and truncate_size > > > to the OSD along with the read or write request. > > > 4) if the client has a newer truncate_seq than the OSD knows about, > > > the OSD will respect the provided truncate_size, which may mean > > > returning less data than it has, or performing a truncate on the > > > object itself. > > > > > > But now I know I'm forgetting something important here, because I > > > can't remember how we deal with existing truncate_seq = 1 > > > truncate_size = 10, then a truncate to 0 and a second truncate to 15 > > > (which results in the client providing truncate_seq 3 truncate_size 15 > > > to an OSD with a size 10 object that should be ignored). > > > > > > But leaving aside that issue, if we issue the OSD a copy-from op while > > > we have a newer truncate-seq, it needs to update to our truncate-seq. > > > Because we just gave it new data which it should return to any > > > follow-on readers! > > > > Aha! I was missing this obvious scenario. /me facepalms > > Of course we _need_ to update truncate_seq in this Op! > > > > > I am now wondering if maybe we're okay just having a second copy-from > > > op that DOES fill in values the client is already sending — it looks > > > like the existing copy-from op got bug-fixed to copy the truncate_seq > > > in .94.4 because that's necessary for a CephFS cache pool to function, > > > but obviously in this case we need different behavior. > > > > If I understand you correctly, you're saying that a new struct should be > > added to the union in ceph_osd_op, something like: > > > > struct { > > __le64 snapid; > > __le64 src_version; > > __u8 flags; > > __le32 src_fadvise_flags; > > __le64 truncate_size; > > __le32 truncate_seq; > > } __attribute__ ((packed)) copy_from2; > > > > But this will cause problems, because it's increasing the union size, > > right? > > > > we have already encodeed src_name and src_oloc into osd_op.indata. I > think we can add more fields to it Hmm... Ok, this would allow us to have a new copy_from operation containing fewer fields (including the truncate_{size,seq}) and drop/move some of the current ones into osd_op.indata. Sounds like an interesting approach. I'll need to spend some time understanding how would this work. This approach also addresses Gregory's concerns because this would be a completely new Op. Obviously the client would need to be modified too, but those would be the easy bits :) Cheers, -- Luís