Gregory Farnum <gfarnum@xxxxxxxxxx> writes:

> On Fri, Jan 11, 2019 at 3:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>>
>> Gregory Farnum <gfarnum@xxxxxxxxxx> writes:
>>
>> > On Mon, Jan 7, 2019 at 7:01 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >>
>> >> While working on implementing copy_file_range(2) for the kernel CephFS
>> >> client, I found an issue with truncated files that is described in [1].
>> >> The TL;DR is that, when executing a 'copy-from' OSD operation, both
>> >> truncate_seq and truncate_size are copied from the base object into the
>> >> target object. This, at least in the context of copy_file_range, doesn't
>> >> make sense and will cause problems if, for example, the target file had
>> >> previously been truncated, i.e. target.truncate_seq > base.truncate_seq
>> >> (see test case in [1]).
>> >>
>> >> I've proposed a fix [2] but after discussing it with Gregory it sounds
>> >> more like a hack than a real solution. Basically my patch simply adds a
>> >> new flag to the 'copy-from' operation which a client can use so that
>> >> truncate_{seq,size} aren't copied from the base object (and are *not*
>> >> changed with the copy operation).
>> >>
>> >> Having my PR [2] tagged as 'pending-discussion', I decided to try to
>> >> kick off this discussion here in the mailing-list, maybe grabbing
>> >> attention from other people with a deeper understanding of the OSD
>> >> internals.
>> >>
>> >> Gregory's preferred solution would be to have the copy-from Op allow
>> >> setting truncate_seq and truncate_size values directly. Unfortunately,
>> >> there seems to be no easy way of changing the interfaces to allow this
>> >> to happen, as the ceph_osd_op union (in rados.h) doesn't seem to be able
>> >> to accommodate these 2 extra fields in copy_from. So, my initial
>> >> questions would be:
>> >>
>> >> - What would be the options for extending copy-from to include these 2
>> >>   extra fields?
>> >
>> > It just occurred to me — and let's be clear, this is a TERRIBLE hack —
>> > that we might be able to handle this by generating compound
>> > operations: the first op in the sequence uses copy-from either as it
>> > stands or in a changed one which doesn't copy the truncate values, and
>> > the second writes the correct new truncate values. I'm not sure if
>> > we've used compound ops with copy-from though, so it'd take some
>> > testing to make sure that works correctly and I've not forgotten some
>> > new rule.
>> >
>> >> - I believe I understand the usage of truncate_{seq,size}, but it's
>> >>   really not clear to me whether there are any scenarios where we *do*
>> >>   want to modify truncate_{seq,size} while doing a copy-from. In the
>> >>   case of CephFS I don't think a copy-from will ever truncate a file,
>> >>   so the values could be left unchanged. But would the obvious solution
>> >>   of simply *never* copying these fields be a valid solution? (I
>> >>   suspect the answer is 'no' :-)
>> >
>> > This is all from memory and should be checked by someone, but the
>> > truncate_seq and truncate_size are used approximately as follows (I'm
>> > definitely leaving out some of the manipulation for dealing with
>> > multiple objects):
>> >
>> > 1) client sends truncate op to MDS
>> > 2) MDS updates metadata and sends back new truncate_seq and
>> >    truncate_size to client, which saves them
>> > 3) on a data op, the client sends the truncate_seq and truncate_size
>> >    to the OSD along with the read or write request.
>> > 4) if the client has a newer truncate_seq than the OSD knows about,
>> >    the OSD will respect the provided truncate_size, which may mean
>> >    returning less data than it has, or performing a truncate on the
>> >    object itself.
>> >
>> > But now I know I'm forgetting something important here, because I
>> > can't remember how we deal with existing truncate_seq = 1,
>> > truncate_size = 10, then a truncate to 0 and a second truncate to 15
>> > (which results in the client providing truncate_seq 3, truncate_size 15
>> > to an OSD with a size-10 object that should be ignored).
>> >
>> > But leaving aside that issue, if we issue the OSD a copy-from op while
>> > we have a newer truncate_seq, it needs to update to our truncate_seq.
>> > Because we just gave it new data which it should return to any
>> > follow-on readers!
>>
>> Aha! I was missing this obvious scenario. /me facepalms
>> Of course we _need_ to update truncate_seq in this Op!
>>
>> > I am now wondering if maybe we're okay just having a second copy-from
>> > op that DOES fill in values the client is already sending — it looks
>> > like the existing copy-from op got bug-fixed to copy the truncate_seq
>> > in 0.94.4 because that's necessary for a CephFS cache pool to function,
>> > but obviously in this case we need different behavior.
>>
>> If I understand you correctly, you're saying that a new struct should be
>> added to the union in ceph_osd_op, something like:
>>
>>         struct {
>>                 __le64 snapid;
>>                 __le64 src_version;
>>                 __u8 flags;
>>                 __le32 src_fadvise_flags;
>>                 __le64 truncate_size;
>>                 __le32 truncate_seq;
>>         } __attribute__ ((packed)) copy_from2;
>>
>> But this will cause problems, because it's increasing the union size,
>> right?
>
> That is one thing I was discussing.
>
> However, the client already updates the truncate_size and truncate_seq
> on some (most? all?) write ops. (In userspace, Client::_write() calls
> filer->write_trunc() and passes in the truncate_seq and
> truncate_size.) So that may actually not be an issue if we just create
> a copy_from2 that looks like the existing copy_from except it doesn't
> copy the truncate_seq. See
> https://github.com/ceph/ceph/commit/6f9ee7961eee9ee3a61a07cbe0d8d289ee98fa9a.

But in that case I don't see the advantage of creating a new copy_from
operation -- why not simply add an extra flag to the current operation,
as I proposed in my PR?
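
To make sure we're comparing the same things, below is how I picture the
behaviours we've been discussing, as a rough (and untested) C model.  To
be clear, this is *not* the actual OSD code -- the flag and type names
are all invented -- it's only meant to show where the truncate metadata
would come from in each case:

/*
 * Toy model of the truncate metadata handling in 'copy-from'.  This is
 * NOT the OSD implementation; just a sketch of the three behaviours
 * discussed in this thread.  All names are made up.
 */
#include <stdint.h>
#include <stdio.h>

struct obj_meta {
	uint32_t truncate_seq;
	uint64_t truncate_size;
};

enum copy_mode {
	COPY_FROM_CURRENT,	/* today: clone truncate_{seq,size} from src */
	COPY_FROM_SKIP,		/* my PR [2]: leave the target's values as-is */
	COPY_FROM2_EXPLICIT,	/* copy_from2: client supplies the values,
				 * just like it already does for plain writes */
};

static void do_copy_from(struct obj_meta *dst, const struct obj_meta *src,
			 enum copy_mode mode,
			 uint32_t client_seq, uint64_t client_size)
{
	switch (mode) {
	case COPY_FROM_CURRENT:
		dst->truncate_seq = src->truncate_seq;
		dst->truncate_size = src->truncate_size;
		break;
	case COPY_FROM_SKIP:
		/* truncate metadata on the target object is untouched */
		break;
	case COPY_FROM2_EXPLICIT:
		dst->truncate_seq = client_seq;
		dst->truncate_size = client_size;
		break;
	}
}

int main(void)
{
	/* scenario from [1]: the target was truncated more recently than
	 * the source, so target.truncate_seq > base.truncate_seq */
	struct obj_meta src = { .truncate_seq = 1, .truncate_size = 0 };
	struct obj_meta dst = { .truncate_seq = 2, .truncate_size = 0 };

	do_copy_from(&dst, &src, COPY_FROM_CURRENT, dst.truncate_seq,
		     dst.truncate_size);
	printf("current copy-from: dst truncate_seq went from 2 to %u\n",
	       dst.truncate_seq);	/* it goes backwards -- the bug */
	return 0;
}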

Cheers,
--
Luis

>
> Zheng, is this something you can advise on more carefully than I? :)
> -Greg
>
>>
>> > I really don't know the details of the implementation around this
>> > enough any more to say with certainty what's going on. Hopefully you
>> > can investigate or one of the more full-time CephFS people knows the
>> > answer? :)
>>
>> Having this issue sorted out is definitely something I would love to see
>> in nautilus. But I'm also not comfortable with the OSD internals, and
>> figuring out all the details of dealing with a new Op seems like a
>> *huge* task to me. I guess I could try, but it will take me a while :-/
>>
>> Cheers,
>> --
>> Luis
>>
>> > -Greg
>> >
>> >>
>> >> Another problem is that the client will also need to figure out if the
>> >> OSDs have this issue fixed so that it can decide whether to use the Op
>> >> or not. My PR adds a new OSD_COPY_FILE_RANGE feature, overlapping
>> >> SERVER_NAUTILUS, but Greg's preference is to use the osdmap. The kernel
>> >> client does have _some_ support for osdmap but I believe some extra bits
>> >> would be required to use this map in this case (although I would need to
>> >> look closer to figure out what and how to do that).
>> >>
>> >> Anyway, I'm looking for ideas on how to sort this out so that we can
>> >> have copy_file_range fixed in CephFS.
>> >>
>> >> [1] https://tracker.ceph.com/issues/37378
>> >> [2] https://github.com/ceph/ceph/pull/25374
>> >>
>> >> Cheers,
>> >> --
>> >> Luis
>> >
>
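
P.S. In case it helps anyone who wants to poke at this without digging
out the full test case from [1], the shape of the problem is roughly the
following.  This is a purely illustrative, untested sketch: the paths
are placeholders, both files need to live on a CephFS mount, and the
copy size/alignment may need tweaking for your layout so that the
OSD-side 'copy-from' path is actually exercised.

/*
 * Sketch of the problematic copy_file_range(2) scenario -- see [1] for
 * the real test case.  Build with: cc -o repro repro.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define OBJ_SIZE (4u << 20)	/* assume the default 4M object size */

static char buf[OBJ_SIZE];

int main(void)
{
	memset(buf, 'x', sizeof(buf));

	int src = open("/mnt/cephfs/src", O_RDWR | O_CREAT | O_TRUNC, 0644);
	int dst = open("/mnt/cephfs/dst", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* populate both files, then truncate the *target* so that its
	 * truncate_seq moves ahead of the source's */
	pwrite(src, buf, sizeof(buf), 0);
	pwrite(dst, buf, sizeof(buf), 0);
	ftruncate(dst, 0);

	/* copy a whole object's worth of data; with the current 'copy-from'
	 * behaviour the target object inherits the source's (older)
	 * truncate_{seq,size}, which is the situation that misbehaves */
	loff_t off_in = 0, off_out = 0;
	ssize_t ret = copy_file_range(src, &off_in, dst, &off_out,
				      sizeof(buf), 0);
	if (ret < 0) {
		perror("copy_file_range");
		return 1;
	}

	printf("copied %zd bytes; now compare src and dst contents\n", ret);
	close(src);
	close(dst);
	return 0;
}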