On Mon, Jan 27, 2020 at 7:52 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>
> On Mon, Jan 27, 2020 at 07:16:17PM +0100, Ilya Dryomov wrote:
> > On Mon, Jan 27, 2020 at 5:43 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > As discussed here[1] I'm sending an RFC patchset that does the
> > > parallelization of the requests sent to the OSDs during a
> > > copy_file_range syscall in CephFS.
> > >
> > > [1] https://lore.kernel.org/lkml/20200108100353.23770-1-lhenriques@xxxxxxxx/
> > >
> > > I also have some performance numbers that I wanted to share.  Here's a
> > > description of the very simple tests I've run:
> > >
> > > - create a file with 200 objects in it
> > >   * i.e. tests with different object sizes mean different file sizes
> > > - drop all caches and umount the filesystem
> > > - Measure:
> > >   * mount filesystem
> > >   * full file copy (with copy_file_range)
> > >   * umount filesystem
> > >
> > > Tests were repeated several times and the average value was used for
> > > comparison.
> > >
> > > DISCLAIMER:
> > > These numbers are only indicative, and different clusters and client
> > > configs will for sure show different performance!  More rigorous tests
> > > would be required to validate these results.
> > >
> > > Having as baseline a full read+write (basically, a copy_file_range
> > > operation within a filesystem mounted without the 'copyfrom' option),
> > > here are some values for different object sizes:
> > >
> > >                            8M      4M      1M     65k
> > > read+write               100%    100%    100%    100%
> > > sequential                51%     52%     83%   >100%
> > > parallel (throttle=1)     51%     52%     83%   >100%
> > > parallel (throttle=0)     17%     17%     83%   >100%
> > >
> > > Notes:
> > >
> > > - 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
> > >   requests to copy the 200 objects in the file) were sent to the OSDs
> > >   and the wait for requests completion is done at the end only.
> > >
> > > - 'parallel (throttle=1)' was just a control test, where the wait for
> > >   completion is done immediately after a request is sent.  It was
> > >   expected to be very similar to the non-optimized ('sequential') tests.
> > >
> > > - These tests were executed on a cluster with 40 OSDs, spread across 5
> > >   (bare-metal) nodes.
> > >
> > > - The tests with object size of 65k show that copy_file_range definitely
> > >   doesn't scale to files with small object sizes.  '>100%' actually
> > >   means more than 10x slower.
> > >
> > > Measuring the mount+copy+umount masks the actual difference between
> > > different throttle values due to the time spent in mount+umount.  Thus,
> > > there was no real difference between throttle=0 (send all and wait) and
> > > throttle=20 (send 20, wait, send 20, ...).  But here's what I observed
> > > when measuring only the copy operation (4M object size):
> > >
> > > read+write               100%
> > > parallel (throttle=1)     56%
> > > parallel (throttle=5)     23%
> > > parallel (throttle=10)    14%
> > > parallel (throttle=20)     9%
> > > parallel (throttle=5)      5%
> >
> > Was this supposed to be throttle=50?
>
> Ups, no it should be throttle=0 (i.e. no throttle).
>
> > > Anyway, I'll still need to revisit patch 0003 as it doesn't follow the
> > > suggestion made by Jeff to *not* add another knob to fine-tune the
> > > throttle value -- this patch adds a kernel parameter for a knob that I
> > > wanted to use in my testing to observe different values of this
> > > throttle limit.
> > >
> > > The goal is probably to drop this patch and do the throttling in patch
> > > 0002.
> > > I just need to come up with a decent heuristic.  Jeff's suggestion was
> > > to use rsize/wsize, which are set to 64M by default IIRC.  Somehow I
> > > feel that it should be related to the number of OSDs in the cluster
> > > instead, but I'm not sure how.  And testing this sort of heuristic
> > > would require different clusters, which isn't particularly easy to
> > > get.  Anyway, comments are welcome!
> >
> > I agree with Jeff, this throttle is certainly not worth a module
> > parameter (or a mount option).  I would start with something like
> > C * (wsize / object size) and pick C between 1 and 4.
>
> Sure, I also agree with not adding the new parameter or mount option.
> It's just tricky to pick (and test!) the best formula to use.  From your
> proposal the throttle value would be by default between 16 and 64; those
> probably work fine in some situations (for example, in the cluster I used
> for running my tests).  But for a really big cluster, with hundreds of
> OSDs, it's difficult to say.

We don't really need a single client to be capable of spraying the entire
cluster in a single operation -- since the wsize is already an effective
control over how parallel a single write is allowed to be, I think we're
okay using it as the basis for copy_file_range as well, without worrying
about scaling it up.
-Greg

>
> Anyway, I'll come up with a proposal for the next revision.  And thanks a
> lot for your feedback.
>
> Cheers,
> --
> Luís
>
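
To make the scheme being discussed concrete, below is a minimal userspace
sketch (not the kernel patch itself) of the throttled copy loop: the
in-flight limit is derived from the suggested C * (wsize / object size)
formula, requests are sent in batches of that size, and each batch is
waited on before the next one is sent (the "send N, wait, send N, ..."
behaviour described above).  The helper names send_copy_request() and
wait_for_completions() are illustrative stand-ins, not the actual libceph
API.

/* Hypothetical model of the proposed throttling, not kernel code. */
#include <stdio.h>

#define MB (1024UL * 1024UL)

/* Stand-in for issuing one asynchronous OSD copy request. */
static void send_copy_request(unsigned long object)
{
	printf("  sent copy request for object %lu\n", object);
}

/* Stand-in for waiting until all in-flight requests have completed. */
static void wait_for_completions(void)
{
	printf("  ... waiting for in-flight requests to complete\n");
}

int main(void)
{
	unsigned long wsize = 64 * MB;       /* default wsize mentioned above */
	unsigned long object_size = 4 * MB;  /* object size used in the 4M tests */
	unsigned long num_objects = 200;     /* objects to copy, as in the tests */
	unsigned long C = 1;                 /* constant from the proposal, 1..4 */
	unsigned long throttle = C * (wsize / object_size);
	unsigned long i, in_flight = 0;

	printf("throttle = %lu concurrent copy requests\n", throttle);

	for (i = 0; i < num_objects; i++) {
		send_copy_request(i);
		if (++in_flight == throttle) {
			/* batch full: wait before sending any more requests */
			wait_for_completions();
			in_flight = 0;
		}
	}
	if (in_flight)
		wait_for_completions();
	return 0;
}

With the defaults quoted in the thread (wsize of 64M, 4M objects), this
gives a throttle of 16 with C=1 and 64 with C=4, i.e. the 16-64 range
mentioned above.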