Re: [RFC] New S3 Benchmark

Hi Lars!


Replies inline below


On 9/9/19 5:59 AM, Lars Marowsky-Bree wrote:
On 2019-08-20T08:53:50, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Mark,

sorry for the slow response. Got sidetracked into business travel ;-)


No worries, I've got too many plates in the air right now so the delay is perfectly fine. :)



I looked over https://github.com/markhpc/hsbench features to consider
how it compares to the fio S3/Swift/DAV backend.

fio's backend so far can only target one bucket (it can only be pointed at one
http_host), but in theory that could be filled in via variables or
different jobs.

That also limits the number of endpoints - it doesn't RR across multiple
endpoints for the same job, but fio would easily support kicking off the
same job for multiple endpoints concurrently via one control file, so
that'd likely achieve something similar?


I think you could definitely craft something for fio that would give you a similar effect, even if it doesn't work in exactly the same way.
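For instance, a control file that fans the same workload out to two gateways could use one job section per endpoint. This is purely an untested sketch with hypothetical hostnames, loosely patterned on the options in fio's examples/http-s3.fio:

```ini
; Shared workload definition; each job section below overrides http_host
; to point at a different gateway. Hostnames and bucket are hypothetical.
[global]
ioengine=http
http_mode=s3
http_s3_key=${S3_KEY}
http_s3_keyid=${S3_KEYID}
rw=randwrite
bs=4m
size=1g
filename=/mybucket/testobject

[gateway1]
http_host=rgw1.example.com

[gateway2]
http_host=rgw2.example.com
```

Since fio runs job sections concurrently by default, each section would drive its own endpoint in parallel.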



The whole latency reporting and test parametrization is, I think, an
advantage of fio.


On the hsbench side, I did try to craft the latency reporting piece so that you can ask for arbitrary percentiles.  Right now we just report min/50%/99%/max, but that could be fleshed out like fio's reporting.  One thing that I'm not super fond of in fio is the way interval logging is done, but that's probably somewhat minor in the grand scheme of things.



With one exception - it doesn't really currently support mixing object
sizes (block sizes) within one job easily given how I hacked that into
fio. I think it's probably safe, since I assume fio only ever reads
blocks/objects back with the same size it wrote them, but it's kinda
awkward ;-)


hsbench is quite simple in this regard too.  It doesn't do any kind of special mapping of object sizes, and it has nothing like fio's random map to keep random reads from hitting the same object twice.  We certainly could create some kind of object map in hsbench (and even store and retrieve that map in S3 for future read tests), but currently hsbench is really targeted at put/get/list/delete for homogeneous data sets.
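For what it's worth, with homogeneous objects under a fixed naming scheme, a read test can pick any object by index without storing a map at all, since the key can simply be regenerated. A tiny sketch; the key format here is hypothetical, not hsbench's actual scheme:

```go
package main

import (
	"fmt"
	"math/rand"
)

// objectKey regenerates the key for the idx-th object of a homogeneous
// data set. Because keys are derived from the index, no object map needs
// to be stored or fetched before a read test. (Hypothetical format.)
func objectKey(prefix string, idx int) string {
	return fmt.Sprintf("%s%012d", prefix, idx)
}

func main() {
	const numObjects = 1000
	rng := rand.New(rand.NewSource(42))
	// A random read just picks an index and rebuilds the key on the fly.
	fmt.Println(objectKey("obj", rng.Intn(numObjects)))
}
```

Avoiding repeats, the way fio's random map does, would still need a bitmap of visited indices on top of this.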



What fio also doesn't do is manage the buckets itself; it doesn't
support deletion/creation, it only does PUT/GET/DELETE for objects
within existing buckets. (Given that bucket creation could potentially
require quite a number of options to be passed, that was sort-of
intentional - I expected buckets to be provisioned before benchmarking
the cluster.)


It turns out after talking to our RGW guys that a really important test here is bucket list throughput.  I added that in hsbench 0.2 and Matt Benjamin has a PR in the works for testing unordered bucket listings as well.  This is especially important right now because with the way sharding works in RGW, you get much higher parallelism in one bucket if you shard it, but you hurt bucket list performance with higher read-amplification.



It can however delete all objects (just "trim" everything afterwards; see
https://github.com/axboe/fio/blob/master/examples/http-s3.fio).

And it doesn't support multipart uploads. But I think that's the same
for hsbench so far.


Yeah, no special provisions for multipart yet.



Of course, hsbench gets to benefit from the aws Go bindings. There
weren't any lightweight S3 libraries for C, so fio ended up with its
tiny little rewrite of one. (I think swearing might have been involved
in getting the authentication right :-D)


I believe it!  hsbench is mostly just a glorified wrapper around the aws go bindings with some timing and statistics aggregation thrown on top.  The whole thing is like 1K lines of code.  You have more control if you own the S3 library, but for hsbench the aws bindings appear to be good enough so far.



I wonder if you've been able to compare both of them yet; I'd be curious
which one is more lightweight and can get higher per client performance?


Honestly I haven't had time to try yet.  I'm trying to be a little clever about concurrency and am using atomics in Go to keep away from the mutex/locking overhead that I suspect channels would bring, but I wouldn't be terribly surprised if the fio/S3 implementation could be faster if it's not doing anything silly.  I've noticed that hsbench can still use a ton of CPU at high throughput rates: 16K GETs/s can consume 6+ Xeon E5-2650 v3 cores (pretty old these days, though).  I suspect that a big part of this is on the networking side and how connections are recycled.  I probably need to take a bit more time to look at this in detail.


In any event, I haven't been totally idle.  Once I got hsbench working well enough to run tests, we started collecting information from our lab that led to a bunch of work improving RGW.  You can see some of the test results here:


https://docs.google.com/spreadsheets/d/1q8MZJo9rp_3Kvs8ARaIxXf_TZbnnB0dRmrq8bCw2uDM/edit?usp=sharing
https://docs.google.com/spreadsheets/d/17SiwtVWy3jJdeXocFXrcPFvn8F5x1m8525BwjTevvME/edit?usp=sharing


And some of the PRs to make RGW faster here:


https://github.com/ceph/ceph/pull/29980 (cls/rgw: [WIP] Make rgw_bucket_list encode bls rather than dir entries, markhpc)
https://github.com/ceph/ceph/pull/29943 (rgw/rgw_op: Remove get_val from hotpath via legacy options, markhpc)
https://github.com/ceph/ceph/pull/29894 (rgw/rgw_reshard: Don't dump RGWBucketReshard JSON in process_single_logshard, markhpc)
https://github.com/ceph/ceph/pull/29852 (rgw: move bucket reshard checks out of write path, cbodley)


Casey also recently posted a proposal to dev@xxxxxxx for allowing writes during resharding which would fix some of those stalls seen in the graphs.  Matt also has a good idea regarding changing the shard hashing behavior to preserve ordering so that bucket listing times should remain fast even with large numbers of shards.


Also, if you look at those graphs you'll see a lot of fluctuation on the put side even after those PRs are applied.  Putting the DB/WAL on a ramdisk (without bluefs) stabilizes throughput pretty significantly, so it looks like, at least on the write path, hsbench is fast enough to showcase that putting rocksdb on an optane or pmem device might stabilize write throughput rates and potentially improve overall small-object write performance.  Once the new community performance nodes are set up, that's one of the first tests I intend to run on them.



Now, I'd still like to have this all be in the same tool across all
benchmarks if feasible, but seeing the benefits of using a language for
which an Amazon SDK is available ...
https://medium.com/learning-the-go-programming-language/calling-go-functions-from-other-languages-4c7d8bcc69bf
A fio plugin wrapper around hsbench's S3 module ought to be possible?


I was actually excited to hear about the fio effort because it's a second option to compare results against!  I would be a little sad to lose that if you adopted some of the hsbench code for the backend, but on the other hand I understand why you might want to do it (certainly the bindings are very convenient!).


Mark
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



