Re: [RFC] New S3 Benchmark

Hi Lars!


Replies inline below


On 9/9/19 5:59 AM, Lars Marowsky-Bree wrote:
On 2019-08-20T08:53:50, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Mark,

sorry for the slow response. Got sidetracked into business travel ;-)


No worries, I've got too many plates in the air right now so the delay is perfectly fine. :)



I looked over https://github.com/markhpc/hsbench features to consider
how it compares to the fio S3/Swift/DAV backend.

fio's backend so far can only target one bucket (it can only be pointed at one
http_host), but in theory that could be filled in via variables or
different jobs.

That also limits the number of endpoints - it doesn't RR across multiple
endpoints for the same job, but fio would easily support kicking off the
same job for multiple endpoints concurrently via one control file, so
that'd likely achieve something similar?


I think you could definitely craft something for fio that would give you a similar effect, even if it doesn't work in exactly the same way.
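For instance, a control file that fans the same workload out to two gateways could use one job section per endpoint. This is purely an untested sketch with hypothetical hostnames, loosely patterned on the options in fio's examples/http-s3.fio:

```ini
; Shared workload definition; each job section below overrides http_host
; to point at a different gateway. Hostnames and bucket are hypothetical.
[global]
ioengine=http
http_mode=s3
http_s3_key=${S3_KEY}
http_s3_keyid=${S3_KEYID}
rw=randwrite
bs=4m
size=1g
filename=/mybucket/testobject

[gateway1]
http_host=rgw1.example.com

[gateway2]
http_host=rgw2.example.com
```

Since fio runs job sections concurrently by default, each section would drive its own endpoint in parallel.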



The whole latency reporting and test parametrization is, I think, an
advantage of fio.


On the hsbench side, I did try to craft the latency reporting piece so that you can ask for arbitrary percentiles.  Right now we just report min/50%/99%/max, but that could be fleshed out like fio's reporting.  One thing that I'm not super fond of in fio is the way interval logging is done, but that's probably somewhat minor in the grand scheme of things.



With one exception - it doesn't really currently support mixing object
sizes (block sizes) within one job easily given how I hacked that into
fio. I think it's probably safe, since I assume fio only ever reads
blocks/objects back with the same size it wrote them, but it's kinda
awkward ;-)


hsbench is quite simple in this regard too.  It doesn't do any kind of special mapping of object sizes, and it has nothing like fio's random map to keep random reads from hitting the same object twice.  We certainly could create some kind of object map in hsbench (and even store and retrieve that map in S3 for future read tests), but currently hsbench is really targeted at put/get/list/delete for homogeneous data sets.
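For what it's worth, with homogeneous objects under a fixed naming scheme, a read test can pick any object by index without storing a map at all, since the key can simply be regenerated. A tiny sketch; the key format here is hypothetical, not hsbench's actual scheme:

```go
package main

import (
	"fmt"
	"math/rand"
)

// objectKey regenerates the key for the idx-th object of a homogeneous
// data set. Because keys are derived from the index, no object map needs
// to be stored or fetched before a read test. (Hypothetical format.)
func objectKey(prefix string, idx int) string {
	return fmt.Sprintf("%s%012d", prefix, idx)
}

func main() {
	const numObjects = 1000
	rng := rand.New(rand.NewSource(42))
	// A random read just picks an index and rebuilds the key on the fly.
	fmt.Println(objectKey("obj", rng.Intn(numObjects)))
}
```

Avoiding repeats, the way fio's random map does, would still need a bitmap of visited indices on top of this.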



What fio also doesn't do is manage the buckets itself; it doesn't
support deletion/creation, it only does PUT/GET/DELETE for objects
within existing buckets. (Given that bucket creation could potentially
require quite a number of options to be passed, that was sort-of
intentional - I expected buckets to be provisioned before benchmarking
the cluster.)


It turns out after talking to our RGW guys that a really important test here is bucket list throughput.  I added that in hsbench 0.2 and Matt Benjamin has a PR in the works for testing unordered bucket listings as well.  This is especially important right now because with the way sharding works in RGW, you get much higher parallelism in one bucket if you shard it, but you hurt bucket list performance with higher read-amplification.



It can however delete all objects (just "trim" everything afterwards; see
https://github.com/axboe/fio/blob/master/examples/http-s3.fio).

And it doesn't support multipart uploads. But I think that's the same
for hsbench so far.


Yeah, no special provisions for multipart yet.



Of course, hsbench gets to benefit from the aws Go bindings. There
weren't any lightweight S3 libraries for C, so fio ended up with its
tiny little rewrite of one. (I think swearing might have been involved
in getting the authentication right :-D)


I believe it!  hsbench is mostly just a glorified wrapper around the aws go bindings with some timing and statistics aggregation thrown on top.  The whole thing is like 1K lines of code.  You have more control if you own the S3 library, but for hsbench the aws bindings appear to be good enough so far.



I wonder if you've been able to compare both of them yet; I'd be curious
which one is more lightweight and can get higher per client performance?


Honestly I haven't had time to try yet.  I'm trying to be a little clever about concurrency and am using atomics in Go to keep away from the mutex/locking overhead that I suspect channels would bring, but I wouldn't be terribly surprised if the fio/S3 implementation could be faster if it's not doing anything silly.  I've noticed that hsbench can still use a ton of CPU at high throughput rates: 16K GETs/s can consume 6+ Xeon E5-2650 v3 cores (pretty old these days, though).  I suspect that a big part of this is on the networking side and how connections are recycled.  I probably need to take a bit more time to look at this in detail.


In any event, I haven't been totally idle.  Once I got hsbench working well enough to run tests, we started collecting information from our lab that led to a bunch of work improving RGW.  You can see some of the test results here:


https://docs.google.com/spreadsheets/d/1q8MZJo9rp_3Kvs8ARaIxXf_TZbnnB0dRmrq8bCw2uDM/edit?usp=sharing
https://docs.google.com/spreadsheets/d/17SiwtVWy3jJdeXocFXrcPFvn8F5x1m8525BwjTevvME/edit?usp=sharing


And some of the PRs to make RGW faster here:


https://github.com/ceph/ceph/pull/29980 (cls/rgw: [WIP] Make rgw_bucket_list encode bls rather than dir entries, markhpc)
https://github.com/ceph/ceph/pull/29943 (rgw/rgw_op: Remove get_val from hotpath via legacy options, markhpc)
https://github.com/ceph/ceph/pull/29894 (rgw/rgw_reshard: Don't dump RGWBucketReshard JSON in process_single_logshard, markhpc)
https://github.com/ceph/ceph/pull/29852 (rgw: move bucket reshard checks out of write path, cbodley)


Casey also recently posted a proposal to dev@xxxxxxx for allowing writes during resharding which would fix some of those stalls seen in the graphs.  Matt also has a good idea regarding changing the shard hashing behavior to preserve ordering so that bucket listing times should remain fast even with large numbers of shards.


Also, if you look at those graphs you'll see a lot of fluctuation on the put side even after those PRs are applied.  Putting the DB/WAL on a ramdisk (without bluefs) stabilizes throughput pretty significantly, so it looks like, at least on the write path, hsbench is fast enough to showcase that putting rocksdb on an optane or pmem device might stabilize write throughput rates and potentially improve overall small-object write performance.  Once the new community performance nodes are set up, that's one of the first tests I intend to run on them.



Now, I'd still like to have this all be in the same tool across all
benchmarks if feasible, but seeing the benefits of using a language for
which an Amazon SDK is available ...
https://medium.com/learning-the-go-programming-language/calling-go-functions-from-other-languages-4c7d8bcc69bf
A fio plugin wrapper around hsbench's S3 module ought to be possible?


I was actually excited to hear about the fio effort because it's a second option to compare results against!  I would be a little sad to lose that if you adopted some of the hsbench code for the backend, but on the other hand I understand why you might want to do it (certainly the bindings are very convenient!).


Mark
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



