Hello Ceph users,

We've been running some tests with Ceph RGW, mainly to see how Ceph handles a large number of objects in a single bucket.

Test cluster: 3 nodes, each running co-located OSDs, MON, MGR, and RGW.

CPU: 2x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (48 threads in total)
RAM: 128 GB
Network: 4 x 10 Gbps in a single LACP bond
OSD drives: 2 x 800 GB NVMe, write intensive
Ceph version: Nautilus (14.2.9)

The data pool uses replica 3, and Ceph is running with only default configuration values.

We ran a COSBench test with 200 threads, writing 100 million objects of 4 KB each, and noticed that performance started at ~500 ops/sec, then repeatedly jumped above 7,000 ops/sec and dropped back below 500 ops/sec:

https://i.imgur.com/TopM6sw.png
https://i.imgur.com/y5Mu9F3.png

All the data collected from the COSBench test:
https://docs.google.com/spreadsheets/d/1wAwrg9nE2e_MItQB5wVrmLIO-KH7hkUBtz06YpFQtXA/edit?usp=sharing

We noticed that during the low-performance periods the cluster is doing read I/O and response times are very high:

https://i.imgur.com/tgZ5WLF.png
https://i.imgur.com/2PiEGZB.png

Here are the I/O stats for the index and data pools:

https://i.imgur.com/hC3HZ1R.png
https://i.imgur.com/TwsXghv.png

We repeated the test multiple times on clean clusters, with similar results. We also ran a second test, writing 50 million new objects to the same bucket we had previously filled with 100 million objects, and everything worked perfectly.

Does anyone know why this is happening? The response times are huge and would be a disaster in a production environment!

Thank you,

---
Alex Cucu
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
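P.S. For anyone wanting to reproduce this, here is a quick back-of-the-envelope sizing of the workload. All numbers are taken straight from the test parameters above; this is plain arithmetic, not anything measured on the cluster:

```python
# Back-of-the-envelope sizing for the COSBench test described above.
# Inputs come from the post; nothing here queries a live cluster.

objects = 100_000_000          # objects written by COSBench
obj_size = 4 * 1024            # 4 KB per object
replicas = 3                   # data pool uses replica 3

logical_bytes = objects * obj_size
raw_bytes = logical_bytes * replicas

gib = 1024 ** 3
print(f"logical data: {logical_bytes / gib:.0f} GiB")   # ~381 GiB
print(f"raw on disk:  {raw_bytes / gib:.0f} GiB")       # ~1144 GiB
```

So the raw footprint is only ~1.1 TiB against 6 x 800 GB of NVMe, i.e. raw capacity is not the constraint here; the unusual part of the workload is simply the 100 million keys landing in a single bucket.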