Hey Vitalif,

I had already come across your wiki as well as your own software. Pretty
impressive, and I love your work! I especially like your "Theoretical
Maximum Random Access Performance" section; that is exactly what I would
expect of Ceph's performance as well (Ceph being, by design, very close to
your Vitastor). I was hoping to get your attention on this topic, because
you seem to be a performance/perfection-driven guy, like me. Most people
just accept the way software works and take bad performance for granted.
The usual excuse is "cheap hardware", which I just eliminated by using RAM
disks.

Anyway, did you ever test a Ceph cluster against RAM disks? Even after
consuming your whole wiki and applying my 20+ years of Unix/programming
knowledge, this tells me that the NVMe drives used (with or without
capacitors) don't matter at all when they are utilized as badly as it
currently appears. I can hardly believe the software really behaves this
way - don't the devs ever test performance? There must be serious
synchronization/single-threaded bottlenecks or missed events for it to
behave THAT badly even with in-memory OSDs. On the positive side, this
means a few targeted code changes (threading, async waits, queues) should
have a huge impact on performance.

I'm currently compiling Ceph with Crimson (the build process also seems a
bit messy) to check what performance to expect from the new SeaStore
backend. I will check Linstor as well, thanks!

On Sun, Jan 30, 2022 at 6:37 PM <vitalif@xxxxxxxxxx> wrote:

> Hi, yes, it has software bottlenecks :-)
>
> https://yourcmc.ru/wiki/Ceph_performance
>
> If you just need block storage - try Vitastor https://vitastor.io/
> https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md -
> I made it very architecturally similar to Ceph - or, if you're fine
> with an even simpler DRBD-based design (no distribution of every image
> across the whole cluster), even Linstor.
>
> Ceph is designed for multiple use cases at once; its design tradeoffs
> are fine for S3 (cold/warm) object storage, but high-performance
> SSD-based block storage is a different thing :-)
>
> > Hello,
> >
> > I'm currently in the process of setting up a production Ceph cluster
> > on a 40 Gbit network (40 Gbit for both the cluster and the public
> > network).
> >
> > I have already done a lot of machine/Linux tuning:
> >
> > - cpupower idle states disabled
> > - low-latency kernel
> > - kernel tunables
> > - RX buffer tuning
> > - IRQ/CPU affinity mappings
> > - correct bus mapping of the PCIe cards
> > - MTU
> > + many more
> >
> > My machines are connected through a switch capable of multiple
> > Tbit/s.
> >
> > iperf result between two machines:
> > single connection ~20 Gbit/s (reaching single-core limits)
> > multiple connections ~39 Gbit/s
> >
> > A perfect starting point, I would say. Let's assume we only reach
> > 20 Gbit/s of network speed.
> >
> > Now I wanted to check how much overhead Ceph has and what the
> > absolute maximum is that I can get out of my cluster.
> >
> > I am using 8 servers (16 cores + HT), all with identical CPUs, RAM
> > and network cards, and plenty of RAM, which I am using for my tests.
> > A simple pool with 3x replication.
> > For that reason, and to prove that my servers are fast enough, I
> > created 70 GB RAM drives on all of them.
> >
> > The RAM drives were created using the kernel module "brd".
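(For reference: a brd ramdisk of that size and a matching fio run can be
recreated roughly as sketched below. The device name, runtime and fio job
parameters are illustrative, not the exact ones used in the test.)

    # load brd with a single ramdisk; rd_size is in KiB, 73400320 KiB ~= 70 GB
    modprobe brd rd_nr=1 rd_size=73400320

    # 4k random-write benchmark against the ramdisk at QD=16
    fio --name=ramdisk-4k-randwrite --filename=/dev/ram0 \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --iodepth=16 --numjobs=1 --runtime=30 --time_based \
        --group_reporting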
> > Benchmarking the RAM drives with fio gave the following results:
> >
> > 5M IOPS read @ 4k
> > 4M IOPS write @ 4k
> >
> > read and write | latency below 10 us @ QD=16
> > read and write | latency below 50 us @ QD=256
> >
> > 1.5 GB/s sequential read @ 4k (QD=1)
> > 1.0 GB/s sequential write @ 4k (QD=1)
> >
> > 15 GB/s read @ 4k (QD=256)
> > 10 GB/s write @ 4k (QD=256)
> >
> > Pretty impressive - disks we are all dreaming about, I would say.
> >
> > To make sure I don't bottleneck anything, I created the following
> > setup:
> >
> > - 3 servers, each running a mon, mgr and mds (all running in RAM,
> >   including their directories in /var/lib/ceph, via tmpfs/brd
> >   ramdisk)
> > - 4 servers exposing their RAM drive as an OSD, created with
> >   BlueStore via ceph-volume raw or lvm
> > - 1 server using rbd-nbd to map one RBD image as a drive and to
> >   benchmark it
> >
> > In this scenario I would expect impressive results; the only
> > bottleneck between the servers is the 20 Gbit network speed.
> > Everything else runs entirely in low-latency ECC memory and should
> > be blazing fast until the network limit is reached.
> >
> > The benchmark was monitored with this tool:
> > https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
> > and by looking at the raw data from "ceph daemon osd.7 perf dump".
> >
> > *Result:*
> > Either something is going really wrong, or Ceph has huge bottlenecks
> > inside the software which should be solved...
> >
> > histogram_dump frequently showed latency spikes between "1M and
> > 51M", which - if I read it correctly - is seconds?!
> > How is that possible? This should always stay between 0 and 99k.
> >
> > The perf dump results were also crazy slow:
> > https://pastebin.com/ukV0LXWH
> >
> > especially these KPIs:
> > - op_latency
> > - op_w_latency
> > - op_w_prepare_latency
> > - state_deferred_cleanup_lat
> >
> > all in the range of milliseconds?
> >
> > Against the nbd drive, fio reached 40k IOPS @ 4k with QD=256, with
> > latencies in the range of 1-7 milliseconds.
> >
> > At 40 Gbit/s, pushing 4k of data through the network takes about
> > 4*1024*8 / 40e9 = ~0.8 us; let's call it 2 us with overhead,
> > multiplied by 3 round trips because of the replication... that
> > should still easily end up below 50 us of latency.
> > I have also read a lot about slow-flushing disks, but there is
> > definitely something going on in the software when it can't even
> > handle in-memory disks without taking milliseconds.
> >
> > All servers were running Ubuntu 21.10 and "ceph version 16.2.6
> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".
> >
> > What did I do wrong? What's going on here?!
> > Please help me out!
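PS: for anyone who wants to reproduce the rbd-nbd numbers from the quoted
mail, a minimal sequence looks roughly like the sketch below. Pool name,
image name, size and PG counts are placeholders, not the values from the
original test.

    # create a replicated test pool and an RBD image in it
    ceph osd pool create testpool 128 128 replicated
    rbd pool init testpool
    rbd create --size 10G testpool/testimg

    # map the image via rbd-nbd; this prints the device, e.g. /dev/nbd0
    rbd-nbd map testpool/testimg

    # 4k random-write benchmark at QD=256 against the mapped device
    fio --name=rbd-nbd-4k-randwrite --filename=/dev/nbd0 \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --iodepth=256 --numjobs=1 --runtime=60 --time_based \
        --group_reporting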