Re: Ceph Performance very bad even in Memory?!

Yes, I also tried it on RAM disks and got basically the same results as with NVMes :-) Capacitors do matter, though: with them you get 1000 T1Q1 iops, without them you get 100-200 iops, and it starts to resemble an HDD at that point :-)
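
If you want to see that effect yourself, the usual check is a single-job, iodepth=1 sync write test - something like the following sketch (the device name is just an example, and the run destroys data on it):

fio -ioengine=libaio -direct=1 -fsync=1 -rw=randwrite -bs=4k \
    -iodepth=1 -numjobs=1 -runtime=60 -name=t1q1 \
    -filename=/dev/nvme0n1   # example device; fsync after every write exposes the capacitor difference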

From my profiling experiments I think there's no easy way to make Bluestore faster.

I'm also interested in Crimson benchmarks so please post them here after testing :-)

On 30 January 2022 at 21:23:55 GMT+03:00, "sascha a." <sascha.arthur@xxxxxxxxx> wrote:
>Hey Vitalif,
>
>I had already found your wiki as well as your own software. Pretty
>impressive, and I love your work!
>I especially like your "Theoretical Maximum Random Access Performance"
>section.
>That is exactly what I would expect of Ceph's performance as well (which
>is by design very close to your Vitastor).
>
>I hoped to get your attention on this topic, because you seem to be a
>performance/perfection driven guy, like me.
>Most people just accept the way the software works and take bad performance
>for granted. Most excuses come down to "cheap hardware", which I just
>eliminated by using RAM disks.
>
>Anyway, did you ever test a Ceph cluster against RAM disks?
>Even after consuming all of your wiki and applying my 20+ years of
>Unix/programming knowledge, this tells me that the NVMe used (with or
>without capacitor) doesn't matter at all when it is as badly utilized as
>it currently appears to be.
>I really can't believe how the software is currently behaving.. don't the
>devs ever test performance?
>
>There must be serious issues with synchronization, single-threaded
>bottlenecks or missed events for it to behave THAT badly even with
>in-memory OSDs.
>On the good side this means that a few simple code tweaks (threading,
>async waits, queues) should have a huge impact on performance.
>
>Currently I'm compiling Ceph with Crimson (the compile process also seems
>a bit messy) to check what the expected performance is with the new
>SeaStore backend.
>
>I will check Linstor as well, thanks!
>
>
>
>On Sun, Jan 30, 2022 at 6:37 PM <vitalif@xxxxxxxxxx> wrote:
>
>> Hi, yes, it has software bottlenecks :-)
>>
>> https://yourcmc.ru/wiki/Ceph_performance
>>
>> If you just need block storage - try Vitastor https://vitastor.io/
>> https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md - I
>> made it very architecturally similar to Ceph - or if you're fine with an
>> even simpler DRBD-based design (no distribution of every image across the
>> whole cluster) - even Linstor.
>>
>> Ceph is designed for multiple use cases at once, its design tradeoffs are
>> fine for S3 (cold/warm) object storage, but high-performance SSD-based
>> block storage is a different thing :-)
>>
>> > Hello,
>> >
>> > I'm currently in the process of setting up a production Ceph cluster on
>> > a 40 Gbit network (40 Gbit for both the internal and the public network).
>> >
>> > I already did a lot of machine/Linux tweaking (roughly as sketched below):
>> >
>> > - cpupower idle states disabled
>> > - lowlatency kernel
>> > - kernel tweaks
>> > - rx buffer optimization
>> > - affinity mappings
>> > - correct bus mapping of the PCIe cards
>> > - MTU..
>> > + many more
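>> >
>> > For reference, the kind of commands I mean (the interface name and the
>> > values are just examples, not my exact settings):
>> >
>> > # disable CPU idle states and pin the frequency governor
>> > cpupower idle-set -D 0
>> > cpupower frequency-set -g performance
>> > # jumbo frames and a bigger rx ring on the 40G NIC
>> > ip link set enp5s0 mtu 9000
>> > ethtool -G enp5s0 rx 4096
>> > # stop irqbalance so IRQ affinity can be pinned manually
>> > systemctl stop irqbalance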
>> >
>> > My machines are connected through a switch which is capable of multiple
>> > Tbit/s.
>> >
>> > iperf result between two machines:
>> > single connection ~20 Gbit/s (reaching single core limits)
>> > multiple connections ~39 Gbit/s
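>> >
>> > (Measured with plain iperf, roughly like this; the address is just an
>> > example:)
>> >
>> > # on one node
>> > iperf -s
>> > # single stream from another node, then 8 parallel streams
>> > iperf -c 10.0.0.1 -t 30
>> > iperf -c 10.0.0.1 -t 30 -P 8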
>> >
>> > A perfect starting point, I would say. Let's assume we only reach 20
>> > Gbit/s as the network speed.
>> >
>> > Now I wanted to check how much overhead Ceph has and what the absolute
>> > maximum is that I could get out of my cluster.
>> >
>> > I'm using 8 servers (16 cores + HT, all with the same CPUs/RAM/network
>> > cards) with plenty of RAM, which I'm using for my tests, and a simple
>> > pool with 3x replication.
>> > For that reason, and to prove that my servers are of high quality and
>> > speed, I created 70 GB RAM drives on all of them.
>> >
>> > The RAM drives were created using the kernel module "brd".
>> > Benchmarking the RAM drives with fio gave the following results:
>> >
>> > 5M IOPS read @ 4k
>> > 4M IOPS write @ 4k
>> >
>> > read and write latency below 10 us @ QD=16
>> > read and write latency below 50 us @ QD=256
>> >
>> > 1.5 GB/s sequential read @ 4k (QD=1)
>> > 1.0 GB/s sequential write @ 4k (QD=1)
>> >
>> > 15 GB/s read @ 4k (QD=256)
>> > 10 GB/s write @ 4k (QD=256)
>> >
>> > Pretty impressive - disks we are all dreaming of, I would say.
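>> >
>> > For reference, the RAM drives and the fio runs were along these lines
>> > (exact flags from memory; the numbers above are what matters):
>> >
>> > # one 70 GB ramdisk per host (rd_size is in KiB)
>> > modprobe brd rd_nr=1 rd_size=73400320
>> > # 4k random writes at QD=256 against it
>> > fio -ioengine=libaio -direct=1 -rw=randwrite -bs=4k -iodepth=256 \
>> >     -runtime=30 -name=ramtest -filename=/dev/ram0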
>> >
>> > To make sure I don't bottleneck anything, I created the following setup:
>> >
>> > - 3 servers, each running a mon, mgr and mds (all running in RAM,
>> > including their dirs in /var/lib/ceph.. by using a tmpfs/brd ramdisk)
>> > - 4 servers mapping their RAM drive as an OSD, created with Bluestore
>> > using ceph-volume raw or lvm
>> > - 1 server using rbd-nbd to map one RBD image as a drive and to benchmark
>> > it (roughly as sketched below)
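>> >
>> > Roughly what that looked like (pool/image names are just examples):
>> >
>> > # on each OSD host: one Bluestore OSD on top of the ramdisk
>> > ceph-volume lvm create --bluestore --data /dev/ram0
>> >
>> > # on the client: a replicated pool, an image, mapped via rbd-nbd
>> > ceph osd pool create rbdbench 128
>> > ceph osd pool application enable rbdbench rbd
>> > rbd create rbdbench/test --size 50G
>> > rbd-nbd map rbdbench/test        # gives e.g. /dev/nbd0
>> > fio -ioengine=libaio -direct=1 -rw=randwrite -bs=4k -iodepth=256 \
>> >     -runtime=60 -name=rbdtest -filename=/dev/nbd0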
>> >
>> > In this scenario I would expect impressive results; the only
>> > bottleneck between the servers is the 20 Gbit/s network speed.
>> > Everything else is running completely in low-latency ECC memory and
>> > should be blazing fast until the network limit is reached.
>> >
>> > The benchmark was monitored using this tool:
>> > https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
>> > and also by looking at the raw data of "ceph daemon osd.7 perf dump".
>> >
>> > *Result:*
>> > Either something is going really wrong, or Ceph has huge bottlenecks
>> > inside the software which should be solved...
>> >
>> > histogram_dump often showed latency spikes between "1M and 51M", ...
>> > which, if I read it correctly, is in seconds?!
>> > How is that possible? This should always be between 0-99k..
>> >
>> > The result of perf dump was also crazy slow:
>> > https://pastebin.com/ukV0LXWH
>> >
>> > especially these KPIs:
>> > - op_latency
>> > - op_w_latency
>> > - op_w_prepare_latency
>> > - state_deferred_cleanup_lat
>> >
>> > all in the range of milliseconds?
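>> >
>> > (Pulled out of the dump with something like the following; each value is
>> > an avgcount/sum pair with the sum in seconds, and the exact section names
>> > may differ between versions:)
>> >
>> > ceph daemon osd.7 perf dump | jq '{
>> >   op_latency:           .osd.op_latency,
>> >   op_w_latency:         .osd.op_w_latency,
>> >   op_w_prepare_latency: .osd.op_w_prepare_latency,
>> >   deferred_cleanup_lat: .bluestore.state_deferred_cleanup_lat
>> > }'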
>> >
>> > Against the nbd drive, fio reached 40k IOPS @ 4k with QD=256, at
>> > latencies of 1-7 milliseconds..
>> >
>> > Calculating with 40 Gbit/s, a 4 KiB write flows through the network in
>> > about 4096 * 8 / 40e9 ≈ 0.8 us; let's say 2 us because of the overhead,
>> > multiplied by 3 RTTs because of the replication... that should still
>> > easily reach latencies below 50 us.
>> > I also read a lot about slowly flushing disks, but there's definitely
>> > something going on in the software when it can't even handle in-memory
>> > disks without taking milliseconds..
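>> >
>> > (A back-of-the-envelope check of just the serialization time above,
>> > assuming a full 40 Gbit/s link:)
>> >
>> > # time to put one 4 KiB write onto a 40 Gbit/s wire
>> > awk 'BEGIN { printf "%.2f us\n", 4096 * 8 / 40e9 * 1e6 }'   # ~0.82 us per hop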
>> >
>> > All servers were running Ubuntu 21.10 and "ceph version 16.2.6
>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".
>> >
>> > What did I do wrong? What's going on here?!
>> > Please help me out!
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>

-- 
With best regards,
  Vitaliy Filippov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



