I have no idea how you get 66k write IOPS with one OSD :) I've just repeated the test by creating a test pool on one NVMe OSD with 8 PGs (all pinned to the same OSD with pg-upmap) and then running four fio randwrite jobs at queue depth 128 against 4 RBD images. I got 17k IOPS. OK, that's actually not the worst result for Ceph, but the problem is that I only get 30k write IOPS when benchmarking 4 RBD images spread over all OSDs _in_the_same_cluster_. And there are 14 of them.
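For reference, this is roughly what I ran. The pool/image names, the OSD id (osd.3) and the pool id (14) are just examples, and I'm reconstructing the exact flags from memory, so treat this as a sketch rather than the literal commands:

  # size=1 pool with 8 PGs, autoscaler off so the PG count stays fixed
  ceph osd pool create onepg-test 8 8 replicated
  ceph osd pool set onepg-test pg_autoscale_mode off
  ceph osd pool set onepg-test size 1 --yes-i-really-mean-it  # Octopus also needs mon_allow_pool_size_one=true
  # assuming the new pool got id 14 (check 'ceph osd pool ls detail'), pin all 8 PGs to osd.3
  for i in 0 1 2 3 4 5 6 7; do ceph osd pg-upmap 14.$i 3; done
  # one of the four images / fio jobs (bench0..bench3) that ran in parallel
  rbd create onepg-test/bench0 --size 128G
  fio --name=bench0 --ioengine=rbd --clientname=admin --pool=onepg-test \
      --rbdname=bench0 --rw=randwrite --bs=4k --iodepth=128 \
      --time_based --runtime=300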
>> I've just finished doing our own benchmarking, and I can say that what you want to do is very unbalanced and CPU-bound.
>>
>> 1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per ceph-osd at top performance (see the recent thread on 'ceph on brd'), with more realistic numbers around 300-400% CPU per device.

>>> In fact, in isolation on the test setup that Intel donated for community Ceph R&D we've pushed a single OSD to consume around 1400% CPU at 80K write IOPS! :) I agree though, we typically see a peak of about 500-600% CPU per OSD on multi-node clusters, with a correspondingly lower write throughput. I do believe that in some cases the mix of IO we are doing causes us to be at least partially bound by disk write latency though, given the single writer thread in the RocksDB WAL.

>> I'd really like to see how they did this without offloading (their configuration).

> I went back and looked over some of the old results. I didn't find the really high test scores (and now that I'm thinking about it, they may have been from when I was ripping out pglog OMAP updates!), but here's one example I did find from earlier testing last winter that at least got into roughly the right ballpark with stock master from last December (~66K IOPS):
>
> Avg 4K FIO randwrite IOPS: 65841.7
>
> - 1 P4510 NVMe-backed OSD
> - 8GB osd memory target
> - 4K min alloc size
> - 4 clients, 1 128GB RBD volume per client, io_depth=128, time=300s
> - 128 PGs (fixed)
> - latency-network tuned profile
> - bluestore_rocksdb_options = "compression=kNoCompression,max_total_wal_size=1073741824,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,compaction_style=kCompactionStyleUniversal"
> - bluestore_default_buffered_write = true
> - bluestore_default_buffered_read = true
> - rbd cache = false
>
> Beyond that, general stuff like background scrubbing and PG autoscaling was disabled. I should note that these results use universal compaction in RocksDB, which you probably don't want in production because it can require 2x the total DB space to perform a compaction. It might actually be feasible now that we are doing column family sharding thanks to Adam's PR, since you will only need 2x the space of any individual column family for a compaction rather than the whole DB, but it's still unsupported for now.
>
> Mark

>> 2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe a little more with a top-tier low-core high-frequency CPU, but not much). So a super-duper NVMe won't make a difference. (BTW, I have a stupid idea to try running two ceph-osd daemons from the same VG with a single PV underneath, but it's not tested.)

>>> I'm curious if you've tried Octopus+ yet? We refactored bluestore's caches, which internally has proven to help quite a bit with latency-bound workloads, as it reduces lock contention in the onode cache shards and the impact of cache trimming (no more single trim thread constantly grabbing the lock for long periods of time!). In a 64 NVMe drive setup (P4510s), we were able to do a little north of 400K write IOPS with 3x replication, so about 19K IOPS per OSD once you factor replication in. Also, in Nautilus you can see real benefits with running multiple OSDs on a single device, but with Octopus and master we've pretty much closed the gap on our test setup:

>> It's Octopus. I was doing a single-OSD benchmark, removing all moving parts (brd instead of NVMe, no network, size=1, etc.). Moreover, I focused on the rados benchmark, as RBD is just a derivative of rados performance.
>>
>> Anyway, a big thank you for the input.

>>> https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing
>>>
>>> Generally speaking, using the latency-performance or latency-network tuned profiles helps (mostly by avoiding C-state CPU transitions), as do higher clock speeds. Not using replication helps too, but that's obviously not a realistic solution for most people. :)

>> I used size=1 and 'no SSD, no network' as an upper bound. It lets you find the limits of ceph-osd performance. Any real-life factors (replication, network, real block devices) will make things worse, not better. Knowing the upper performance bound is really handy when you start choosing a server configuration.
>>
>> 3. You will find that any given client's performance is heavily limited by the sum of all RTTs in the network plus Ceph's own latencies, so very fast NVMe gives diminishing returns.
>>
>> 4. A CPU-bound ceph-osd completely wipes out any differences between underlying devices (except for desktop-class crawlers).
>>
>> You can run your own tests, even without fancy 48-NVMe boxes - just run ceph-osd on brd (the block RAM disk). ceph-osd won't run any faster on anything else (a ramdisk is the fastest), so the numbers you get from brd are a supremum (upper bound) on theoretical performance.
>>
>> Given a max of 400-500% CPU per ceph-osd, I'd say you need to keep the number of NVMe drives per server below 12, or maybe 15 (but then you'll sometimes hit CPU saturation).
>>
>> In my opinion, less fancy boxes with a smaller number of drives per server (but a larger number of servers) would make your (or your operations team's) life much less stressful.

>>> That's pretty much the advice I've been giving people since the Inktank days. It costs more and is lower density, but the design is simpler, you are less likely to under-provision CPU, less likely to run into memory bandwidth bottlenecks, and you have less recovery to do when a node fails. Especially now with how many NVMe drives you can fit in a single 1U server!
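P.S. For anyone who wants to reproduce the brd "upper bound" test suggested above: on a throwaway cluster it boils down to something like the following. The RAM disk size, pool name and PG count here are arbitrary and the commands are written from memory, so adjust to taste:

  # one 16 GiB RAM disk at /dev/ram0 (rd_size is in KiB)
  modprobe brd rd_nr=1 rd_size=16777216
  # deploy a throwaway OSD on it (your lvm.conf filter may need to allow /dev/ram*)
  ceph-volume lvm create --data /dev/ram0
  # size=1 pool so only that single OSD is exercised
  ceph osd pool create ramtest 128 128 replicated
  ceph osd pool set ramtest size 1 --yes-i-really-mean-it
  # 4 KiB object writes straight through librados at queue depth 128
  rados -p ramtest bench 60 write -t 128 -b 4096 --no-cleanup

ceph-osd won't run any faster on a real device than it does on a RAM disk, so whatever you measure this way is the ceiling for your particular CPU and Ceph version.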