Hello

> 9 Aug 2018, at 13:45, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>
> Hi,
>
> I haven't benchmarked bluestore yet, but with filestore, with the same
> kind of setup (3 nodes, 24 cores / 3 GHz), I'm able to reach around
> 200k IOPS (CPU limited too), with SSD or NVMe.
> But I have 1 disk = 1 OSD. I'm not sure of the impact of having 4 OSDs
> per NVMe.

Using 4 OSDs per NVMe disk is necessary to unlock the potential of NVMe disks, since a single OSD process cannot fully load such a disk. This recommendation is described here: http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning

> ----- Original mail -----
> From: "Mike A" <mike.almateia@xxxxxxxxx>
> To: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Thursday, 9 August 2018 03:25:58
> Subject: Low random write performance on All-NVMe Ceph cluster
>
> Hello!
>
> We have recently assembled and configured our new Ceph cluster, which
> uses NVMe disks.
> For now this cluster has only 3 servers, with 6 NVMe Intel DC P4500 4TB
> disks each, 18 disks in total.
> We ran into a problem: in a 100% random write test with 4k blocks, the
> total cluster performance is no more than 65000 IOPS.
> During this test the cluster is effectively CPU limited; the load on
> each core is about 90%.
>
> Ceph cluster hardware (each server has):
> * 2 x Intel Xeon Gold 6148 CPUs, 20 cores / 2.4 GHz, maximum performance settings, HT enabled
> * 96 GB RAM
> * 6 x NVMe Intel DC P4500 4TB disks, LBA formatted to 4k blocks, 3 disks per NUMA node
> * 2 x Mellanox ConnectX-4 Lx EN NICs, 2 ports, 25 Gbit
> * 2 x SATA SSD Intel DC S3520 disks for OS boot and the MON stores
>
> All servers are installed in 3 racks, sharing the same 2 Arista
> switches with 100 Gbit ports.
>
> Ceph cluster software configuration (each server has):
> * Ceph Luminous v12.2.7
> * CentOS 7.5.1804 with latest updates
> * Kernel 3.10.0-862.9.1
> * Each NVMe disk has 4 partitions for 4 OSD processes.
> In total each server has 24 OSDs, 4 per NVMe disk.
> * 2 team interfaces configured, to separate client and cluster
> traffic. Each team interface is configured on its own NIC (no
> cross-NIC teaming).
> * Monitors are placed on the same servers and use the Intel DC S3520 disks.
>
> Ceph cluster config:
> * CRUSH map failure domain = rack
> * Using bluestore; RocksDB on the same OSD disks, not separated
> * Bluestore compression not used
> * Using crc32c checksum type
>
> While investigating the problem, we found "towers" of nested functions
> or shared objects in the flamegraphs, 110-120 frames high.
> At the top of these "towers" is where Ceph's functions actually do the
> work [1].
>
> This is the first time I have observed such a number of nested functions.
>
> It also turned out that OSDs spend literally 50% of CPU time on network
> processing (msg-worker-0,1,2).
>
> All perf data was collected like this: "perf record -g -p <PID> -- sleep 60",
> on already running OSD processes.
>
> Is this normal, or have we found something working wrong? Or is it
> simply the wrong hardware selection, limited by CPU/NVMe?
>
> Links
> [1] https://drive.google.com/open?id=1m8aB0TTFJudyd1gGzNJ1_w6zr6LNgcjy - OSD0_RW
> [2] https://drive.google.com/open?id=1JDr7rLRxSAwP3LMZihAtCXFQqidqIPY_ - OSD0_RR_4k_16jobs_1qd
> [3] https://drive.google.com/open?id=19EiHAiQ4OrqhBcImb1QhQaeoIu0XXjVV - OSD0_RR_4k_64jobs_128qd
> [4] https://drive.google.com/open?id=1-6pVEzpkz76fEBl9x8eF242h3Eksx52D
- Perf record data, OSD process, "perf record -g -p <PID> -- sleep 60"
>
> Ceph/Bluestore configs:
>
> [global]
> bluestore csum type = crc32c
> bluefs buffered io = true
>
> ### Async Messenger
> ms tcp read timeout = 120
> ms async transport type = posix
> ms async set affinity = true
> ms async affinity cores = 20,21,22,23
>
> [osd]
> bluestore cache size = 1073741824
> bluestore cache size ssd = 1073741824
> bluestore compression algorithm = snappy
> bluestore compression mode = none
> bluestore cache kv max = 1G
> bluestore rocksdb options = compression=kNoCompression,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
> osd op num shards = 8
> osd op num threads per shard = 2
>
> --
> Mike, runs!
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Mike, runs!
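[Editor's note: as a sanity check on the "CPU limited" observation, the numbers in the thread allow a back-of-envelope calculation. The sketch below is only that, a sketch: the pool replication size of 3 is an assumption (the mail does not state it), and only physical cores are counted even though HT is enabled. The IOPS figure, node count, and core counts come from the report above.]

```python
# Back-of-envelope check: is 65k client write IOPS plausibly CPU-bound
# on this hardware? Replication size 3 is an ASSUMPTION; the original
# mail does not state the pool size.

client_iops = 65_000        # measured 4k random-write IOPS, whole cluster
replication = 3             # assumed pool size (not stated in the mail)
nodes = 3
cores_per_node = 2 * 20     # 2 x Xeon Gold 6148, 20 physical cores each
cpu_utilization = 0.90      # reported per-core load during the test

# Each client write is performed replication-many times on the backend.
backend_iops = client_iops * replication
busy_cores = nodes * cores_per_node * cpu_utilization
iops_per_core = backend_iops / busy_cores

print(f"backend write IOPS:  {backend_iops}")          # 195000
print(f"busy physical cores: {busy_cores:.0f}")        # 108
print(f"IOPS per busy core:  {iops_per_core:.0f}")     # 1806
```

If those assumptions hold, roughly 1.8k backend write IOPS per busy physical core is in the same ballpark as Alexandre's filestore figure above (200k IOPS on 3 nodes, CPU limited too), so a CPU-bound result at 65k client IOPS looks consistent rather than anomalous.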