Re: Low random write performance on All-NVMe Ceph cluster

Hello

> On 9 Aug 2018, at 13:45, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I haven't benchmarked bluestore yet,
> but with filestore, with the same kind of setup (3 nodes, 24 cores at 3 GHz), I'm able to reach around 200k IOPS (CPU limited too), with SSD or NVMe.
> But I have 1 disk = 1 OSD. I'm not sure of the impact of having 4 OSDs on 1 NVMe.
> 

Using 4 OSDs per NVMe disk is necessary to unlock the full potential of the NVMe disks, since a single OSD process cannot fully load one disk.
This recommendation is described here: http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning
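In practice this just means splitting each NVMe device into 4 equal partitions and creating one OSD on each. A minimal sketch with parted (the device name is illustrative; how the OSDs are then created on the partitions depends on the deployment tool, e.g. ceph-disk or ceph-ansible):

  # label the device and create 4 equal GPT partitions, one per future OSD
  parted -s /dev/nvme0n1 mklabel gpt
  parted -s /dev/nvme0n1 mkpart osd0 0% 25%
  parted -s /dev/nvme0n1 mkpart osd1 25% 50%
  parted -s /dev/nvme0n1 mkpart osd2 50% 75%
  parted -s /dev/nvme0n1 mkpart osd3 75% 100%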

> 
> ----- Original Message -----
> From: "Mike A" <mike.almateia@xxxxxxxxx>
> To: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Thursday, 9 August 2018 03:25:58
> Subject: Low random write performance on All-NVMe Ceph cluster
> 
> Hello! 
> 
> Recently we assembled and configured our new Ceph cluster, which uses NVMe disks. 
> For now the cluster has only 3 servers, each with 6 NVMe Intel DC P4500 4TB disks, 18 disks in total. 
> We ran into a problem: on a 100% random write test with 4k blocks, total cluster performance is no more than 65000 IOPS. 
> On this test the cluster is actually CPU limited, with each core loaded at about 90%. 
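(For reference: a 100% random write 4k test like this is commonly driven with fio's rbd engine; the invocation below is only a sketch with illustrative pool/image names, not necessarily the exact load generator used here.)

  # 4k 100% random write against one RBD image, 4 jobs at queue depth 128
  fio --ioengine=rbd --clientname=admin --pool=testpool --rbdname=testimage \
      --rw=randwrite --bs=4k --iodepth=128 --numjobs=4 --group_reporting \
      --time_based --runtime=300 --name=randwrite-4k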
> 
> Ceph cluster hardware (each server has): 
> * 2 x CPU Intel Xeon Gold 6148, 20 cores / 2.4 GHz, maximum performance settings, HT enabled 
> * 96 GB RAM 
> * 6 x NVMe Intel DC P4500 4TB disks, LBA formatted to 4k blocks, 3 disks per NUMA node 
> * 2 x NIC Mellanox ConnectX-4 Lx EN, 2-port 25Gbit 
> * 2 x SATA SSD Intel DC S3520 for OS boot and the MON database 
> 
> All servers are installed in 3 racks and connected to the same 2 Arista switches with 100Gbit ports. 
> 
> Ceph cluster software configuration (each server has): 
> * Ceph Luminous v12.2.7 
> * Centos 7.5.1804 with latest updates 
> * Kernel is 3.10.0-862.9.1 
> * Each NVMe disk has 4 partitions for 4 OSD processes; in total each server has 24 OSDs, 4 per NVMe disk. 
> * 2 team interfaces configured, to separate client and cluster traffic; each team interface uses its own NIC (no cross-NIC teaming) 
> * Monitors are placed on the same servers and use the Intel DC S3520 disks 
> 
> Ceph cluster config: 
> * CRUSH map failure domain = rack 
> * Using BlueStore; RocksDB is on the same OSD disks, not separated 
> * BlueStore compression is not used 
> * Using the crc32c checksum type 
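(A rack failure domain like this is typically expressed in Luminous as a replicated CRUSH rule; the sketch below uses an illustrative rule name and a pool placeholder.)

  # create a replicated rule with rack as the failure domain and assign it to a pool
  ceph osd crush rule create-replicated replicated_rack default rack
  ceph osd pool set <pool> crush_rule replicated_rack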
> 
> We began to investigate the problem and found "towers" of nested functions and shared objects in the flame graphs, 110-120 frames high. 
> At the top of these "towers" Ceph's own functions actually do the work [1]. 
> 
> It is the first time I have seen this many nested functions. 
> 
> It also turned out that an OSD spends literally 50% of its CPU time on network interaction (the msgr-worker-0,1,2 threads). 
> 
> All perf data was gathered like this: perf record -g -p <PID> -- sleep 60, on already running OSD processes. 
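(Flame graphs like [1]-[3] are usually rendered from such perf data with Brendan Gregg's FlameGraph scripts; the output file name below is illustrative.)

  # fold the perf stacks and render an SVG flame graph
  perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl > osd0_rw.svg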
> 
> Is this normal, or have we found something working wrong? Or is it just a wrong hardware choice, so we are limited by CPU/NVMe? 
> 
> Links 
> [1] https://drive.google.com/open?id=1m8aB0TTFJudyd1gGzNJ1_w6zr6LNgcjy - OSD0_RW 
> [2] https://drive.google.com/open?id=1JDr7rLRxSAwP3LMZihAtCXFQqidqIPY_ - OSD0_RR_4k_16jobs_1qd 
> [3] https://drive.google.com/open?id=19EiHAiQ4OrqhBcImb1QhQaeoIu0XXjVV - OSD0_RR_4k_64jobs_128qd 
> [4] https://drive.google.com/open?id=1-6pVEzpkz76fEBl9x8eF242h3Eksx52D. - Perf record data, process OSD, "perf record -g -p <PID> -- sleep 60" 
> 
> Ceph/Bluestore configs: 
> [global] 
> bluestore csum type = crc32c 
> bluefs buffered io = true 
> 
> ### Async Messenger 
> ms tcp read timeout = 120 
> ms async transport type = posix 
> ms async set affinity = true 
> ms async affinity cores = 20,21,22,23 
> 
> [osd] 
> bluestore cache size = 1073741824 
> bluestore cache size ssd = 1073741824 
> bluestore compression algorithm = snappy 
> bluestore compression mode = none 
> bluestore cache kv max = 1G 
> bluestore rocksdb options = compression=kNoCompression,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB 
> osd op num shards = 8 
> osd op num threads per shard = 2 
> 
>
> Mike, runs! 

-- 
Mike, runs!







