Re: Low random write performance on All-NVMe Ceph cluster

Hi,

I haven't benchmarked bluestore yet,
but with filestore, with the same kind of setup (3 nodes, 24 cores @ 3GHz), I'm able to reach around 200k iops (cpu limited too), with ssd or nvme.
But I have 1 disk = 1 osd. I'm not sure of the impact of having 4 osds on 1 nvme.
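
If it helps, one way to gauge whether a single NVMe actually needs more writer parallelism (the usual motivation for running several OSDs on one drive) is a raw fio baseline at different job counts. This is only a sketch; /dev/nvme0n1 and the job/iodepth values are placeholders, and it destroys data on the target device:

# WARNING: destroys data on the target device - only run against an empty disk
# compare IOPS at --numjobs=1 vs --numjobs=4 to see if one writer can saturate the drive
fio --name=nvme-baseline --filename=/dev/nvme0n1 \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --numjobs=4 --iodepth=32 --runtime=60 --time_based --group_reporting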




----- Original Message -----
From: "Mike A" <mike.almateia@xxxxxxxxx>
To: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Thursday, 9 August 2018 03:25:58
Subject: Low random write performance on All-NVMe Ceph cluster

Hello! 

Recently we assembled and configured our new Ceph cluster using NVMe disks. 
For now this cluster has only 3 servers, each with 6 NVMe Intel DC P4500 4TB disks, 18 disks in total. 
We are faced with a problem: on a 100% random write test with 4k blocks, total cluster performance is no more than 65000 IOPS. 
During this test the cluster is actually CPU limited, with each core loaded to about 90%. 
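
For reference, the exact test command isn't shown here; a typical way to run this kind of 4k 100% random write test against an RBD image with fio looks roughly like the following, where the pool name, image name, cephx user and job/iodepth values are only placeholders:

# 4k random write against an existing RBD image via librbd
fio --name=rbd-randwrite-4k --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=testimg \
    --rw=randwrite --bs=4k --numjobs=16 --iodepth=32 \
    --runtime=60 --time_based --group_reporting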

Ceph cluster hardware (each server has): 
* 2 x CPU Intel Xeon Gold 6148, 20 cores / 2.4GHz, maximum performance settings, HT enabled 
* 96GB RAM 
* 6 x NVMe Intel DC P4500 4TB disks, LBA formatted to 4k blocks, 3 disks per NUMA node 
* 2 x NIC Mellanox ConnectX-4 Lx EN, 2-port 25Gbit 
* 2 x SATA SSD Intel DC S3520 for OS boot and the MON store location 

All servers are installed in 3 racks and connected to the same 2 Arista switches with 100Gbit ports. 

Ceph cluster software configuration (each server has): 
* Ceph Luminous v12.2.7 
* CentOS 7.5.1804 with latest updates 
* Kernel 3.10.0-862.9.1 
* Each NVMe disk has 4 partitions for 4 OSD processes; in total each server has 24 OSDs, 4 per NVMe disk (see the provisioning sketch after this list) 
* 2 team interfaces are configured to separate client and cluster traffic; each team interface uses its own NIC (no cross-NIC teaming) 
* Monitors are placed on the same servers and use the Intel DC S3520 disks 
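
As referenced above, a rough sketch of how one NVMe can be split into 4 bluestore OSDs; the device name and partition boundaries are placeholders, and the tooling actually used for this cluster isn't stated here:

# partition the NVMe into 4 equal parts (example sizes)
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart osd0 0% 25%
parted -s /dev/nvme0n1 mkpart osd1 25% 50%
parted -s /dev/nvme0n1 mkpart osd2 50% 75%
parted -s /dev/nvme0n1 mkpart osd3 75% 100%

# one bluestore OSD per partition (repeat for nvme0n1p2..p4)
ceph-volume lvm create --bluestore --data /dev/nvme0n1p1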

Ceph cluster config: 
* CRUSH map failure domain = rack (see the rule sketch after this list) 
* Using bluestore; rocksdb is on the same OSD disks, not separated 
* Bluestore compression is not used 
* Using crc32c checksum type 
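
A rack failure domain like this is typically expressed with a replicated CRUSH rule along the following lines; the rule and pool names are placeholders, not taken from this cluster:

# create a replicated rule that places copies across distinct racks
ceph osd crush rule create-replicated replicated_rack default rack
# point a pool at the new rule
ceph osd pool set rbd crush_rule replicated_rack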

We began to investigate the problem and found "towers" of nested functions and shared objects in the flamegraphs, 110-120 frames high. 
At the top of these "towers" Ceph's own functions actually do the work [1]. 

This is the first time I have observed such a depth of nested functions. 

It also turned out that the OSDs spend literally 50% of their CPU on network interaction (the msgr-worker-0,1,2 threads). 
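
One way to see that per-thread split (messenger threads versus the rest of the OSD) is a thread-level CPU breakdown; <PID> is a placeholder for the OSD process id:

# per-thread CPU usage for one OSD, sampled every second, 10 samples
pidstat -t -p <PID> 1 10
# or interactively:
top -H -p <PID>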

All perf data was gathered like this: perf record -g -p <PID> -- sleep 60, on already running OSD processes. 
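
The flamegraphs linked below can be produced from such perf data with the FlameGraph scripts (https://github.com/brendangregg/FlameGraph); the file names here are just placeholders:

# from a clone of the FlameGraph repo; turn perf.data into a folded stack file and an SVG
perf script > osd.perf
./stackcollapse-perf.pl osd.perf > osd.folded
./flamegraph.pl osd.folded > osd_flamegraph.svg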

Is this normal, or have we found something working wrong? Or is it simply a wrong hardware choice, and we are limited by CPU/NVMe? 

Links 
[1] https://drive.google.com/open?id=1m8aB0TTFJudyd1gGzNJ1_w6zr6LNgcjy - OSD0_RW 
[2] https://drive.google.com/open?id=1JDr7rLRxSAwP3LMZihAtCXFQqidqIPY_ - OSD0_RR_4k_16jobs_1qd 
[3] https://drive.google.com/open?id=19EiHAiQ4OrqhBcImb1QhQaeoIu0XXjVV - OSD0_RR_4k_64jobs_128qd 
[4] https://drive.google.com/open?id=1-6pVEzpkz76fEBl9x8eF242h3Eksx52D - Perf record data, OSD process, "perf record -g -p <PID> -- sleep 60" 

Ceph/Bluestore configs: 
[global] 
bluestore csum type = crc32c 
bluefs buffered io = true 

### Async Messenger 
ms tcp read timeout = 120 
ms async transport type = posix 
ms async set affinity = true 
ms async affinity cores = 20,21,22,23 

[osd] 
bluestore cache size = 1073741824 
bluestore cache size ssd = 1073741824 
bluestore compression algorithm = snappy 
bluestore compression mode = none 
bluestore cache kv max = 1G 
bluestore rocksdb options = compression=kNoCompression,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB 
osd op num shards = 8 
osd op num threads per shard = 2 

— 
Mike, runs! 




-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majordomo@xxxxxxxxxxxxxxx 
More majordomo info at http://vger.kernel.org/majordomo-info.html 



