Low random write performance on All-NVMe Ceph cluster

Hello!

We recently assembled and configured a new Ceph cluster built on NVMe disks.
So far the cluster has only 3 servers with 6 Intel DC P4500 4 TB NVMe disks each, 18 disks in total.
The problem: in a 100% random write test with 4k blocks, total cluster performance tops out at about 65,000 IOPS.
During this test the cluster is CPU-bound: every core is about 90% loaded.

Ceph cluster hardware (per server):
* 2 x Intel Xeon Gold 6148 CPU, 20 cores / 2.4 GHz, maximum performance settings, HT enabled
* 96 GB RAM
* 6 x Intel DC P4500 4 TB NVMe disks, formatted with a 4k LBA size, 3 disks per NUMA node
* 2 x Mellanox ConnectX-4 Lx EN dual-port 25 Gbit NICs
* 2 x Intel DC S3520 SATA SSDs for OS boot and the MON database

The servers are installed in 3 racks and connected to the same 2 Arista switches with 100 Gbit ports.

Ceph cluster software configuration (per server):
* Ceph Luminous v12.2.7
* CentOS 7.5.1804 with the latest updates
* Kernel 3.10.0-862.9.1
* Each NVMe disk has 4 partitions, one per OSD process. Each server runs 24 OSDs in total, 4 per NVMe disk.
* 2 team interfaces are configured to separate client and cluster traffic. Each team interface uses its own NIC (no cross-NIC teaming).
* The monitors run on the same servers and use the Intel DC S3520 disks
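
To put the 65,000 IOPS figure in per-OSD terms, here is a rough back-of-the-envelope sketch. The server, disk and OSD counts are the ones listed above; the replica count is NOT stated anywhere in this post, so ASSUMED_REPLICAS = 3 is only a placeholder, and the calculation ignores any extra writes BlueStore and RocksDB generate internally.

```python
# Figures taken from the post.
SERVERS = 3
NVME_PER_SERVER = 6
OSDS_PER_NVME = 4
CLIENT_WRITE_IOPS = 65_000  # observed 4k random write ceiling

# ASSUMPTION: the pool's replica count is not given in the post.
ASSUMED_REPLICAS = 3


def per_unit_iops(client_iops: int, replicas: int) -> dict:
    """Spread client write IOPS evenly over all OSDs and NVMe devices."""
    osds = SERVERS * NVME_PER_SERVER * OSDS_PER_NVME
    nvmes = SERVERS * NVME_PER_SERVER
    # Every client write is written once per replica.
    replica_writes = client_iops * replicas
    return {
        "osds": osds,
        "client_iops_per_osd": client_iops / osds,
        "replica_writes_per_osd": replica_writes / osds,
        "replica_writes_per_nvme": replica_writes / nvmes,
    }


if __name__ == "__main__":
    for name, value in per_unit_iops(CLIENT_WRITE_IOPS, ASSUMED_REPLICAS).items():
        print(f"{name}: {value:,.0f}")
```

Under that assumption each OSD handles only a few thousand replica writes per second, which is consistent with the cluster being limited by CPU rather than by the disks.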

Ceph cluster config:
* CRUSH map failure domain = rack
* BlueStore with RocksDB on the same OSD disks, not separated
* BlueStore compression is not used
* Checksum type is crc32c

We began investigating and found "towers" of nested functions or shared objects in the flame graphs, 110-120 frames high.
Only at the very top of these "towers" do Ceph's functions actually do work [1].

This is the first time I have seen nesting this deep.

It also turned out that the OSDs spend literally 50% of their CPU time on network interaction (msg-worker-0,1,2).

All perf data was collected from already running OSD processes like this: perf record -g -p <PID> -- sleep 60
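
The per-thread share of CPU samples can be cross-checked from the text output of `perf script`. The sketch below is a minimal example, assuming the default `perf script` layout in which each sample starts with a header line of the form "comm pid [cpu] timestamp:" followed by stack frames. The regular expression, the function names and the "msgr-worker" prefix are illustrative assumptions, not part of perf or Ceph. Sample counts only approximate CPU time, and only when perf samples at a fixed frequency.

```python
import re
import sys
from collections import Counter

# Header line of one sample in default `perf script` output, for example:
#   "msgr-worker-0 12345 [020] 1234.567890:     250000 cycles:"
# ASSUMPTION: default field layout (comm, pid or pid/tid, optional [cpu], timestamp).
HEADER_RE = re.compile(r"^\s*(\S.*?)\s+\d+(?:/\d+)?\s+(?:\[\d+\]\s+)?\d+\.\d+:")


def samples_per_thread(lines):
    """Count samples per thread name; stack frame lines are skipped."""
    counts = Counter()
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def share(counts, prefix):
    """Fraction of all samples whose thread name starts with `prefix`."""
    total = sum(counts.values())
    hits = sum(n for name, n in counts.items() if name.startswith(prefix))
    return hits / total if total else 0.0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: perf script -i perf.data > osd0.txt; python3 thread_share.py osd0.txt
    with open(sys.argv[1], errors="replace") as fh:
        per_thread = samples_per_thread(fh)
    for name, n in per_thread.most_common(10):
        print(f"{name}: {n}")
    print(f"messenger share: {share(per_thread, 'msgr-worker'):.1%}")
```

If the messenger threads really take about half of the samples, the last line printed should be close to 50%.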

Is this normal, or have we found something that is working incorrectly? Or did we simply choose the wrong hardware and are limited by CPU/NVMe?

Links
[1] https://drive.google.com/open?id=1m8aB0TTFJudyd1gGzNJ1_w6zr6LNgcjy   - OSD0_RW
[2] https://drive.google.com/open?id=1JDr7rLRxSAwP3LMZihAtCXFQqidqIPY_   - OSD0_RR_4k_16jobs_1qd
[3] https://drive.google.com/open?id=19EiHAiQ4OrqhBcImb1QhQaeoIu0XXjVV   - OSD0_RR_4k_64jobs_128qd
[4] https://drive.google.com/open?id=1-6pVEzpkz76fEBl9x8eF242h3Eksx52D   - Perf record data for the OSD process, "perf record -g -p <PID> -- sleep 60"

Ceph/Bluestore configs:
[global]
bluestore csum type = crc32c
bluefs buffered io = true

### Async Messenger 
ms tcp read timeout = 120
ms async transport type = posix
ms async set affinity = true
ms async affinity cores = 20,21,22,23

[osd]
bluestore cache size = 1073741824
bluestore cache size ssd = 1073741824
bluestore compression algorithm = snappy
bluestore compression mode = none
bluestore cache kv max = 1G
bluestore rocksdb options = compression=kNoCompression,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
osd op num shards = 8
osd op num threads per shard = 2
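
For reference, the settings above translate into the following per-server thread and cache totals. This is only an arithmetic sketch using numbers from this post. It assumes 3 messenger worker threads per OSD (the msg-worker-0,1,2 threads seen in the profiles) and assumes that the "ms async affinity cores" setting is actually honored by this release, which has not been verified here. The function name is made up for the example.

```python
# Values copied from the post.
LOGICAL_CPUS = 2 * 20 * 2          # 2 sockets x 20 cores x HT
OSDS_PER_SERVER = 24
OP_NUM_SHARDS = 8                  # osd op num shards
OP_THREADS_PER_SHARD = 2           # osd op num threads per shard
MSGR_WORKERS_PER_OSD = 3           # ASSUMPTION: msg-worker-0,1,2 per OSD
AFFINITY_CORES = (20, 21, 22, 23)  # ms async affinity cores
CACHE_BYTES_PER_OSD = 1073741824   # bluestore cache size


def server_budget() -> dict:
    """Total op threads, messenger threads and cache for one server."""
    op_threads = OSDS_PER_SERVER * OP_NUM_SHARDS * OP_THREADS_PER_SHARD
    msgr_threads = OSDS_PER_SERVER * MSGR_WORKERS_PER_OSD
    return {
        "logical_cpus": LOGICAL_CPUS,
        "op_threads": op_threads,
        "op_threads_per_cpu": op_threads / LOGICAL_CPUS,
        "msgr_threads": msgr_threads,
        # ASSUMPTION: the [global] affinity setting applies to every OSD
        # on the host, so all messenger threads share these cores.
        "msgr_threads_per_affinity_core": msgr_threads / len(AFFINITY_CORES),
        "cache_gib": OSDS_PER_SERVER * CACHE_BYTES_PER_OSD / 2**30,
    }


if __name__ == "__main__":
    for name, value in server_budget().items():
        print(f"{name}: {value:g}")
```

With these values a server runs 384 op threads on 80 logical CPUs, and, if the affinity setting takes effect, 72 messenger threads would share 4 cores.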

— 
Mike, runs!



