Re: Low random write performance on All-NVMe Ceph cluster

Hi Mike,

The first thing I would try is bumping up the bluestore cache size. No guarantee, but it's easy to change and you've got tons of memory. During 4K random writes, bluestore is usually throttled by the kv sync thread, and fetching onodes from the rocksdb cache (or worse, from disk) can really hurt. If your tests are against a large RBD volume, 1GB of bluestore cache is going to slow you down on NVMe.

With 96GB of RAM divided between 24 OSDs, you should be able to give each bluestore instance more like 2-3GB of cache with room to spare. FWIW, I've been working on making the OSD smarter about how it divvies up memory, so eventually you'll no longer need to set mins/maxes or ratios for the different caches, just an OSD target memory size. The PRs for that are still being merged though.
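For example, the change might look like this in ceph.conf (values are illustrative only — budget cache so that cache size times OSDs per node stays comfortably below physical RAM):

```
[osd]
# ~3 GiB of bluestore cache per OSD (value in bytes)
# with 24 OSDs per node and 96GB RAM, 2-3 GiB each leaves headroom
bluestore cache size = 3221225472
bluestore cache size ssd = 3221225472
```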

You might also want to try out my or Adam's wallclock profiler to see where real (wall-clock) time, rather than CPU time, is being spent:

https://github.com/markhpc/gdbpmp

https://github.com/aclamk/wallclock

Mark

On 08/08/2018 08:25 PM, Mike A wrote:
Hello!

Recently we assembled and configured our new Ceph cluster using NVMe disks.
So far this cluster has only 3 servers, each with 6 NVMe Intel DC P4500 4TB disks, 18 disks in total.
We ran into a problem: in a 100% random-write 4K-block test, total cluster performance is no more than 65,000 IOPS.
During this test the cluster is actually CPU limited, with each core loaded to about 90%.

Ceph cluster hardware (each server has):
* 2 x CPU Intel Xeon Gold 6148 20 cores/2.4GHz, maximum performance settings, HT enabled
* 96GB RAM
* 6 x NVMe Intel DC P4500 4TB disks, LBA formatted to 4K blocks, 3 disks per NUMA node
* 2 x NIC Mellanox ConnectX-4 Lx EN 2-port 25Gbit
* 2 x SATA SSD Intel DC S3520 for OS boot and the MON database

All servers are installed in 3 racks, sharing the same 2 Arista switches with 100Gbit ports.

Ceph cluster software configuration (each server has):
* Ceph Luminous v12.2.7
* CentOS 7.5.1804 with latest updates
* Kernel 3.10.0-862.9.1
* Each NVMe disk has 4 partitions for 4 OSD processes; in total each server has 24 OSDs, 4 per NVMe disk
* 2 team interfaces configured to separate client and cluster traffic; each team interface uses its own NIC (no cross-NIC teaming)
* Monitors are placed on the servers, using the Intel DC S3520 disks

Ceph cluster config:
* CRUSH map failure domain = rack
* Using bluestore; rocksdb on the same OSD disks, not separated
* Bluestore compression not used
* Using crc32c checksum type

We began investigating the problem and found "towers" of nested functions or shared objects in the flamegraphs, 110-120 frames tall.
The actual Ceph functions run at the top of these "towers" [1].
This is the first time I have seen that many nested functions.

It also turned out that OSDs spend literally 50% of their CPU time on network interaction (msg-worker-0,1,2).

All perf data was collected with "perf record -g -p <PID> -- sleep 60" on already running OSD processes.
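For reference, flamegraphs like these can be produced from perf output along the following lines (a sketch, assuming Brendan Gregg's FlameGraph scripts are checked out locally; <PID> is the OSD process id):

```
# record 60s of call-graph samples from one running OSD
perf record -g -p <PID> -- sleep 60

# fold the stacks and render an SVG flamegraph
# (stackcollapse-perf.pl and flamegraph.pl are from the FlameGraph repo)
perf script | ./stackcollapse-perf.pl > osd.folded
./flamegraph.pl osd.folded > osd.svg
```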

Is this normal, or have we found something working incorrectly? Or is it simply the wrong hardware selection, and we are limited by CPU/NVMe?

Links
[1] https://drive.google.com/open?id=1m8aB0TTFJudyd1gGzNJ1_w6zr6LNgcjy   - OSD0_RW
[2] https://drive.google.com/open?id=1JDr7rLRxSAwP3LMZihAtCXFQqidqIPY_   - OSD0_RR_4k_16jobs_1qd
[3] https://drive.google.com/open?id=19EiHAiQ4OrqhBcImb1QhQaeoIu0XXjVV   - OSD0_RR_4k_64jobs_128qd
[4] https://drive.google.com/open?id=1-6pVEzpkz76fEBl9x8eF242h3Eksx52D   - Perf record data, OSD process, "perf record -g -p <PID> -- sleep 60"

Ceph/Bluestore configs:
[global]
bluestore csum type = crc32c
bluefs buffered io = true

### Async Messenger
ms tcp read timeout = 120
ms async transport type = posix
ms async set affinity = true
ms async affinity cores = 20,21,22,23

[osd]
bluestore cache size = 1073741824
bluestore cache size ssd = 1073741824
bluestore compression algorithm = snappy
bluestore compression mode = none
bluestore cache kv max = 1G
bluestore rocksdb options = compression=kNoCompression,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
osd op num shards = 8
osd op num threads per shard = 2

—
Mike, runs!




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



