Re: Low random write performance on All-NVMe Ceph cluster


 





On 08/09/2018 11:48 AM, Mike A wrote:
Hello!

On 9 Aug 2018, at 19:26, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:



On 08/09/2018 10:29 AM, Mike A wrote:
Hello.
On 9 Aug 2018, at 16:10, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:

Hi Mike,

The first thing I would try is bumping up the bluestore cache size.  No guarantee, but it's easy to change and you've got tons of memory. Usually during 4K random writes, bluestore is throttled by the kv sync thread and fetching onodes from rocksdb cache (or worse, disk) can really hurt 4k random write performance.  If your tests are to a large RBD volume, 1GB of bluestore cache is going to slow you down on NVMe.

With 96GB of RAM divided between 6 OSDs, you should be able to give each bluestore instance more like 4-8GB of cache with room to spare.  FWIW, I've been working on making the OSD smarter about how it divvies up memory, so eventually you'll no longer need to set mins/maxes or ratios for the different caches, just an OSD target memory size.  The PRs for that are still being merged though.
I have 4 OSD processes per NVMe disk. Right now each OSD process has eaten 3.4 GB VIRT and 1.6 GB RES RAM: 13.6 GB and 6.4 GB per NVMe disk.
On console:
# free -mh
               total        used        free      shared  buff/cache   available
Mem:        92G         42G        575M         27M         50G         49G

Probably the most important thing on the performance side is to try to keep onodes in bluestore's meta cache while not swapping out any of the rocksdb index/filter blocks.  With 24 OSDs, I imagine you should still be able to do 2GB of bluestore cache per OSD.  Personally I'd give the majority of that to bluestore meta cache.

I will try.


Do you think increasing the bluestore_cache will not lead to swap being used?

It's possible it could during a recovery scenario, but that's more a function of recovery simply taking a ton of memory.  Neha did some work to bound this in master which I'm hoping will improve things.

Or, for example, I can reduce the number of OSDs per disk to 2 and increase the bluestore_cache.

You could.  Depending on how many PGs per OSD you are targeting, this might reduce overall memory consumption, but you'd get less parallelism benefit from multiple OSDs.

For now I'd be tempted to just try bumping the bluestore cache size up to 2GB per OSD and making sure the ratios are set so that a large portion of the cache is for meta cache and see if it makes any difference.  It's a quick change and you don't have to rebuild your cluster (and can just set it back if you don't like it).

That "a large portion" for meta cache it’s how much: .10, .20, .65?

It's sort of a balancing act (why we are automatically tuning these in the future). I'd go with a 2GB cache and maybe .75 or .8 for meta (or even higher, though you might suffer if you go too high and rocksdb indexes/filters get pushed out of cache).
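
For reference, a minimal ceph.conf sketch of that suggestion, assuming a 2GB cache and a .8 meta ratio (illustrative values, not a tested recommendation; the OSDs need a restart to pick them up):

[osd]
# 2GB total bluestore cache per OSD (overrides the hdd/ssd variants)
bluestore cache size = 2147483648
# most of it for the onode/meta cache, the rest split between the rocksdb block cache and data
bluestore cache meta ratio = 0.80
bluestore cache kv ratio = 0.15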

One clue that this might help is if you are seeing a lot of reads from disk during writes.
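
A quick way to see that (rough sketch; substitute your own device names): run the 4k random write test and watch the NVMe devices at the same time, e.g.

iostat -xm 1 nvme0n1 nvme1n1 nvme2n1

If r/s and rMB/s stay high while the workload is pure writes, onodes are probably being fetched from rocksdb/disk instead of the meta cache. Comparing bluestore_onode_hits vs bluestore_onode_misses in "ceph daemon osd.N perf dump" should tell the same story.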

Mark




You might also want to try out either my or adam's wallclock profiler to see where real time (instead of cpu time) is being spent:

https://github.com/markhpc/gdbpmp

https://github.com/aclamk/wallclock
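
If you try gdbpmp, the invocation I'd expect is roughly the following (from memory, so treat the exact flags as an assumption and check the README in the repo):

./gdbpmp.py -p <OSD PID> -n 1000 -o osd.gdbpmp
./gdbpmp.py -i osd.gdbpmp

It samples by attaching gdb, so the OSD is briefly paused on every sample; better to run it against a test workload than production traffic.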
Very interesting! I'll definitely try!

Mark

On 08/08/2018 08:25 PM, Mike A wrote:
Hello!
Recently we assembled and configured our new Ceph cluster using NVMe disks.
For now the cluster has only 3 servers, each with 6 NVMe Intel DC P4500 4TB disks, 18 disks in total.
We ran into a problem: on a 100% random write test with 4k blocks, total cluster performance is no more than 65000 IOPS.
During this test the cluster is actually CPU limited, with each core loaded to about 90%.
Ceph cluster hardware (each server have):
* 2 x CPU Intel Xeon Gold 6148 20core/2.4GHz, maximum performance settings, HT enabled
* 96GB RAM
* 6 x NVMe Intel DC P4500 4TB disks, LBA formatted to 4k blocks, 3 disks per NUMA node
* 2 x NIC Mellanox ConnectX-4 Lx EN 2port 25Gbit
* 2 x SATA SSD Intel DC S3520 for OS boot and the MON database
All servers are installed in 3 racks, connected to the same 2 Arista switches with 100Gbit ports.
Ceph cluster software configuration (each server have):
* Ceph Luminous v12.2.7
* Centos 7.5.1804 with latest updates
* Kernel is 3.10.0-862.9.1
* Each NVMe disk has 4 partitions for 4 OSD processes. In total each server has 24 OSDs, 4 per NVMe disk.
* 2 team interfaces are configured, to separate client and cluster traffic. Each team interface is built on its own NIC (no cross-NIC teaming).
* The monitors are placed on these servers and use the Intel DC S3520 disks.
Ceph cluster config:
* CRUSHMAP failure domain = RACK
* Using bluestore, rocksdb on the same OSD disks, not separated.
* Bluestore compression not used
* Using crc32c sum type
We began to investigate the problem and found "towers" of nested functions/shared objects in the flamegraphs, 110-120 frames high.
At the top of these "towers", Ceph's own functions actually do the work [1].
This is the first time I have seen such a number of nested functions.
It also turned out that the OSDs spend literally 50% of CPU on network interaction (msg-worker-0,1,2).
All perf data was collected like this: perf record -g -p <PID> -- sleep 60, against already running OSD processes.
Is this normal, or did we find something working wrong? Or is it just a wrong hardware choice and we are limited by CPU/NVMe?
Links
[1] https://drive.google.com/open?id=1m8aB0TTFJudyd1gGzNJ1_w6zr6LNgcjy   - OSD0_RW
[2] https://drive.google.com/open?id=1JDr7rLRxSAwP3LMZihAtCXFQqidqIPY_   - OSD0_RR_4k_16jobs_1qd
[3] https://drive.google.com/open?id=19EiHAiQ4OrqhBcImb1QhQaeoIu0XXjVV   - OSD0_RR_4k_64jobs_128qd
[4] https://drive.google.com/open?id=1-6pVEzpkz76fEBl9x8eF242h3Eksx52D   - Perf record data, process OSD, "perf record -g -p <PID> -- sleep 60"
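
For anyone wanting to reproduce the SVGs above, a sketch of how such flamegraphs are typically built from this perf data (using Brendan Gregg's FlameGraph scripts; the script paths are assumptions based on a local checkout):

perf record -g -p <PID> -- sleep 60
perf script > osd.stacks
./stackcollapse-perf.pl osd.stacks > osd.folded
./flamegraph.pl osd.folded > osd0_rw.svg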
Ceph/Bluestore configs:
[global]
bluestore csum type = crc32c
bluefs buffered io = true
### Async Messenger
ms tcp read timeout = 120
ms async transport type = posix
ms async set affinity = true
ms async affinity cores = 20,21,22,23
[osd]
bluestore cache size = 1073741824
bluestore cache size ssd = 1073741824
bluestore compression algorithm = snappy
bluestore compression mode = none
bluestore cache kv max = 1G
bluestore rocksdb options = compression=kNoCompression,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
osd op num shards = 8
osd op num threads per shard = 2
—
Mike, runs!
--

—
Mike, runs!




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


