Hello,

preparing the first production Bluestore, Nautilus (latest) based cluster,
I've run into the same things other people and I have run into before.

Firstly the HW: 3 nodes with 12 SATA HDDs each, IT mode LSI 3008, WAL/DB on
40GB SSD partitions (boy, do I hate the inability of ceph-volume to deal with
raw partitions). The SSDs aren't a bottleneck in any scenario.
A single E5-1650 v3 @ 3.50GHz per node; the CPU isn't a bottleneck in any
scenario either, less than 15% of a core per OSD.
Connectivity is 40Gb/s InfiniBand (IPoIB), no issues here as the numbers
below will show.
Clients are KVMs on Epyc based compute nodes; maybe some more speed could be
squeezed out of them with different VM configs, but the CPU isn't an issue in
the problem cases.

1. 4k random I/O can cause degraded PGs

I've run into the same/similar issue as Nathan Fish here:
https://www.spinics.net/lists/ceph-users/msg526

During the first two tests with 4k random I/O I briefly got degraded PGs as
well, with nothing in CPU or SSD utilization accounting for it. The HDDs were
of course busy at that time. I haven't been able to reproduce this so far,
but it leaves me less than confident.

2. Bluestore caching still broken

When writing data with the fio runs below, it isn't cached on the OSDs.
Worse, existing cached data that gets overwritten is dropped from the cache,
which, while of course correct, can't be free in terms of allocation
overhead. Why not do what any sensible person would expect from experience
with any other cache out there: cache writes in case the data gets read again
soon, and in the case of overwrites reuse the existing allocations.
(See the config sketch in the P.S. for the knob this seems to map to.)

3. Read performance abysmal with direct=0

It's nearly an order of magnitude slower, with no indication why, and
certainly not resource starvation.

FIO command line:
fio --size=32G --ioengine=libaio --invalidate=1 --direct=0 --numjobs=1 --rw=read --name=fiojob --blocksize=4M --iodepth=64

Writes with direct=0:
  write: IOPS=119, BW=479MiB/s (503MB/s)(32.0GiB/68348msec)
with direct=1:
  write: IOPS=139, BW=560MiB/s (587MB/s)(32.0GiB/58519msec)
Not as bad as with reads, but still an inexplicable difference.
FYI, rados bench gets 750MB/s.

Reads from a cold cache with direct=0:
  read: IOPS=40, BW=163MiB/s (171MB/s)(7556MiB/46320msec)
  (gave up, it does not get better)
with direct=1:
  read: IOPS=314, BW=1257MiB/s (1318MB/s)(32.0GiB/26063msec)

Reads from a hot cache with direct=0:
  read: IOPS=199, BW=797MiB/s (835MB/s)(32.0GiB/41130msec)
with direct=1:
  read: IOPS=702, BW=2810MiB/s (2946MB/s)(32.0GiB/11662msec)
Which is as fast as it gets with this setup.

Comments?

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Mobile Inc.
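
P.S. Regarding point 2: a minimal ceph.conf sketch of the BlueStore options
that, to my understanding, govern this caching behaviour. The option names
are taken from the upstream docs; I have not verified their effect on this
cluster yet, so treat this as an assumption rather than a confirmed fix.

[osd]
# BlueStore keeps read data in its cache by default.
bluestore_default_buffered_read = true
# Written data is not retained in the cache unless this is enabled;
# the upstream default is false, which would match what I'm seeing.
bluestore_default_buffered_write = true
# Per-OSD cache size for HDD-backed OSDs, commented-out upstream
# default (1 GiB) shown for reference only.
#bluestore_cache_size_hdd = 1073741824

Flipping bluestore_default_buffered_write on is what I'd try first, assuming
it behaves as documented.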