Hello,

preparing the first production Bluestore, Nautilus (latest) based cluster,
I've run into the same things other people and I have run into before.

Firstly the HW: 3 nodes with 12 SATA HDDs each, IT mode LSI 3008, WAL/DB on
40GB SSD partitions (boy, do I hate the inability of ceph-volume to deal with
raw partitions). The SSDs aren't a bottleneck in any scenario.
A single E5-1650 v3 @ 3.50GHz per node; the CPU isn't a bottleneck in any
scenario either, less than 15% of a core per OSD.
Connectivity is 40Gb/s InfiniBand (IPoIB), no issues here as the numbers
below will show.
Clients are KVMs on Epyc based compute nodes; maybe some more speed could be
squeezed out of them with different VM configs, but the CPU isn't an issue in
the problem cases.

1. 4k random I/O can cause degraded PGs

I've run into the same/similar issue as Nathan Fish here:
https://www.spinics.net/lists/ceph-users/msg526

During the first two tests with 4k random I/O I briefly got degraded PGs as
well, with nothing in CPU or SSD utilization accounting for it. The HDDs were
of course busy at that time. I haven't been able to reproduce this so far,
but it leaves me less than confident.

2. Bluestore caching still broken

When writing data with the fio runs below, it isn't cached on the OSDs.
Worse, existing cached data that gets overwritten is dropped from the cache,
which, while of course correct, can't be free in terms of allocation
overhead. Why not do what any sensible person would expect from experience
with any other cache out there: cache writes in case the data gets read again
soon, and in the case of overwrites reuse the existing allocations.
(See the config sketch in the P.S. for the knob this seems to map to.)

3. Read performance abysmal with direct=0

It's nearly an order of magnitude slower, with no indication why, and
certainly not resource starvation.

FIO command line:
fio --size=32G --ioengine=libaio --invalidate=1 --direct=0 --numjobs=1 --rw=read --name=fiojob --blocksize=4M --iodepth=64

Writes with direct=0:
  write: IOPS=119, BW=479MiB/s (503MB/s)(32.0GiB/68348msec)
with direct=1:
  write: IOPS=139, BW=560MiB/s (587MB/s)(32.0GiB/58519msec)
Not as bad as with reads, but still an inexplicable difference.
FYI, rados bench gets 750MB/s.

Reads from a cold cache with direct=0:
  read: IOPS=40, BW=163MiB/s (171MB/s)(7556MiB/46320msec)
  (gave up, it does not get better)
with direct=1:
  read: IOPS=314, BW=1257MiB/s (1318MB/s)(32.0GiB/26063msec)

Reads from a hot cache with direct=0:
  read: IOPS=199, BW=797MiB/s (835MB/s)(32.0GiB/41130msec)
with direct=1:
  read: IOPS=702, BW=2810MiB/s (2946MB/s)(32.0GiB/11662msec)
Which is as fast as it gets with this setup.

Comments?

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Mobile Inc.
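
P.S. Regarding point 2: a minimal ceph.conf sketch of the BlueStore options
that, to my understanding, govern this caching behaviour. The option names
are taken from the upstream docs; I have not verified their effect on this
cluster yet, so treat this as an assumption rather than a confirmed fix.

[osd]
# BlueStore keeps read data in its cache by default.
bluestore_default_buffered_read = true
# Written data is not retained in the cache unless this is enabled;
# the upstream default is false, which would match what I'm seeing.
bluestore_default_buffered_write = true
# Per-OSD cache size for HDD-backed OSDs, commented-out upstream
# default (1 GiB) shown for reference only.
#bluestore_cache_size_hdd = 1073741824

Flipping bluestore_default_buffered_write on is what I'd try first, assuming
it behaves as documented.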