> On Thu, 7 Feb 2019 08:17:20 +0100 jesper@xxxxxxxx wrote:
>> Hi List
>>
>> We are in the process of moving to the next use case for our Ceph
>> cluster. Bulk, cheap, slow, erasure-coded CephFS storage was the
>> first - and that works fine.
>>
>> We're currently on luminous / bluestore; if upgrading is deemed
>> likely to change what we're seeing, then please let us know.
>>
>> We have 6 OSD hosts, each with one 1TB Intel S4510 SSD, connected
>> through a Dell PERC H700 (MegaRAID) with BBWC, each disk as a
>> single-disk RAID0 - scheduler set to deadline, nomerges = 1,
>> rotational = 0.
>>
> I'd make sure that the endurance of these SSDs is in line with your
> expected usage.

They are - at the moment :-) And Ceph allows me to change my mind
without interfering with the applications running on top - nice!

>> Each disk "should" give approximately 36K IOPS random write and
>> double that for random read.
>>
> Only locally, latency is your enemy.
>
> Tell us more about your network.

It is a Dell N4032 / N4064 switch stack on 10GBASE-T. All hosts are
on the same subnet, and the NICs are Intel X540s. No jumbo frames and
not much tuning - all kernels are 4.15 (Ubuntu).

Pings from the client to two of the OSD hosts:

--- flodhest.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50157ms
rtt min/avg/max/mdev = 0.075/0.105/0.158/0.021 ms

--- bison.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50139ms
rtt min/avg/max/mdev = 0.078/0.137/0.275/0.032 ms

> rados bench is not the sharpest tool in the shed for this.
> As it needs to allocate stuff to begin with, amongst other things.

Do you suggest longer test runs? (I have sketched one at the bottom
of this mail.)

>> This is also quite far from expected. I have 12GB of memory on the
>> OSD daemon for caching on each host - close to idle cluster - thus
>> 50GB+ for caching with a working set of < 6GB .. this should - in
>> this case - not really be bound by the underlying SSD.
>
> Did you adjust the bluestore parameters (whatever they are this week
> or for your version) to actually use that memory?

According to top, it is picking up the caching memory. We have this
block:

bluestore_cache_kv_max = 214748364800
bluestore_cache_kv_ratio = 0.4
bluestore_cache_meta_ratio = 0.1
bluestore_cache_size_hdd = 13958643712
bluestore_cache_size_ssd = 13958643712
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,compact_on_mount=false

I actually think most of the above was applied with the 10TB hard
drives in mind, not the SSDs .. but I have no idea whether these
settings do "bad things" for us. (A way to verify what the OSDs
actually run with is sketched at the bottom of this mail.)

> Don't use iostat, use atop.
> Small IOPS are extremely CPU intensive, so atop will give you an
> insight as to what might be busy besides the actual storage device.

Thanks, will do. More suggestions are welcome.

Doing some math: say network latency were the only cost driver, and
assume one round-trip per IO per thread. With 16 threads at 0.15 ms
per round-trip, that gives 1000 ms/s / 0.15 ms/IO ≈ 6,667 IOPS per
thread, times 16 threads ≈ 106,667 IOPS. OK, that is at least an
upper bound on expectations in this scenario; I am at 28,207, thus
roughly 4x below it - and I have still not accounted for any OSD or
RBD userspace time in the equation.

Can I directly get service time out of the OSD daemon? That would be
nice, to see how many ms are spent at that end from the OSD's
perspective.
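Partially answering my own question: the OSD admin socket keeps
per-op latency counters, which should be more or less this. A minimal
sketch (run on the host carrying the OSD; osd.0 and the jq filter are
just for illustration):

  # Per-op latency as accounted by the OSD itself.
  # "avgtime" is in seconds; sum/avgcount gives the same number.
  ceph daemon osd.0 perf dump | \
    jq '.osd | {op_latency, op_r_latency, op_w_latency}'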
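Re the longer test runs: this is roughly what I have in mind - a
sketch only, with "testpool" as a placeholder pool name, and 60-second
runs with 4K blocks to match the small-IO case:

  # 60s of 4K writes, 16 concurrent ops; keep the objects around so
  # they can be read back afterwards
  rados bench -p testpool 60 write -t 16 -b 4096 --no-cleanup
  # 60s of random reads against the objects written above
  rados bench -p testpool 60 rand -t 16
  # remove the benchmark objects again
  rados -p testpool cleanup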
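And to check whether the cache settings above are what the OSDs
actually run with (rather than just what sits in ceph.conf) - again a
sketch, using osd.0 on its own host as the example:

  # settings as the running daemon sees them
  ceph daemon osd.0 config show | grep bluestore_cache
  # actual memory use broken down by mempool
  # (bluestore_cache_data, bluestore_cache_onode, ...)
  ceph daemon osd.0 dump_mempools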
Jesper

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com