> On Thu, 7 Feb 2019 08:17:20 +0100 jesper@xxxxxxxx wrote:
>> Hi List
>>
>> We are in the process of moving to the next use case for our Ceph
>> cluster. Bulk, cheap, slow, erasure-coded CephFS storage was the
>> first - and that works fine.
>>
>> We're currently on luminous / bluestore; if upgrading is deemed
>> likely to change what we're seeing, then please let us know.
>>
>> We have 6 OSD hosts, each with one 1TB Intel S4510 SSD, connected
>> through a Dell PERC H700 (MegaRAID) with BBWC, each disk as a
>> single-disk RAID0 - scheduler set to deadline, nomerges = 1,
>> rotational = 0.
>>
> I'd make sure that the endurance of these SSDs is in line with your
> expected usage.

They are - at the moment :-) And Ceph allows me to change my mind
without interfering with the applications running on top - nice!

>> Each disk "should" give approximately 36K IOPS random write and
>> double that for random read.
>>
> Only locally, latency is your enemy.
>
> Tell us more about your network.

It is a Dell N4032 / N4064 switch stack on 10GBASE-T. All hosts are
on the same subnet, and the NICs are Intel X540s. No jumbo frames and
not much tuning - all kernels are 4.15 (Ubuntu).

Pings from the client to two of the OSD hosts:

--- flodhest.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50157ms
rtt min/avg/max/mdev = 0.075/0.105/0.158/0.021 ms

--- bison.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50139ms
rtt min/avg/max/mdev = 0.078/0.137/0.275/0.032 ms

> rados bench is not the sharpest tool in the shed for this.
> As it needs to allocate stuff to begin with, amongst other things.

Do you suggest longer test runs? (I have sketched one at the bottom
of this mail.)

>> This is also quite far from expected. I have 12GB of memory on the
>> OSD daemon for caching on each host - close to idle cluster - thus
>> 50GB+ for caching with a working set of < 6GB .. this should - in
>> this case - not really be bound by the underlying SSD.
>
> Did you adjust the bluestore parameters (whatever they are this week
> or for your version) to actually use that memory?

According to top, it is picking up the caching memory. We have this
block:

bluestore_cache_kv_max = 214748364800
bluestore_cache_kv_ratio = 0.4
bluestore_cache_meta_ratio = 0.1
bluestore_cache_size_hdd = 13958643712
bluestore_cache_size_ssd = 13958643712
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,compact_on_mount=false

I actually think most of the above was applied with the 10TB hard
drives in mind, not the SSDs .. but I have no idea whether these
settings do "bad things" for us. (A way to verify what the OSDs
actually run with is sketched at the bottom of this mail.)

> Don't use iostat, use atop.
> Small IOPS are extremely CPU intensive, so atop will give you an
> insight as to what might be busy besides the actual storage device.

Thanks, will do. More suggestions are welcome.

Doing some math: say network latency were the only cost driver, and
assume one round-trip per IO per thread. With 16 threads at 0.15 ms
per round-trip, that gives 1000 ms/s / 0.15 ms/IO ≈ 6,667 IOPS per
thread, times 16 threads ≈ 106,667 IOPS. OK, that is at least an
upper bound on expectations in this scenario; I am at 28,207, thus
roughly 4x below it - and I have still not accounted for any OSD or
RBD userspace time in the equation.

Can I directly get service time out of the OSD daemon? That would be
nice, to see how many ms are spent at that end from the OSD's
perspective.
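Partially answering my own question: the OSD admin socket keeps
per-op latency counters, which should be more or less this. A minimal
sketch (run on the host carrying the OSD; osd.0 and the jq filter are
just for illustration):

  # Per-op latency as accounted by the OSD itself.
  # "avgtime" is in seconds; sum/avgcount gives the same number.
  ceph daemon osd.0 perf dump | \
    jq '.osd | {op_latency, op_r_latency, op_w_latency}'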
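Re the longer test runs: this is roughly what I have in mind - a
sketch only, with "testpool" as a placeholder pool name, and 60-second
runs with 4K blocks to match the small-IO case:

  # 60s of 4K writes, 16 concurrent ops; keep the objects around so
  # they can be read back afterwards
  rados bench -p testpool 60 write -t 16 -b 4096 --no-cleanup
  # 60s of random reads against the objects written above
  rados bench -p testpool 60 rand -t 16
  # remove the benchmark objects again
  rados -p testpool cleanup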
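And to check whether the cache settings above are what the OSDs
actually run with (rather than just what sits in ceph.conf) - again a
sketch, using osd.0 on its own host as the example:

  # settings as the running daemon sees them
  ceph daemon osd.0 config show | grep bluestore_cache
  # actual memory use broken down by mempool
  # (bluestore_cache_data, bluestore_cache_onode, ...)
  ceph daemon osd.0 dump_mempools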
Jesper

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com