Re: Pinpointing performance bottleneck / would SSD journals help?

On 2016-06-27 16:01:07 +0000, Lionel Bouton said:

On 27/06/2016 17:42, Daniel Schneller wrote:
Hi!

* Network Link saturation.
All links / bonds are well below any relevant load (around 35MB/s or
less)
...
Are you sure? On each server you have 12 OSDs with a theoretical
bandwidth of at least half of 100MB/s each (the minimum bandwidth of any
reasonable HDD, halved because the journal is on the same device).
That means your total disk bandwidth per server is 600MB/s.

Correct. However, I fear that with lots of random I/O going on we won't
come anywhere near that number, especially with 3x replication.
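
To make the numbers explicit, here is the back-of-envelope arithmetic behind
the 600MB/s figure and the extra write amplification from replication, as a
small Python sketch. The per-HDD throughput is the optimistic sequential
figure from your mail; real random-I/O workloads will land far below it.

osds_per_server = 12
hdd_mb_s        = 100   # optimistic sequential throughput of one HDD
journal_factor  = 2     # journal on the same spindle: data is written twice
replication     = 3     # size=3 pool: every client write hits three OSDs

per_server_disk_bw = osds_per_server * hdd_mb_s / journal_factor
print("Aggregate disk bandwidth per server: %.0f MB/s" % per_server_disk_bw)

# On the write path each client byte turns into replication * journal_factor
# bytes of disk traffic somewhere in the cluster, which is why client-visible
# write throughput sits well below the raw disk numbers.
print("Write amplification (journal + replication): %dx"
      % (replication * journal_factor))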

Bonded links don't aggregate perfectly (depending on the mode, one
client will either always use the same link or have its traffic
imperfectly balanced between the two), so your theoretical network
bandwidth is probably closer to 1Gbps (~120MB/s).

We use layer3+4 to spread traffic based on source and destination
IP and port information. Benchmarks have shown that with enough
parallel streams we can saturate the full 250MB/s this ideally
produces. You are right, of course, that any single TCP connection
will never exceed 1Gbps.
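
To illustrate the point, here is a simplified model (not the exact kernel
hash) of how a layer3+4 bond picks a slave: each flow's source/destination
IP and port tuple maps to a fixed link, so a single TCP connection is pinned
to one 1Gbps link while many parallel streams spread across both. The
addresses and ports below are made up.

from collections import Counter

def pick_slave(src_ip, dst_ip, src_port, dst_port, n_slaves=2):
    # Hash the flow's 4-tuple and map it onto one of the bonded links.
    return hash((src_ip, dst_ip, src_port, dst_port)) % n_slaves

# A single connection (fixed 4-tuple) always lands on the same link ...
print(pick_slave("10.0.0.1", "10.0.0.2", 40001, 6800))

# ... while many connections with different source ports spread out.
print(Counter(pick_slave("10.0.0.1", "10.0.0.2", port, 6800)
              for port in range(40000, 40100)))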

What could happen is that the 35MB/s is an average over a large period
(several seconds), it's probably peaking at 120MB/s during short bursts.

That thought crossed my mind early on, too, but these values are based on
/proc/net/dev, which exposes cumulative counters for each network device.
The statistics are computed from the difference between the current sample
and the previous one, so no traffic is missed even if samples are taken at
relatively long intervals.
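
For completeness, a minimal sketch of that sampling approach (the interface
name and interval below are placeholders):

import time

def read_bytes(iface):
    # /proc/net/dev: rx bytes is the first field after the interface name,
    # tx bytes the ninth.
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":")[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError("interface %s not found" % iface)

iface = "bond0"   # placeholder interface name
interval = 5      # seconds between samples

rx1, tx1 = read_bytes(iface)
time.sleep(interval)
rx2, tx2 = read_bytes(iface)

# Because the counters are cumulative, the delta captures all traffic that
# passed between the two samples.
print("rx: %.1f MB/s, tx: %.1f MB/s"
      % ((rx2 - rx1) / interval / 1e6, (tx2 - tx1) / interval / 1e6))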

I wouldn't use less than 10Gbps for both the cluster and public networks
in your case.

I wholeheartedly agree... certainly sensible, but for now we have to make
do with the infrastructure we have. Still, based on the data we have so far,
the network at least doesn't jump out at me as a (major) contributor to the
slowness we see in this particular scenario.


You didn't say how many VMs are running: the rkB/s and wkB/s figures seem
very low (note that for write-intensive tasks your VM is reading quite a
bit...), but if you have 10 or more VMs battling for read and write
access this way it wouldn't be unexpected. As soon as latency rises for
one reason or another (here it would be network latency) you can expect
the total throughput of random accesses to plummet.

In total there are about 25 VMs; however, many of them are less I/O-bound
than MongoDB and Elasticsearch. As for the comparatively high read load,
I agree, but I cannot really explain that in detail at the moment.

In general I would be very interested in diagnosing the underlying
bare-metal layer without making too many assumptions about what the clients
are actually doing. In this case we can look into the VMs, but ideally
we would pinpoint a bottleneck at the "lower" levels, since any
improvement there would benefit all client software.
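
As a starting point for that kind of host-level view, something along these
lines reads /proc/diskstats twice and reports per-disk throughput and
utilisation from the counter deltas, essentially what iostat -x shows, but
independent of whatever the clients are doing. The device names are an
assumption for the 12 OSD data disks per server.

import time

def diskstats():
    # Parse /proc/diskstats: sectors read/written and time spent doing I/O.
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            stats[fields[2]] = {
                "rd_sectors": int(fields[5]),
                "wr_sectors": int(fields[9]),
                "io_ms":      int(fields[12]),
            }
    return stats

devices = ["sd%s" % c for c in "bcdefghijklm"]   # assumed OSD data disks
interval = 5

s1 = diskstats()
time.sleep(interval)
s2 = diskstats()

for dev in devices:
    if dev not in s1 or dev not in s2:
        continue
    rd = (s2[dev]["rd_sectors"] - s1[dev]["rd_sectors"]) * 512 / interval / 1e6
    wr = (s2[dev]["wr_sectors"] - s1[dev]["wr_sectors"]) * 512 / interval / 1e6
    util = (s2[dev]["io_ms"] - s1[dev]["io_ms"]) / (interval * 10.0)  # percent
    print("%s: read %.1f MB/s, write %.1f MB/s, util %.0f%%"
          % (dev, rd, wr, util))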

Cheers,
Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



