Re: Pinpointing performance bottleneck / would SSD journals help?

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Daniel Schneller
> Sent: 27 June 2016 17:33
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Pinpointing performance bottleneck / would SSD
> journals help?
> 
> On 2016-06-27 16:01:07 +0000, Lionel Bouton said:
> 
> > Le 27/06/2016 17:42, Daniel Schneller a écrit :
> >> Hi!
> >>
> >> * Network Link saturation.
> >> All links / bonds are well below any relevant load (around 35MB/s or
> >> less)
> > ...
> > Are you sure? On each server you have 12 OSDs, each with a theoretical
> > bandwidth of at least half of 100MB/s (the minimum bandwidth of any
> > reasonable HDD, halved because the journal sits on the same device).
> > That means your total disk bandwidth per server is around 600MB/s.
> 
> Correct. However, I fear that because of lots of random IO going on, we
> won't be coming anywhere near that number, esp. with 3x replication.
> 
> > Bonded links are not perfect aggregation (depending on the mode, one
> > client will either always use the same link or have its traffic
> > imperfectly balanced between the two), so your theoretical network
> > bandwidth is probably closer to 1Gbps (~120MB/s).
> 
> We use layer3+4 hashing to spread traffic based on source and destination
> IP and port information. Benchmarks have shown that with enough parallel
> streams we can saturate the full 250MB/s this ideally produces. You are
> right, of course, that any single TCP connection will never exceed 1Gbps.
> 
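For reference, a minimal sketch of the idea behind a layer3+4 transmit hash
(simplified, not the exact kernel bonding algorithm; the addresses and ports
below are made up):

    import ipaddress

    def pick_slave(src_ip, dst_ip, src_port, dst_port, n_slaves):
        # XOR the ports and the low bits of the IPs, then reduce modulo the
        # number of slaves; every packet of one flow lands on the same slave.
        ip_bits = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
        return ((src_port ^ dst_port) ^ (ip_bits & 0xffff)) % n_slaves

    # Same 5-tuple -> same slave (capped at 1Gbps); a different source port
    # may hash to the other slave, which is why many parallel streams help.
    print(pick_slave("10.0.0.1", "10.0.0.2", 40000, 6800, 2))
    print(pick_slave("10.0.0.1", "10.0.0.2", 40001, 6800, 2))
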
> > What could happen is that the 35MB/s is an average over a large period
> > (several seconds), it's probably peaking at 120MB/s during short bursts.
> 
> That thought crossed my mind early on, too, but these values are based on
> /proc/net/dev, which has byte counters for each network device. The
> statistics are gathered by taking the difference between the current sample
> and the last, so they do not suffer from samples being taken at relatively
> long intervals.
> 
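A short sampling loop over /proc/net/dev makes bursts at that granularity
visible; a minimal sketch (the interface name is a placeholder, adjust to
your bond):

    import time

    def read_counters(iface):
        # /proc/net/dev: after the "iface:" prefix, field 0 is rx bytes and
        # field 8 is tx bytes (cumulative counters).
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])
        raise ValueError("interface not found: " + iface)

    iface, interval = "bond0", 1.0   # placeholder interface name
    rx0, tx0 = read_counters(iface)
    while True:
        time.sleep(interval)
        rx1, tx1 = read_counters(iface)
        print("%s  rx %6.1f MB/s  tx %6.1f MB/s"
              % (iface, (rx1 - rx0) / interval / 1e6, (tx1 - tx0) / interval / 1e6))
        rx0, tx0 = rx1, tx1
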
> > I wouldn't use less than 10Gbps for both the cluster and public
> > networks in your case.
> 
> I whole-heartedly agree... Certainly sensible, but for now we have to make
> do with the infrastructure we have. Still, based on the data we have so
> far, the network at least doesn't jump out at me as a (major) contributor
> to the slowness we see in this current scenario.
> 
> 
> > You didn't say how many VMs are running : the rkB/s and wkB/s seem
> > very low (note that for write intensive tasks your VM is reading quite
> > a
> > bit...) but if you have 10 VMs or more battling for read and write
> > access this way it wouldn't be unexpected. As soon as latency rises
> > for one reason or another (here it would be network latency) you can
> > expect the total throughput of random accesses to plummet.
> 
> In total there are about 25 VMs, though many of them are less I/O bound
> than MongoDB and Elasticsearch. As for the comparatively high read load, I
> agree, but I cannot really explain that in detail at the moment.
> 
> In general I would be very much interested in diagnosing the underlying
> bare-metal layer without making too many assumptions about what clients
> are actually doing. In this case we can look into the VMs, but in general
> it would be ideal to pinpoint a bottleneck at the "lower" levels. Any
> improvement there would benefit all client software.
> 

You need to run iostat on the OSD nodes themselves and see what the disks
are doing. You stated that they are doing roughly 180 IOPS per disk, which
suggests they are highly saturated and likely to be the cause of the problem.
I'm guessing you will also see very high queue depths per disk, which is
normally what causes the high latency.
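
Something like "iostat -x 5" on each OSD node shows this directly (r/s, w/s,
avgqu-sz, %util per disk). If you prefer to script it, here is a rough sketch
against /proc/diskstats (the device names are placeholders for your OSD data
disks):

    import time

    def disk_stats(devices):
        # /proc/diskstats fields (0-indexed after splitting): 3 = reads
        # completed, 7 = writes completed, 12 = ms spent doing I/O,
        # 13 = weighted ms spent doing I/O (what iostat's avgqu-sz is based on).
        out = {}
        with open("/proc/diskstats") as f:
            for line in f:
                p = line.split()
                if p[2] in devices:
                    out[p[2]] = (int(p[3]), int(p[7]), int(p[12]), int(p[13]))
        return out

    devs = ["sdb", "sdc", "sdd"]   # placeholder: list your OSD data disks
    interval = 5.0
    prev = disk_stats(devs)
    time.sleep(interval)
    cur = disk_stats(devs)
    for d in devs:
        ios = (cur[d][0] - prev[d][0]) + (cur[d][1] - prev[d][1])
        busy_ms = cur[d][2] - prev[d][2]
        queued_ms = cur[d][3] - prev[d][3]
        print("%s: %4.0f IOPS  util %3.0f%%  avg queue %5.1f"
              % (d, ios / interval, 100.0 * busy_ms / (interval * 1000),
                 queued_ms / (interval * 1000)))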

If you add SSD journals and a large proportion of your IO is writes, then you
may see an improvement. But you may also be at the point where you simply need
more disks to provide the required performance.
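
For a rough back-of-envelope check (every number below is an assumption, plug
in your own): with 3x replication and co-located journals, each client write
turns into roughly six disk writes, so sustainable client write IOPS is only a
small fraction of the raw spindle IOPS.

    # Back-of-envelope write capacity estimate; all inputs are assumptions.
    hdd_iops        = 100   # sustained random IOPS of one 7.2k SATA/NL-SAS HDD
    osds_per_node   = 12
    nodes           = 4     # placeholder: set to your actual node count
    replication     = 3
    journal_penalty = 2     # journal on the same spindle doubles each write

    raw_iops = hdd_iops * osds_per_node * nodes
    client_write_iops = raw_iops / (replication * journal_penalty)
    print("~%d sustainable client write IOPS cluster-wide" % client_write_iops)
    # Moving journals to SSD pushes the penalty towards 1, i.e. roughly doubles
    # write capacity; beyond that, only more spindles (or SSD OSDs) help.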

> Cheers,
> Daniel
> 
> 
> --
> Daniel Schneller
> Principal Cloud Engineer
> 
> CenterDevice GmbH
> https://www.centerdevice.de
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



