On Thu, 7 Aug 2014 14:21:30 +0200 Mariusz Gronczewski wrote:

> > > I've often wished for some sort of bottleneck finder for ceph. An
> > > easy way for the system to say where it is experiencing critical
> > > latencies, e.g. network, journals, OSD data disks, etc. This would
> > > assist troubleshooting and initial deployments immensely.
> > 
> > As mentioned above, it's tricky.
> > Most certainly desirable, but the ole Mark I eyeball and wetware is
> > quite good at spotting these when presented with appropriate input
> > like atop.
> 
> Are there any stats from the OSD perf dump that could help with that?

Help, yes. By themselves, no.
Firstly, those values are transient, so they need to be sampled frequently
and put into the right correlation.
I suppose if the OP had used "ceph osd perf" during the tests, spotting
the suspect OSD and confirming it with atop or iostat might have been
quicker.

> I wrote a simple wrapper for collectd to get the op_ and subop_
> rw/w/r_latency counters, but I'm not certain it will show problems with
> the underlying storage; so far, every time I evicted a "slow" OSD (3-5x
> the latency of the other ones) another took its place.
> 
> I'm guessing that's probably because that OSD got the "short end of the
> CRUSH" and was loaded with a bit more requests, so the other OSDs were
> waiting for that one.
> 
If the problem (the hotspot) just migrates to another OSD, that is indeed
pretty probable and a reasonable assumption.

How many OSDs are we talking about in your case?

I've got a 2-node cluster, each node with 2 OSDs, SSD journals, and OSD
data backed by 11-disk RAID6 behind a RAID controller with 4GB of HW
cache. So things normally look very impressive, like this (an average of
1MB/s and 200 IOPS at the time):
---
# ceph osd perf
osdid fs_commit_latency(ms) fs_apply_latency(ms)
    0                    22                    2
    1                    23                    4
    2                    24                    3
    3                    22                    5
---

In this particular setup, knowing the capabilities of the hardware
involved very well, if the OSDs were to vary slightly (10% or so) it is
probably PG imbalance (bad luck of the CRUSH draw). That is not surprising
with 4 OSDs and, depending on the test or use case, something I've seen
and could reproduce.
A much larger imbalance would suggest a wonky HDD in one of the OSD RAID
sets or a RAID rebuild (something I would of course already know about).
^o^

In the case of the OP the problem stuck to a specific OSD (and probably
would have been verifiable with speed tests of that disk) and went away
when that OSD was removed.

Ceph could (using probably not insignificant computational resources)
take into account all the ops issued to an OSD in <timeframe> and then
put the performance numbers in relation to that load.
So if, with 10 OSDs, 9 got 5% of all ops each and one got 55% in the
sample period, crappy latency on that one is to be expected and should be
corrected for.
If, however, the distribution was equal AND the hardware is equal
(something Ceph is likely never to know), then sanitized performance
counters would pick up a slow OSD easily.

> Also, is there any way to correlate the results of dump_historic_ops
> between OSDs? I've noticed that in my case the longest ones are usually
> "waiting for subops from X, Y", and except for the time there is no
> other information to correlate that, for example, an op on osd.1 waited
> for a subop on osd.5 and that subop on osd.5 was slow because of y.
> 
No idea, this calls for the Ceph engineers. ^o^

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/
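
For reference, a minimal sketch (Python) of the kind of frequent "ceph osd
perf" sampling mentioned above: it polls the JSON output every few seconds
and flags OSDs whose commit latency sticks out from the cluster median.
The JSON key names (osd_perf_infos, perf_stats, commit_latency_ms) and the
3x/20ms thresholds are assumptions here, so check them against the output
of your Ceph release before relying on it.
---
#!/usr/bin/env python
# Sketch: poll "ceph osd perf" as JSON and flag OSDs whose commit latency
# sticks out. Key names (osd_perf_infos, perf_stats, commit_latency_ms)
# may differ between Ceph releases - verify against "ceph osd perf -f json".
import json
import subprocess
import time


def osd_perf():
    out = subprocess.check_output(["ceph", "osd", "perf", "-f", "json"])
    data = json.loads(out)
    # some releases wrap the list in an "osdstats" object
    infos = data.get("osd_perf_infos") or \
        data.get("osdstats", {}).get("osd_perf_infos", [])
    return dict((i["id"], i["perf_stats"]["commit_latency_ms"]) for i in infos)


def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


while True:
    latencies = osd_perf()
    if latencies:
        med = median(latencies.values())
        for osd_id, ms in sorted(latencies.items()):
            # arbitrary thresholds: 3x the median and at least 20ms absolute
            if ms > max(3 * med, 20):
                print("osd.%d commit latency %dms (cluster median %dms)"
                      % (osd_id, ms, med))
    time.sleep(5)
---
Fed into collectd/graphite instead of print(), something like this would
give the history needed to see whether a slow OSD stays slow or the
slowness just migrates around.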
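
Along the same lines, a sketch of what a wrapper around the admin socket
counters could look like: it takes the delta of the latency counters
("sum" over "avgcount") between two "ceph daemon osd.N perf dump" calls,
so the result is the average latency for just that interval rather than
since OSD start. It has to run on the node hosting the OSD; the counter
names, the assumption that "sum" is in seconds, and the osd.0 / 10-second
placeholders should all be checked against your own "perf dump" output.
---
#!/usr/bin/env python
# Sketch: per-interval average latencies from an OSD's admin socket, computed
# as the delta of the latency counters between two "perf dump" samples.
# Counter names and the unit of "sum" (seconds, as far as I can tell) are
# assumptions - check them against your own "ceph daemon osd.N perf dump".
import json
import subprocess
import time

OSD_ID = 0        # placeholder: an OSD running on this node
INTERVAL = 10     # seconds between samples
COUNTERS = ("op_latency", "subop_latency", "op_r_latency", "op_w_latency")


def perf_dump(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out)["osd"]


prev = perf_dump(OSD_ID)
while True:
    time.sleep(INTERVAL)
    cur = perf_dump(OSD_ID)
    for name in COUNTERS:
        d_sum = cur[name]["sum"] - prev[name]["sum"]
        d_cnt = cur[name]["avgcount"] - prev[name]["avgcount"]
        if d_cnt > 0:
            # average latency over this interval only, in milliseconds
            print("osd.%d %-15s %7.1f ms over %d ops"
                  % (OSD_ID, name, 1000.0 * d_sum / d_cnt, d_cnt))
    prev = cur
---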
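
And a toy illustration of the "correct for load" argument above, using the
hypothetical 10-OSD numbers from the text (9 OSDs with ~5% of the ops each,
one with 55%): an OSD is only flagged as genuinely slow if its latency
sticks out while its share of the ops does not. All numbers and thresholds
are made up for the example.
---
# Toy example of correcting latency for load, with the hypothetical numbers
# from the text: 10 OSDs, 9 of which got ~5% of the ops and one 55%.
ops = dict((i, 50) for i in range(9))     # ops per OSD in the sample period
ops[9] = 550                              # the OSD that got the short CRUSH end
lat_ms = dict((i, 20) for i in range(9))  # average latency per OSD in that period
lat_ms[9] = 120                           # high latency, but explained by the load

total_ops = float(sum(ops.values()))
baseline = min(lat_ms.values())
fair_share = 1.0 / len(ops)

for osd_id in sorted(ops):
    share = ops[osd_id] / total_ops
    if lat_ms[osd_id] > 3 * baseline:
        if share > 2 * fair_share:
            print("osd.%d: %dms but %.0f%% of the ops - probably just overloaded"
                  % (osd_id, lat_ms[osd_id], 100 * share))
        else:
            print("osd.%d: %dms at only %.0f%% of the ops - genuinely slow?"
                  % (osd_id, lat_ms[osd_id], 100 * share))
---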