> > > > I've often wished for some sort of bottleneck finder for ceph. An
> > > > easy way for the system to say where it is experiencing critical
> > > > latencies, e.g. network, journals, OSD data disks, etc. This would
> > > > assist troubleshooting and initial deployments immensely.
> >
> > As mentioned above, it's tricky.
>
> Most certainly desirable, but the ole Mark I eyeball and wetware is
> quite good at spotting these when presented with appropriate input
> like atop.
>
> Are there any stats from the OSD perf dump that could help with that?

I've written a simple collectd wrapper (rough sketch at the end of this
mail) to collect the op_ and subop_ r/w/rw latency counters, but I'm
not certain it will show problems with the underlying storage: so far,
every time I evicted a "slow" OSD (3-5x higher latency than the
others), another one took its place. My guess is that the evicted OSD
had simply got the "short end of the CRUSH" and was loaded with a few
more requests, so the other OSDs ended up waiting for it.

Also, is there any way to correlate the results of dump_historic_ops
between OSDs? I've noticed that in my case the longest ops are usually
"waiting for subops from X, Y", and apart from the timestamps there is
no other information to tie them together, e.g. that an op on osd.1
waited for a subop on osd.5, and that the subop on osd.5 was slow
because of y. (A stab at such a correlation is also sketched at the end
of this mail.)

--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski at efigence.com
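
A minimal sketch of the wrapper, for the curious. It polls a local
OSD's admin socket with "ceph daemon osd.N perf dump"; the counter
names and the {"avgcount": N, "sum": S} layout under the "osd" section
are what my version prints, so they may need adjusting on other
releases.

#!/usr/bin/env python
# Turn the cumulative op_*/subop_* latency counters from one local
# OSD's perf dump into average seconds per op.
import json
import subprocess
import sys

COUNTERS = [
    "op_r_latency", "op_w_latency", "op_rw_latency",
    "subop_latency", "subop_w_latency",
]

def osd_latencies(osd_id):
    """Return {counter: average latency in seconds} for one local OSD."""
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    osd = json.loads(out)["osd"]
    avgs = {}
    for name in COUNTERS:
        c = osd[name]
        # "sum" is cumulative seconds, "avgcount" the number of ops
        avgs[name] = c["sum"] / c["avgcount"] if c["avgcount"] else 0.0
    return avgs

if __name__ == "__main__":
    for name, avg in sorted(osd_latencies(int(sys.argv[1])).items()):
        print("%-16s %.6f s" % (name, avg))

Graphing the per-OSD averages over time is what makes the 3-5x
outliers easy to spot.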
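
And a stab at the correlation question, in case someone has already
done this better. It expects one dump_historic_ops JSON file per OSD
(e.g. "ceph daemon osd.5 dump_historic_ops > osd-5.json"; the
osd-<id>.json naming is just this script's convention). The field
names ("Ops"/"ops", "received_at", "duration", "description") and the
"waiting for subops from [...]" event text are what my version emits
and may well differ elsewhere.

#!/usr/bin/env python
# Pair "waiting for subops from [X,...]" waits on one OSD with ops that
# were in flight on those peer OSDs during the same time window.
import json
import re
import sys
from datetime import datetime, timedelta

SUBOPS = re.compile(r"waiting for subops from \[?([0-9, ]+)\]?")

def load_ops(path):
    osd_id = int(re.search(r"(\d+)", path).group(1))
    with open(path) as f:
        dump = json.load(f)
    return osd_id, dump.get("Ops") or dump.get("ops") or []

def span(op):
    """(start, end) datetimes during which an op was in flight."""
    start = datetime.strptime(op["received_at"], "%Y-%m-%d %H:%M:%S.%f")
    return start, start + timedelta(seconds=float(op["duration"]))

def main(paths):
    ops_by_osd = dict(load_ops(p) for p in paths)
    for osd_id, ops in sorted(ops_by_osd.items()):
        for op in ops:
            # the wait event is buried in the event list; grepping the
            # serialized op is crude but tolerant of layout changes
            m = SUBOPS.search(json.dumps(op))
            if not m:
                continue
            start, end = span(op)
            peers = [int(p) for p in
                     m.group(1).replace(" ", "").split(",") if p]
            for peer in peers:
                for peer_op in ops_by_osd.get(peer, []):
                    p_start, p_end = span(peer_op)
                    if p_start <= end and start <= p_end:  # overlap
                        print("osd.%d %s\n   overlaps osd.%d %s" % (
                            osd_id, op["description"][:70],
                            peer, peer_op["description"][:70]))

if __name__ == "__main__":
    main(sys.argv[1:])

Even a crude time overlap like this would at least show whether the
subop on osd.5 was itself stuck behind other slow ops there, which is
the "because of y" part I'm missing today.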