> > > > I've often wished for some sort of bottleneck finder for ceph. An
> > > > easy way for the system to say where it is experiencing critical
> > > > latencies, e.g. network, journals, OSD data disks, etc. This would
> > > > assist troubleshooting and initial deployments immensely.
> >
> > As mentioned above, it's tricky.
>
> Most certainly desirable, but the ole Mark I eyeball and wetware is
> quite good at spotting these when presented with appropriate input
> like atop.
>
> Are there any stats from the OSD perf dump that could help with that?

I've written a simple collectd wrapper (rough sketch at the end of this
mail) to collect the op_ and subop_ r/w/rw latency counters, but I'm
not certain it will show problems with the underlying storage: so far,
every time I evicted a "slow" OSD (3-5x higher latency than the
others), another one took its place. My guess is that the evicted OSD
had simply got the "short end of the CRUSH" and was loaded with a few
more requests, so the other OSDs ended up waiting for it.

Also, is there any way to correlate the results of dump_historic_ops
between OSDs? I've noticed that in my case the longest ops are usually
"waiting for subops from X, Y", and apart from the timestamps there is
no other information to tie them together, e.g. that an op on osd.1
waited for a subop on osd.5, and that the subop on osd.5 was slow
because of y. (A stab at such a correlation is also sketched at the end
of this mail.)

--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski at efigence.com
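
A minimal sketch of the wrapper, for the curious. It polls a local
OSD's admin socket with "ceph daemon osd.N perf dump"; the counter
names and the {"avgcount": N, "sum": S} layout under the "osd" section
are what my version prints, so they may need adjusting on other
releases.

#!/usr/bin/env python
# Turn the cumulative op_*/subop_* latency counters from one local
# OSD's perf dump into average seconds per op.
import json
import subprocess
import sys

COUNTERS = [
    "op_r_latency", "op_w_latency", "op_rw_latency",
    "subop_latency", "subop_w_latency",
]

def osd_latencies(osd_id):
    """Return {counter: average latency in seconds} for one local OSD."""
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    osd = json.loads(out)["osd"]
    avgs = {}
    for name in COUNTERS:
        c = osd[name]
        # "sum" is cumulative seconds, "avgcount" the number of ops
        avgs[name] = c["sum"] / c["avgcount"] if c["avgcount"] else 0.0
    return avgs

if __name__ == "__main__":
    for name, avg in sorted(osd_latencies(int(sys.argv[1])).items()):
        print("%-16s %.6f s" % (name, avg))

Graphing the per-OSD averages over time is what makes the 3-5x
outliers easy to spot.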
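
And a stab at the correlation question, in case someone has already
done this better. It expects one dump_historic_ops JSON file per OSD
(e.g. "ceph daemon osd.5 dump_historic_ops > osd-5.json"; the
osd-<id>.json naming is just this script's convention). The field
names ("Ops"/"ops", "received_at", "duration", "description") and the
"waiting for subops from [...]" event text are what my version emits
and may well differ elsewhere.

#!/usr/bin/env python
# Pair "waiting for subops from [X,...]" waits on one OSD with ops that
# were in flight on those peer OSDs during the same time window.
import json
import re
import sys
from datetime import datetime, timedelta

SUBOPS = re.compile(r"waiting for subops from \[?([0-9, ]+)\]?")

def load_ops(path):
    osd_id = int(re.search(r"(\d+)", path).group(1))
    with open(path) as f:
        dump = json.load(f)
    return osd_id, dump.get("Ops") or dump.get("ops") or []

def span(op):
    """(start, end) datetimes during which an op was in flight."""
    start = datetime.strptime(op["received_at"], "%Y-%m-%d %H:%M:%S.%f")
    return start, start + timedelta(seconds=float(op["duration"]))

def main(paths):
    ops_by_osd = dict(load_ops(p) for p in paths)
    for osd_id, ops in sorted(ops_by_osd.items()):
        for op in ops:
            # the wait event is buried in the event list; grepping the
            # serialized op is crude but tolerant of layout changes
            m = SUBOPS.search(json.dumps(op))
            if not m:
                continue
            start, end = span(op)
            peers = [int(p) for p in
                     m.group(1).replace(" ", "").split(",") if p]
            for peer in peers:
                for peer_op in ops_by_osd.get(peer, []):
                    p_start, p_end = span(peer_op)
                    if p_start <= end and start <= p_end:  # overlap
                        print("osd.%d %s\n   overlaps osd.%d %s" % (
                            osd_id, op["description"][:70],
                            peer, peer_op["description"][:70]))

if __name__ == "__main__":
    main(sys.argv[1:])

Even a crude time overlap like this would at least show whether the
subop on osd.5 was itself stuck behind other slow ops there, which is
the "because of y" part I'm missing today.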