On Thu, 7 Aug 2014 14:21:30 +0200 Mariusz Gronczewski wrote:

> > > I've often wished for some sort of bottleneck finder for ceph. An
> > > easy way for the system to say where it is experiencing critical
> > > latencies, e.g. network, journals, OSD data disks, etc. This would
> > > assist troubleshooting and initial deployments immensely.
> > 
> > As mentioned above, it's tricky.
> > Most certainly desirable, but the ole Mark I eyeball and wetware is
> > quite good at spotting these when presented with appropriate input
> > like atop.
> 
> Are there any stats from the OSD perf dump that could help with that?

Help, yes. By themselves, no.
Firstly, those values are transient, so they need to be sampled frequently
and put into the right correlation.
I suppose if the OP had used "ceph osd perf" during the tests, spotting
the suspect OSD and confirming it with atop or iostat might have been
quicker.

> I wrote a simple wrapper for collectd to get the op_ and subop_
> rw/w/r_latency counters, but I'm not certain it will show problems with
> the underlying storage; so far, every time I evicted a "slow" OSD (3-5x
> the latency of the other ones) another took its place.
> 
> I'm guessing that's probably because that OSD got the "short end of the
> CRUSH" and was loaded with a bit more requests, so the other OSDs were
> waiting for that one.
> 
If the problem (the hotspot) just migrates to another OSD, that is indeed
pretty probable and a reasonable assumption.

How many OSDs are we talking about in your case?

I've got a 2-node cluster, each node with 2 OSDs, SSD journals, and OSD
data backed by 11-disk RAID6 behind a RAID controller with 4GB of HW
cache. So things normally look very impressive, like this (an average of
1MB/s and 200 IOPS at the time):
---
# ceph osd perf
osdid fs_commit_latency(ms) fs_apply_latency(ms)
    0                    22                    2
    1                    23                    4
    2                    24                    3
    3                    22                    5
---

In this particular setup, knowing the capabilities of the hardware
involved very well, if the OSDs were to vary slightly (10% or so) it is
probably PG imbalance (bad luck of the CRUSH draw). That is not surprising
with 4 OSDs and, depending on the test or use case, something I've seen
and could reproduce.
A much larger imbalance would suggest a wonky HDD in one of the OSD RAID
sets or a RAID rebuild (something I would of course already know about).
^o^

In the case of the OP the problem stuck to a specific OSD (and probably
would have been verifiable with speed tests of that disk) and went away
when that OSD was removed.

Ceph could (using probably not insignificant computational resources)
take into account all the ops issued to an OSD in <timeframe> and then
put the performance numbers in relation to that load.
So if, with 10 OSDs, 9 got 5% of all ops each and one got 55% in the
sample period, crappy latency on that one is to be expected and should be
corrected for.
If, however, the distribution was equal AND the hardware is equal
(something Ceph is likely never to know), then sanitized performance
counters would pick up a slow OSD easily.

> Also, is there any way to correlate the results of dump_historic_ops
> between OSDs? I've noticed that in my case the longest ones are usually
> "waiting for subops from X, Y", and except for the time there is no
> other information to correlate that, for example, an op on osd.1 waited
> for a subop on osd.5 and that subop on osd.5 was slow because of y.
> 
No idea, this calls for the Ceph engineers. ^o^

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/
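
For reference, a minimal sketch (Python) of the kind of frequent "ceph osd
perf" sampling mentioned above: it polls the JSON output every few seconds
and flags OSDs whose commit latency sticks out from the cluster median.
The JSON key names (osd_perf_infos, perf_stats, commit_latency_ms) and the
3x/20ms thresholds are assumptions here, so check them against the output
of your Ceph release before relying on it.
---
#!/usr/bin/env python
# Sketch: poll "ceph osd perf" as JSON and flag OSDs whose commit latency
# sticks out. Key names (osd_perf_infos, perf_stats, commit_latency_ms)
# may differ between Ceph releases - verify against "ceph osd perf -f json".
import json
import subprocess
import time


def osd_perf():
    out = subprocess.check_output(["ceph", "osd", "perf", "-f", "json"])
    data = json.loads(out)
    # some releases wrap the list in an "osdstats" object
    infos = data.get("osd_perf_infos") or \
        data.get("osdstats", {}).get("osd_perf_infos", [])
    return dict((i["id"], i["perf_stats"]["commit_latency_ms"]) for i in infos)


def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


while True:
    latencies = osd_perf()
    if latencies:
        med = median(latencies.values())
        for osd_id, ms in sorted(latencies.items()):
            # arbitrary thresholds: 3x the median and at least 20ms absolute
            if ms > max(3 * med, 20):
                print("osd.%d commit latency %dms (cluster median %dms)"
                      % (osd_id, ms, med))
    time.sleep(5)
---
Fed into collectd/graphite instead of print(), something like this would
give the history needed to see whether a slow OSD stays slow or the
slowness just migrates around.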
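
Along the same lines, a sketch of what a wrapper around the admin socket
counters could look like: it takes the delta of the latency counters
("sum" over "avgcount") between two "ceph daemon osd.N perf dump" calls,
so the result is the average latency for just that interval rather than
since OSD start. It has to run on the node hosting the OSD; the counter
names, the assumption that "sum" is in seconds, and the osd.0 / 10-second
placeholders should all be checked against your own "perf dump" output.
---
#!/usr/bin/env python
# Sketch: per-interval average latencies from an OSD's admin socket, computed
# as the delta of the latency counters between two "perf dump" samples.
# Counter names and the unit of "sum" (seconds, as far as I can tell) are
# assumptions - check them against your own "ceph daemon osd.N perf dump".
import json
import subprocess
import time

OSD_ID = 0        # placeholder: an OSD running on this node
INTERVAL = 10     # seconds between samples
COUNTERS = ("op_latency", "subop_latency", "op_r_latency", "op_w_latency")


def perf_dump(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out)["osd"]


prev = perf_dump(OSD_ID)
while True:
    time.sleep(INTERVAL)
    cur = perf_dump(OSD_ID)
    for name in COUNTERS:
        d_sum = cur[name]["sum"] - prev[name]["sum"]
        d_cnt = cur[name]["avgcount"] - prev[name]["avgcount"]
        if d_cnt > 0:
            # average latency over this interval only, in milliseconds
            print("osd.%d %-15s %7.1f ms over %d ops"
                  % (OSD_ID, name, 1000.0 * d_sum / d_cnt, d_cnt))
    prev = cur
---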
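
And a toy illustration of the "correct for load" argument above, using the
hypothetical 10-OSD numbers from the text (9 OSDs with ~5% of the ops each,
one with 55%): an OSD is only flagged as genuinely slow if its latency
sticks out while its share of the ops does not. All numbers and thresholds
are made up for the example.
---
# Toy example of correcting latency for load, with the hypothetical numbers
# from the text: 10 OSDs, 9 of which got ~5% of the ops and one 55%.
ops = dict((i, 50) for i in range(9))     # ops per OSD in the sample period
ops[9] = 550                              # the OSD that got the short CRUSH end
lat_ms = dict((i, 20) for i in range(9))  # average latency per OSD in that period
lat_ms[9] = 120                           # high latency, but explained by the load

total_ops = float(sum(ops.values()))
baseline = min(lat_ms.values())
fair_share = 1.0 / len(ops)

for osd_id in sorted(ops):
    share = ops[osd_id] / total_ops
    if lat_ms[osd_id] > 3 * baseline:
        if share > 2 * fair_share:
            print("osd.%d: %dms but %.0f%% of the ops - probably just overloaded"
                  % (osd_id, lat_ms[osd_id], 100 * share))
        else:
            print("osd.%d: %dms at only %.0f%% of the ops - genuinely slow?"
                  % (osd_id, lat_ms[osd_id], 100 * share))
---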