On Apr 6, 2015, at 7:04 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:

I see that Ceph has `ceph osd perf`, which reports the latency of the OSDs. I graph aggregate stats from `ceph --admin-daemon /var/run/ceph/ceph-osd.$osdid.asok perf dump`. If the max latency strays too far from my mean latency, I know to go look for the troublemaker. My graphs look something like this:

[graph image: per-chassis min/mean/max OSD latency over time]

So on Thursday, just before noon, a drive dies. The blue min-latency line for all disks spikes up because every disk is recovering the data from the lost OSD. The min drops back to normal pretty quickly, but then the red max line spikes way up for the single new disk that replaced the dead drive. It stays high until the backfill to that disk finishes, at which point it returns to normal just before midnight.

I graph it this way because I have 30 OSDs per chassis, and a chart with 30 individual lines would be kind of tough to read. On less dense nodes, though, individual lines would probably be the way to go.
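The min/mean/max aggregation described above can be sketched roughly like this. Note this is a minimal illustration, not my actual graphing setup: it parses the output of `ceph osd perf -f json` and reduces the per-OSD commit latencies to three numbers per sample. The `sample` string below is made-up data, and the exact JSON layout (here, the `osd_perf_infos` shape) varies between Ceph releases, so check your version's output first.

```python
import json

# Made-up sample of `ceph osd perf -f json` output for illustration;
# in practice you would capture this with subprocess or a monitoring agent.
sample = '''
{"osd_perf_infos": [
  {"id": 0, "perf_stats": {"commit_latency_ms": 12,  "apply_latency_ms": 3}},
  {"id": 1, "perf_stats": {"commit_latency_ms": 480, "apply_latency_ms": 95}},
  {"id": 2, "perf_stats": {"commit_latency_ms": 15,  "apply_latency_ms": 4}}
]}
'''

def aggregate_commit_latency(raw_json):
    """Reduce per-OSD commit latencies to (min, mean, max) for one graph sample."""
    infos = json.loads(raw_json)["osd_perf_infos"]
    latencies = [osd["perf_stats"]["commit_latency_ms"] for osd in infos]
    return min(latencies), sum(latencies) / len(latencies), max(latencies)

lo, mean, hi = aggregate_commit_latency(sample)
print(lo, mean, hi)  # here OSD 1 is the troublemaker: max far above the mean
```

Feeding one such triple per polling interval into your graphing tool gives the three lines described above; a max that pulls far away from the mean is the cue to hunt down the slow OSD.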
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com