On Apr 6, 2015, at 7:04 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:

I see that Ceph has `ceph osd perf`, which reports the latency of the OSDs. I graph aggregate stats from `ceph --admin-daemon /var/run/ceph/ceph-osd.$osdid.asok perf dump`. If the max latency strays too far from my mean latency, I know to go look for the troublemaker. My graphs look something like this:

[graph image: per-chassis min/mean/max OSD latency over time]

So on Thursday, just before noon, a drive dies. The blue min-latency line for all disks spikes up because every disk is recovering the data from the lost OSD. The min drops back to normal pretty quickly, but then the red max line spikes way up for the single new disk that replaced the dead drive. It stays high until the backfill to that disk finishes, at which point it returns to normal just before midnight.

I graph it this way because I have 30 OSDs per chassis, and a chart with 30 individual lines would be kind of tough to read. On less dense nodes, though, individual lines would probably be the way to go.
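The min/mean/max aggregation described above can be sketched roughly like this. Note this is a minimal illustration, not my actual graphing setup: it parses the output of `ceph osd perf -f json` and reduces the per-OSD commit latencies to three numbers per sample. The `sample` string below is made-up data, and the exact JSON layout (here, the `osd_perf_infos` shape) varies between Ceph releases, so check your version's output first.

```python
import json

# Made-up sample of `ceph osd perf -f json` output for illustration;
# in practice you would capture this with subprocess or a monitoring agent.
sample = '''
{"osd_perf_infos": [
  {"id": 0, "perf_stats": {"commit_latency_ms": 12,  "apply_latency_ms": 3}},
  {"id": 1, "perf_stats": {"commit_latency_ms": 480, "apply_latency_ms": 95}},
  {"id": 2, "perf_stats": {"commit_latency_ms": 15,  "apply_latency_ms": 4}}
]}
'''

def aggregate_commit_latency(raw_json):
    """Reduce per-OSD commit latencies to (min, mean, max) for one graph sample."""
    infos = json.loads(raw_json)["osd_perf_infos"]
    latencies = [osd["perf_stats"]["commit_latency_ms"] for osd in infos]
    return min(latencies), sum(latencies) / len(latencies), max(latencies)

lo, mean, hi = aggregate_commit_latency(sample)
print(lo, mean, hi)  # here OSD 1 is the troublemaker: max far above the mean
```

Feeding one such triple per polling interval into your graphing tool gives the three lines described above; a max that pulls far away from the mean is the cue to hunt down the slow OSD.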
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com