Re: Slow OSD detection

When I run "ceph --admin-daemon perf dump" pointed at an OSD admin
socket, I get a lot of performance-related data. A few values are of
particular interest:

1. filestore : journal_latency - long-running average
2. osd : op_w_latency - long-running average
3. osd : op_w_process_latency - long-running average

Since there is a count and a sum associated with each of these values,
we can potentially get an average over a very small time period by
reading them twice in quick succession and dividing the change in sum
by the change in count.
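
As a rough sketch of the count/sum idea above, here is how the
short-interval average could be computed in Python. It assumes the perf
dump JSON exposes each latency counter as an object with "avgcount" and
"sum" fields; the helper names and the socket path in the usage note
are my own illustrations, not anything Ceph provides.

```python
import json
import subprocess


def read_perf_dump(asok_path):
    """Run perf dump against one OSD admin socket and parse the JSON."""
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", asok_path, "perf", "dump"])
    return json.loads(out)


def interval_average(before, after, section, counter):
    """Average latency over the window between two perf dump samples.

    Assumes each counter looks like {"avgcount": N, "sum": S}.
    Returns None when no operations completed in the window.
    """
    b = before[section][counter]
    a = after[section][counter]
    delta_count = a["avgcount"] - b["avgcount"]
    delta_sum = a["sum"] - b["sum"]
    if delta_count <= 0:
        # No new samples (or the counters were reset): nothing to average.
        return None
    return delta_sum / delta_count
```

Usage would be to call read_perf_dump() twice a few seconds apart on
the same socket (for example /var/run/ceph/ceph-osd.0.asok) and pass
both results to interval_average(first, second, "osd", "op_w_latency").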

Would it be correct to say that journal write latency decides the
"slowness" of an OSD, since this is a synchronous operation?

Second question: It is easy to say which OSD is slow within a given
server (node). But to compare it against all servers in the cluster, we
need a mechanism to make the per-server data available at a single
point for comparison and reporting. Instead of creating another
cluster-wide process for this, can we use the ceph monitor?
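
To make the comparison step concrete, here is a small Python sketch of
what the central point could do once it holds one recent average
latency per OSD. How the data gets there is exactly the open question
above; the function names, the factor of 2 and the sample numbers are
assumptions for illustration only.

```python
from statistics import median


def slowest_osds(latency_by_osd, top_n=5):
    """Return the top_n (osd, latency) pairs, slowest first."""
    ranked = sorted(latency_by_osd.items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def outliers(latency_by_osd, factor=2.0):
    """Return OSDs whose latency exceeds factor times the cluster median."""
    cluster_median = median(latency_by_osd.values())
    return sorted(osd for osd, latency in latency_by_osd.items()
                  if latency > factor * cluster_median)


# Hypothetical per-OSD average write latencies, in seconds.
sample = {"osd.0": 0.010, "osd.1": 0.012, "osd.2": 0.050, "osd.3": 0.011}
print(slowest_osds(sample, top_n=2))
print(outliers(sample))
```

Comparing against the cluster median rather than a fixed threshold
keeps the check relative to the other OSDs, so a uniformly busy cluster
does not flag every OSD as slow.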

-Sreenath


On 11/24/14, Sreenath BH <bhsreenath@xxxxxxxxx> wrote:
> I think we could find the top five OSDs with the highest average slow
> times, as well as the five OSDs with the highest absolute maximum time.
>
> Should we also be correlating this with SMART data associated with the
> disk?
> Some agent has to do the comparison in a storage node and make this
> available to other nodes to compare with their own data.
>
> -Sreenath
>
> On 11/22/14, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>> The challenge I think is that "slow osd" is probably a global
>> question.  That is, I think it requires the agent to compare a given
>> osd to the other osds in the cluster (and to itself earlier in time).
>> -Sam
>>
>> On Fri, Nov 21, 2014 at 1:07 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx>
>> wrote:
>>> It'd be nice if something like slow OSD detection could exist outside of
>>> calamari and itself be an event that we record in the logs and make
>>> available via the admin socket (so that calamari could pick it up). That
>>> way folks could get it into logstash and other system monitoring tools
>>> (say PCP/Nagios/etc).
>>>
>>> Mark
>>>
>>>
>>> On 11/21/2014 02:58 PM, Samuel Just wrote:
>>>>
>>>> It's still an open item.  #ceph-devel would be a good place to bounce
>>>> ideas.  Through the admin_socket and perf_counter machinery, the osds
>>>> already expose a bunch of information about queue length, latency,
>>>> etc.  This might actually fit well in calamari, which already gathers
>>>> a bunch of those stats.
>>>> -Sam
>>>>
>>>> On Thu, Nov 20, 2014 at 9:00 PM, Sreenath BH <bhsreenath@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> Hi All
>>>>>
>>>>> Slow OSD detection is mentioned as one of the project ideas in
>>>>> https://wiki.ceph.com/Development/Project_Ideas
>>>>>
>>>>> I am interested in implementing this. Is this still an open item?
>>>>>
>>>>> thanks,
>>>>> Sreenath
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in
>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>


