Re: many slow requests on different osds (scrubbing disabled)

I've seen something like this a few times. 

Once, the battery in my battery-backed RAID card failed.  That made every OSD on that host slow, which triggered slow request notices pretty much cluster-wide.  It was only when I histogrammed the slow request notices that I saw most of them came from a single node.  I then compared the disk latency graphs between nodes and saw that one node had much higher write latency.  This took me a while to track down.
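
For anyone who wants to do the same kind of histogram, here is a rough sketch.  The log line shape it matches (an "osd.N" name followed by "slow request", optionally "waiting for subops from [a,b]") is an assumption, and both the function name and the osd_to_host mapping are made up for illustration, so adjust the patterns to whatever your cluster log actually looks like.

```python
import re
from collections import Counter

# Assumed line shape, not taken from the thread: adjust to your own logs.
REPORTER = re.compile(r"\b(osd\.\d+)\b.*slow request")
SUBOPS = re.compile(r"waiting for subops from \[?([\d,\s]+)\]?")


def histogram_slow_requests(lines, osd_to_host=None):
    """Count slow request notices per reporting OSD, per OSD waited on, and per host.

    osd_to_host is a hypothetical dict such as {"osd.284": "host-a"}.
    """
    by_osd = Counter()      # OSDs that logged the notice
    waited_on = Counter()   # OSDs the reporter says it is waiting for
    for line in lines:
        match = REPORTER.search(line)
        if not match:
            continue
        by_osd[match.group(1)] += 1
        subops = SUBOPS.search(line)
        if subops:
            for osd_id in re.findall(r"\d+", subops.group(1)):
                waited_on["osd." + osd_id] += 1

    by_host = Counter()
    if osd_to_host:
        for osd, count in by_osd.items():
            by_host[osd_to_host.get(osd, "unknown")] += count
    return by_osd, waited_on, by_host
```

Feed it the lines of the cluster log and look at by_host.most_common(5).  If one host dominates, start with that host's hardware.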

Another time, I had a consumer HDD that was slowly failing.  It would hit a group of bad sectors, remap, repeat.  SMART warned me about it, so I replaced the disk after the second round of slow request alerts.  This was pretty straightforward to diagnose, but only because smartd notified me.


In both cases, I saw "slow request" notices on the affected disks.  Your osd.284 says osd.186 and osd.177 are being slow, but osd.186 and osd.177 don't claim to be slow.

It's possible that there is another disk that is slow, causing replication on osd.186 and osd.177 to slow down.  Because PGs are distributed across OSDs, one disk being a little slow can affect a large number of OSDs.


If SMART doesn't show you a failing disk, I'd start looking for disks (the disk itself, not the OSD daemon) with high latency around your problem times.  If you focus on the problem times, give each one a +/- 10 minute window.  Sometimes it takes a little while for the disk slowness to spread out enough for Ceph to complain.
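
As a rough sketch of that search: assuming you already collect per-disk latency samples somewhere (iostat, collectd, or similar), something like this picks out the disks that were slow near the problem times.  The (timestamp, disk, latency_ms) sample layout, the function name, and the 50 ms threshold are all assumptions for illustration, not anything Ceph provides.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slow_disks(samples, problem_times, window=timedelta(minutes=10), threshold_ms=50.0):
    """Return (disk, mean latency) pairs for disks that were slow near a problem time.

    samples is an iterable of (timestamp, disk, latency_ms) tuples, a
    hypothetical layout for whatever your monitoring exports.
    """
    per_disk = defaultdict(list)
    for ts, disk, latency_ms in samples:
        # Keep only samples within +/- window of any problem time.
        if any(abs(ts - problem) <= window for problem in problem_times):
            per_disk[disk].append(latency_ms)

    means = {disk: sum(vals) / len(vals) for disk, vals in per_disk.items()}
    slow = [(disk, mean) for disk, mean in means.items() if mean >= threshold_ms]
    # Worst disk first.
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

Use the timestamps of the slow request notices as problem_times.  A disk that shows up near the top for several different problem times is the one to look at first.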


On Wed, Apr 15, 2015 at 3:20 PM, Dominik Mostowiec <dominikmostowiec@xxxxxxxxx> wrote:
Hi,
For a few days we have been seeing many slow requests on our cluster.
Cluster:
ceph version 0.67.11
3 x mon
36 hosts -> 10 osd ( 4T ) + 2 SSD (journals)
Scrubbing and deep scrubbing are disabled, but the count of slow requests is
still increasing.
Disk utilisation has been very low since we disabled scrubbing.
Log from one slow write, with debug osd = 20/20:
osd.284 - master: http://pastebin.com/xPtpNU6n
osd.186 - replica: http://pastebin.com/NS1gmhB0
osd.177 - replica: http://pastebin.com/Ln9L2Z5Z

Can you help me find the reason for it?

--
Regards
Dominik
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
