Re: OSD - Slow Requests

"Garg, Pankaj" <Pankaj.Garg@xxxxxxxxxxxxxxxxxx> · Thu, 5 May 2016 23:54:00 +0000

Hi Christian,
Thanks for your response. But strangely enough, this is a new problem. I have used the same cluster and hardware for over a year. I have my drives in a new chassis now, and that is the only change.
My problem OSDs change whenever I reboot the system. Also, since this is benchmarking, I would expect the cluster to throttle when I reach its limit rather than report errors.
BTW, sometimes I'm able to complete my whole benchmark write run without any issues, and other times I see these errors.

Thanks
Pankaj

-----Original Message-----
From: Christian Balzer [mailto:chibi@xxxxxxx] 
Sent: Wednesday, May 04, 2016 9:01 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: Garg, Pankaj
Subject: Re:  OSD - Slow Requests

Hello,

On Wed, 4 May 2016 21:08:02 +0000 Garg, Pankaj wrote:

> Hi,
> 
> I am getting messages like the following from my Ceph systems. 
> Normally this would indicate issues with drives. But when I restart my 
> system, a couple of different, seemingly random OSDs start spitting out 
> the same message again. So it's definitely not the same drives every time.
> 
> Any ideas on how to debug this? I don't see any drive-related issues 
> in the dmesg log either.
>

Drives having issues (as in being slow due to errors or firmware bugs) is a possible reason, but it would not be at the top of my list.

You want to run atop, iostat or the like and graph the actual drive statistics and various Ceph performance counters to see what is going on, and whether a particular drive is slower than the rest or your whole system is simply reaching the limit of its performance.
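
As a rough sketch of that kind of per-drive comparison (not a replacement for atop or iostat), the following Python reads two snapshots of /proc/diskstats and flags devices whose average write latency stands out from the rest. The function names and the 3x-median threshold are assumptions made for illustration only, and the field positions follow the kernel's documented /proc/diskstats layout.

```python
import statistics
import time


def parse_diskstats(text):
    """Map device name -> (writes_completed, ms_spent_writing)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip short or malformed lines
        # fields[2] = device name, fields[7] = writes completed,
        # fields[10] = milliseconds spent writing
        stats[fields[2]] = (int(fields[7]), int(fields[10]))
    return stats


def write_latency_ms(before, after):
    """Average ms per completed write for each device between two snapshots."""
    latency = {}
    for dev, (writes1, ms1) in after.items():
        if dev not in before:
            continue
        writes0, ms0 = before[dev]
        if writes1 > writes0:  # ignore devices that saw no writes
            latency[dev] = (ms1 - ms0) / (writes1 - writes0)
    return latency


def slow_devices(latency, factor=3.0):
    """Devices whose latency exceeds factor x the median (arbitrary threshold)."""
    if not latency:
        return []
    median = statistics.median(latency.values())
    return sorted(d for d, ms in latency.items()
                  if median > 0 and ms > factor * median)


def sample(interval=10, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart and return write latencies."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return write_latency_ms(before, after)
```

Calling `slow_devices(sample())` while the benchmark is running gives a first hint; partitions and loop devices also show up in /proc/diskstats, so you would probably want to filter the result down to the OSD data and journal devices.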

Looking at your ceph log output, the first thing that catches the eye is that all slow objects are from benchmark runs (rados bench), so you seem to be stress testing the cluster and have found its limits...

In addition to that all the slow requests include osd.84, so you might give that one a closer look. 
But that could of course be a coincidence due to limited log samples.
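
To check this against more than a handful of lines, a small script can tally which OSDs the slow requests are waiting on. This is only a sketch that assumes the "slow request ... waiting for subops from" format shown in your log; the function name is made up for illustration, and the sample text is taken from your message.

```python
import re
from collections import Counter

# One "slow request" entry: captures the reporting OSD and the OSDs whose
# sub-operations it is waiting for. Assumes the format quoted in this thread.
SLOW_RE = re.compile(
    r"osd\.(\d+) \[WRN\] slow request .*?waiting for subops from ([\d,]+)"
)


def tally_slow_requests(log_text):
    """Return (reporters, waited_on) Counters keyed by OSD id."""
    # Collapse line wrapping so each entry can be matched in one pass.
    flat = " ".join(log_text.split())
    reporters, waited_on = Counter(), Counter()
    for reporter, subops in SLOW_RE.findall(flat):
        reporters[int(reporter)] += 1
        for osd in subops.split(","):
            if osd:
                waited_on[int(osd)] += 1
    return reporters, waited_on


SAMPLE = """
2016-05-04 14:02:52.499115 osd.72 [WRN] slow request 30.429347 seconds
old, received at 2016-05-04 14:02:22.069658:
osd_op(client.2859198.0:9559 benchmark_data_x86Ceph3_54385_object9558
[write 0~131072] 309.17ee1e0e ack+ondisk+write+known_if_redirected
e14815) currently waiting for subops from 84,104
2016-05-04 14:02:54.499453 osd.72 [WRN] 24 slow requests, 1 included below;
oldest blocked for > 52.866778 secs
2016-05-04 14:02:54.499467 osd.72 [WRN] slow request 30.660900 seconds old,
received at 2016-05-04 14:02:23.838455: osd_op(client.2859198.0:9661
benchmark_data_x86Ceph3_54385_object9660 [write 0~131072] 309.4054960e
ack+ondisk+write+known_if_redirected e14815) currently waiting for
subops from 84,104
2016-05-04 14:02:56.499835 osd.72 [WRN] slow request 30.940457 seconds
old, received at 2016-05-04 14:02:25.559273:
osd_op(client.2859197.0:9796 benchmark_data_x86Ceph1_24943_object9795
[write 0~131072] 308.7e0944a ack+ondisk+write+known_if_redirected
e14815) currently waiting for subops from 84,97
"""

reporters, waited_on = tally_slow_requests(SAMPLE)
print(waited_on.most_common())  # [(84, 3), (104, 2), (97, 1)]
```

Run over the full ceph.log of a benchmark run, a count that stays heavily skewed towards one OSD would point at that OSD or its host, while an even spread would fit the "cluster at its limit" explanation better.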

Christian

> Thanks
> Pankaj
> 
> 
> 
> 2016-05-04 14:02:52.499115 osd.72 [WRN] slow request 30.429347 seconds old, received at 2016-05-04 14:02:22.069658: osd_op(client.2859198.0:9559 benchmark_data_x86Ceph3_54385_object9558 [write 0~131072] 309.17ee1e0e ack+ondisk+write+known_if_redirected e14815) currently waiting for subops from 84,104
> 2016-05-04 14:02:54.499453 osd.72 [WRN] 24 slow requests, 1 included below; oldest blocked for > 52.866778 secs
> 2016-05-04 14:02:54.499467 osd.72 [WRN] slow request 30.660900 seconds old, received at 2016-05-04 14:02:23.838455: osd_op(client.2859198.0:9661 benchmark_data_x86Ceph3_54385_object9660 [write 0~131072] 309.4054960e ack+ondisk+write+known_if_redirected e14815) currently waiting for subops from 84,104
> 2016-05-04 14:02:56.499822 osd.72 [WRN] 25 slow requests, 1 included below; oldest blocked for > 54.867154 secs
> 2016-05-04 14:02:56.499835 osd.72 [WRN] slow request 30.940457 seconds old, received at 2016-05-04 14:02:25.559273: osd_op(client.2859197.0:9796 benchmark_data_x86Ceph1_24943_object9795 [write 0~131072] 308.7e0944a ack+ondisk+write+known_if_redirected e14815) currently waiting for subops from 84,97
> 2016-05-04 14:02:59.140562 osd.84 [WRN] 33 slow requests, 1 included below; oldest blocked for > 58.267177 secs
> 
> 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com