=== Facts ===
1. RadosGW disabled, rados bench write only - 10x bigger traffic served without any slow_request.
2. RadosGW enabled - first slow_requests.
3. Traffic via RadosGW - 20-50 slow_requests per minute (~0.1% of IOPS).
4. We compacted leveldb on the MONs 48h before the first slow_requests. Maybe there is some strange correlation?

=== Details ===
ceph.log:

2015-04-22 17:46:33.805926 osd.165 10.176.131.17:6810/98485 3 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 36.621691 secs
2015-04-22 17:46:33.805933 osd.165 10.176.131.17:6810/98485 4 : [WRN] slow request 36.621691 seconds old, received at 2015-04-22 17:45:57.184143: osd_op(client.22009265.0:244576 default.2139585.95_cache/6f63f82baac53e854c126598c6dc8aaf_800x600_90.jpg [create 0~0,setxattr user.rgw.idtag (23),writefull 0~9971,setxattr user.rgw.manifest (296),setxattr user.rgw.acl (129),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),setxattr user.rgw.x-amz-meta-md5 (33),setxattr user.rgw.x-amz-meta-sync-priority (5)] 10.58f70cfe e106833) v4 currently waiting for subops from [260,187]

heartbeat_map warnings before the slow_req:

2015-04-22 17:46:13.059658 7fcb9f64c700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059669 7fcb9f64c700 20 heartbeat_map is_healthy = NOT HEALTHY
2015-04-22 17:46:13.059677 7fcb9f64c700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059683 7fcb9f64c700 20 heartbeat_map is_healthy = NOT HEALTHY
2015-04-22 17:46:13.059693 7fcba064e700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059701 7fcba064e700 20 heartbeat_map is_healthy = NOT HEALTHY
....
2015-04-22 17:46:33.794865 7fcb9f64c700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:33.794877 7fcb9f64c700 20 heartbeat_map is_healthy = NOT HEALTHY
2015-04-22 17:46:33.795912 7fcba064e700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:33.795923 7fcba064e700 20 heartbeat_map is_healthy = NOT HEALTHY

Lossy connections a few seconds before the slow_requests, even on loopback connections:

/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:43.471808 7ffcd7837700 10 -- 10.176.139.4:6829/2938116 >> 10.176.131.4:0/2937379 pipe(0x103ce780 sd=180 :6829 s=2 pgs=121 cs=1 l=1 c=0x103cbdc0).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:43.471828 7ffcd7736700 10 -- 10.176.131.4:6821/2938116 >> 10.176.131.4:0/2937379 pipe(0x103b5280 sd=742 :6821 s=2 pgs=122 cs=1 l=1 c=0x103beb00).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229544 7ffcfd28f700 10 -- 10.176.139.4:6829/2938116 >> 10.176.139.11:0/3658878 pipe(0x1073f280 sd=735 :6829 s=2 pgs=1919 cs=1 l=1 c=0xff6dc60).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229670 7ffcf813e700 10 -- 10.176.139.4:6829/2938116 >> 10.176.139.29:0/2882509 pipe(0x17f74000 sd=862 :6829 s=2 pgs=582 cs=1 l=1 c=0x101e96e0).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229676 7ffcf4f0c700 10 -- 10.176.139.4:6829/2938116 >> 10.176.139.35:0/2697022 pipe(0xff8ba00 sd=190 :6829 s=2 pgs=13 cs=1 l=1 c=0xff6d420).fault on lossy channel, failing
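A quick way to check whether these warnings concentrate on particular OSDs is to histogram them per OSD. Below is a minimal illustrative sketch (not necessarily how the histogram mentioned under Drives/Machines below was actually produced); it assumes the ceph.log [WRN] format shown above and counts both the reporting OSD and the OSDs it is waiting for subops from:

#!/usr/bin/env python
# Illustrative sketch: per-OSD histogram of slow_request warnings from ceph.log.
# Counts both the OSD that reports the warning and the peer OSDs it is
# "waiting for subops from", since the slow disk is often on a replica.
import re
import sys
from collections import Counter

reporters = Counter()
subop_peers = Counter()
subops_re = re.compile(r'waiting for subops from \[([0-9,]+)\]')

for line in sys.stdin:
    if '[WRN] slow request' not in line:
        continue
    fields = line.split()
    reporters[fields[2]] += 1      # third field is the reporting OSD, e.g. "osd.165"
    m = subops_re.search(line)
    if m:
        for osd_id in m.group(1).split(','):
            subop_peers['osd.' + osd_id] += 1

print('-- reporting OSDs --')
for osd, count in reporters.most_common(20):
    print('%6d  %s' % (count, osd))
print('-- waited-for replica OSDs --')
for osd, count in subop_peers.most_common(20):
    print('%6d  %s' % (count, osd))

Run it against ceph.log, e.g. "python slow_req_by_osd.py < /var/log/ceph/ceph.log" (the script name is arbitrary).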
=== Network ===
We checked it twice with iperf, ping, etc. It works OK, and the sysctls are also tuned.
We are observing peaks of TCP resets and TCP drops when the slow_requests occur, but these seem to be caused by the problem rather than causing it.

=== Drives/Machines ===
We checked our drives (SMART), RAID controllers and machines, and everything looks correct. We disabled the caches on the RAID controllers and on the drives, and nothing changed.
There is no leader in the histogram of slow requests, i.e. no drive or machine with bigger problems than the others.

Regards,
Slawomir Ritter


From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Craig Lewis

I've seen something like this a few times.

Once, I lost the battery in my battery-backed RAID card. That caused all the OSDs on that host to be slow, which triggered slow request notices pretty much cluster-wide. It was only when I histogrammed the slow request notices that I saw most of them were on a single node. I compared the disk latency graphs between nodes, and saw that one node had a much higher write latency. This took me a while to track down.

Another time, I had a consumer HDD that was slowly failing. It would hit a group of bad sectors, remap, repeat. SMART warned me about it, so I replaced the disk after the second slow request alert. This was pretty straightforward to diagnose, only because smartd notified me.

In both cases, I saw "slow request" notices on the affected disks. Your osd.284 says osd.186 and osd.177 are being slow, but osd.186 and osd.177 don't claim to be slow. It's possible that there is another disk that is slow, causing osd.186 and osd.177's replication to slow down. With the PG distribution over OSDs, one disk being a little slow can affect a large number of OSDs.

If SMART doesn't show you a disk is failing, I'd start looking for disks (the disk itself, not the OSD daemon) with a high latency around your problem times. If you focus on the problem times, give it a +/- 10 minute window. Sometimes it takes a little while for the disk slowness to spread out enough for Ceph to complain.
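One way to do that check when per-disk latency graphs are not already being collected is sketched below; it computes an iostat-style "await" (ms per completed I/O) from /proc/diskstats deltas and flags disks that stay above a threshold. The 10-second interval, the 50 ms threshold and the sdX-only filter are illustrative assumptions, not values suggested anywhere in this thread:

#!/usr/bin/env python
# Illustrative sketch: flag disks with high average I/O latency ("await"),
# computed from /proc/diskstats deltas, around the slow_request windows.
import time

def read_diskstats():
    stats = {}
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            # whole sdX devices only; skip partitions like sda1 (illustrative filter)
            if not name.startswith('sd') or name[-1].isdigit():
                continue
            ios = int(fields[3]) + int(fields[7])      # reads + writes completed
            io_ms = int(fields[6]) + int(fields[10])   # ms spent reading + writing
            stats[name] = (ios, io_ms)
    return stats

prev = read_diskstats()
while True:
    time.sleep(10)
    cur = read_diskstats()
    for disk, (ios, io_ms) in sorted(cur.items()):
        prev_ios, prev_ms = prev.get(disk, (ios, io_ms))
        delta_ios = ios - prev_ios
        delta_ms = io_ms - prev_ms
        await_ms = float(delta_ms) / delta_ios if delta_ios else 0.0
        if await_ms > 50:   # arbitrary threshold, tune for your hardware
            print('%s %s await=%.1f ms over the last 10s'
                  % (time.strftime('%Y-%m-%d %H:%M:%S'), disk, await_ms))
    prev = cur

Correlating its output (or existing iostat -x / sar -d history) with the slow_request timestamps, using the +/- 10 minute window suggested above, should point at a single slow disk if there is one.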
On Wed, Apr 15, 2015 at 3:20 PM, Dominik Mostowiec <dominikmostowiec@xxxxxxxxx> wrote:
> Hi,