Huge issues with slow requests

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

Sorry for the lack of information yesterday, this was "solved" after some 30 minutes, after having reloaded/restarted all osd daemons.
Unfortunately we couldn?t pin point it to a single OSD or drive, all drives seemed ok, some had a bit higher latency and we tried to out / in them to see if it had any effect which it did not.

The cluster consists of 3 mon servrers, 5 OSD servers with 10 enterprise HDDs backed with 2 S3700 SSDs for journals each. OSD servers have 256GB of RAM, 2x E5-2630 v2 @ 2.60GHz CPUs.

The log that I posted yesterday was just a small taste of the full one ;) They were all pointing to different osd?s that they were waiting for.
We?re also monitoring all of the VMs running on KVM, and we didn?t see any exceptional throughput or iops usage before or during this event. We were checking iostat etc and nothing was out of the ordinary..

Going to double check SMART and also see if we can off load some of the cluster in any way. If you have any other advice that?d be appreciated :)

Thanks for your help!

Kind Regards,
David

5 sep 2014 kl. 07:30 skrev Martin B Nielsen <martin at unity3d.com>:

> Just echoing what Christian said.
> 
> Also, iirc the "currently waiting for subobs on [" could also mean a problem on those as it waits for ack from them (I might remember wrong).
> 
> If that is the case you might want to check in on osd 13 & 37 as well.
> 
> With the cluster load and size you should not have this problem; I'm pretty sure you're dealing with a rogue/faulty osd/node somewhere.
> 
> Cheers,
> Martin
> 
> 
> On Fri, Sep 5, 2014 at 2:28 AM, Christian Balzer <chibi at gol.com> wrote:
> On Thu, 4 Sep 2014 12:02:13 +0200 David wrote:
> 
> > Hi,
> >
> > We?re running a ceph cluster with version:
> >
> > 0.67.7-1~bpo70+1
> >
> > All of a sudden we?re having issues with the cluster (running RBD images
> > for kvm) with slow requests on all of the OSD servers. Any idea why and
> > how to fix it?
> >
> You give us a Ceph version at least, but for anybody to make guesses we
> need much more information than a log spew.
> 
> How many nodes/OSDs, OS, hardware, OSD details (FS, journals on SSDs), etc.
> 
> Run atop (in a sufficiently large terminal) on all your nodes, see if you
> can spot a bottleneck, like a disk being at 100% all the time with a
> much higher avio than the others.
> Looking at your logs, I'd pay particular attention to the disk holding
> osd.22.
> A single slow disk can bring a whole large cluster to a crawl.
> If you're using a hardware controller with a battery backed up cache,
> check if that is fine, loss of the battery would switch from writeback to
> writethrough and massively slow down IOPS.
> 
> Regards,
> 
> Christian
> >
> > 2014-09-04 11:56:35.868521 mon.0 [INF] pgmap v12504451: 6860 pgs: 6860
> > active+clean; 12163 GB data, 36308 GB used, 142 TB / 178 TB avail;
> > 634KB/s rd, 487KB/s wr, 90op/s 2014-09-04 11:56:29.510270 osd.22 [WRN]
> > 15 slow requests, 1 included below; oldest blocked for > 44.745754 secs
> > 2014-09-04 11:56:29.510276 osd.22 [WRN] slow request 30.999821 seconds
> > old, received at 2014-09-04 11:55:58.510424:
> > osd_op(client.10731617.0:81868956
> > rbd_data.967e022eb141f2.0000000000000e72 [write 0~4194304] 3.c585cebe
> > e13901) v4 currently waiting for subops from [37,13] 2014-09-04
> > 11:56:30.510528 osd.22 [WRN] 21 slow requests, 6 included below; oldest
> > blocked for > 45.745989 secs 2014-09-04 11:56:30.510534 osd.22 [WRN]
> > slow request 30.122555 seconds old, received at 2014-09-04
> > 11:56:00.387925: osd_op(client.13425082.0:11962345
> > rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write 1114112~8192]
> > 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
> > 2014-09-04 11:56:30.510537 osd.22 [WRN] slow request 30.122362 seconds
> > old, received at 2014-09-04 11:56:00.388118:
> > osd_op(client.13425082.0:11962352
> > rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write 1134592~4096]
> > 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
> > 2014-09-04 11:56:30.510541 osd.22 [WRN] slow request 30.122298 seconds
> > old, received at 2014-09-04 11:56:00.388182:
> > osd_op(client.13425082.0:11962353
> > rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write 4046848~8192]
> > 3.c9e49140 e13901) v4 currently waiting for subops from [13,42]
> > 2014-09-04 11:56:30.510544 osd.22 [WRN] slow request 30.121577 seconds
> > old, received at 2014-09-04 11:56:00.388903:
> > osd_op(client.13425082.0:11962374
> > rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write 2527232~4096]
> > 3.cd9a9015 e13901) v4 currently waiting for subops from [45,1]
> > 2014-09-04 11:56:30.510548 osd.22 [WRN] slow request 30.121518 seconds
> > old, received at 2014-09-04 11:56:00.388962:
> > osd_op(client.13425082.0:11962375
> > rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write 3133440~4096]
> > 3.cd9a9015 e13901) v4 currently waiting for subops from [45,1]
> > 2014-09-04 11:56:31.510706 osd.22 [WRN] 26 slow requests, 6 included
> > below; oldest blocked for > 46.746163 secs 2014-09-04 11:56:31.510711
> > osd.22 [WRN] slow request 31.035418 seconds old, received at 2014-09-04
> > 11:56:00.475236: osd_op(client.9266625.0:135900595
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2097152~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:31.510715 osd.22 [WRN] slow request 31.035335 seconds
> > old, received at 2014-09-04 11:56:00.475319:
> > osd_op(client.9266625.0:135900596
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2162688~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:31.510718 osd.22 [WRN] slow request 31.035270 seconds
> > old, received at 2014-09-04 11:56:00.475384:
> > osd_op(client.9266625.0:135900597
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2400256~16384]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:31.510721 osd.22 [WRN] slow request 31.035093 seconds
> > old, received at 2014-09-04 11:56:00.475561:
> > osd_op(client.9266625.0:135900598
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2420736~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:31.510724 osd.22 [WRN] slow request 31.034990 seconds
> > old, received at 2014-09-04 11:56:00.475664:
> > osd_op(client.9266625.0:135900599
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2428928~12288]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:32.510864 osd.22 [WRN] 28 slow requests, 6 included
> > below; oldest blocked for > 47.746324 secs 2014-09-04 11:56:32.510871
> > osd.22 [WRN] slow request 32.035026 seconds old, received at 2014-09-04
> > 11:56:00.475789: osd_op(client.9266625.0:135900600
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2453504~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:32.510876 osd.22 [WRN] slow request 32.034982 seconds
> > old, received at 2014-09-04 11:56:00.475833:
> > osd_op(client.9266625.0:135900601
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2502656~16384]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:32.510881 osd.22 [WRN] slow request 32.034917 seconds
> > old, received at 2014-09-04 11:56:00.475898:
> > osd_op(client.9266625.0:135900602
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2600960~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:32.510885 osd.22 [WRN] slow request 32.034807 seconds
> > old, received at 2014-09-04 11:56:00.476008:
> > osd_op(client.9266625.0:135900603
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2617344~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:32.510890 osd.22 [WRN] slow request 32.034756 seconds
> > old, received at 2014-09-04 11:56:00.476059:
> > osd_op(client.9266625.0:135900604
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2637824~69632]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:32.723329 osd.17 [WRN] 4 slow requests, 4 included
> > below; oldest blocked for > 30.605794 secs 2014-09-04 11:56:32.723335
> > osd.17 [WRN] slow request 30.605794 seconds old, received at 2014-09-04
> > 11:56:02.117495: osd_op(client.12646411.0:31118135
> > rbd_data.81aaf52eb141f2.0000000000000e4b [stat,write 0~4194304]
> > 3.716b373b e13901) v4 currently waiting for subops from [31,1]
> > 2014-09-04 11:56:32.723338 osd.17 [WRN] slow request 30.362042 seconds
> > old, received at 2014-09-04 11:56:02.361247:
> > osd_op(client.12485793.0:121786489
> > rbd_data.4ae87f2eb141f2.000000000000e200 [stat,write 2834432~4096]
> > 3.6b43de72 e13901) v4 currently waiting for subops from [5,31]
> > 2014-09-04 11:56:32.723341 osd.17 [WRN] slow request 30.179681 seconds
> > old, received at 2014-09-04 11:56:02.543608:
> > osd_op(client.13425079.0:7014393
> > rbd_data.54f2522eb141f2.0000000000014d01 [stat,write 2502656~4096]
> > 3.d8ceea89 e13901) v4 currently waiting for subops from [31,18]
> > 2014-09-04 11:56:32.723344 osd.17 [WRN] slow request 30.000568 seconds
> > old, received at 2014-09-04 11:56:02.722721:
> > osd_op(client.9281279.0:144230325
> > rbd_data.2d99fc2eb141f2.000000000000e80a [stat,write 1589248~4096]
> > 3.16dc0e89 e13901) v4 currently waiting for subops from [31,18]
> > 2014-09-04 11:56:33.511043 osd.22 [WRN] 33 slow requests, 6 included
> > below; oldest blocked for > 48.746486 secs 2014-09-04 11:56:33.511049
> > osd.22 [WRN] slow request 33.034756 seconds old, received at 2014-09-04
> > 11:56:00.476221: osd_op(client.9266625.0:135900605
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2715648~8192]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:33.511053 osd.22 [WRN] slow request 33.034620 seconds
> > old, received at 2014-09-04 11:56:00.476357:
> > osd_op(client.9266625.0:135900606
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2756608~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:33.511056 osd.22 [WRN] slow request 33.034446 seconds
> > old, received at 2014-09-04 11:56:00.476531:
> > osd_op(client.9266625.0:135900607
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2809856~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:33.511060 osd.22 [WRN] slow request 33.034387 seconds
> > old, received at 2014-09-04 11:56:00.476590:
> > osd_op(client.9266625.0:135900608
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2883584~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:33.511063 osd.22 [WRN] slow request 33.034341 seconds
> > old, received at 2014-09-04 11:56:00.476636:
> > osd_op(client.9266625.0:135900609
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2891776~8192]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:34.511206 osd.22 [WRN] 38 slow requests, 6 included
> > below; oldest blocked for > 49.746664 secs 2014-09-04 11:56:34.511215
> > osd.22 [WRN] slow request 34.034426 seconds old, received at 2014-09-04
> > 11:56:00.476729: osd_op(client.9266625.0:135900610
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2908160~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:34.511219 osd.22 [WRN] slow request 34.034298 seconds
> > old, received at 2014-09-04 11:56:00.476857:
> > osd_op(client.9266625.0:135900611
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2932736~28672]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:34.511222 osd.22 [WRN] slow request 34.034234 seconds
> > old, received at 2014-09-04 11:56:00.476921:
> > osd_op(client.9266625.0:135900612
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2969600~8192]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:34.511225 osd.22 [WRN] slow request 34.034147 seconds
> > old, received at 2014-09-04 11:56:00.477008:
> > osd_op(client.9266625.0:135900613
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2981888~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:34.511228 osd.22 [WRN] slow request 34.034055 seconds
> > old, received at 2014-09-04 11:56:00.477100:
> > osd_op(client.9266625.0:135900614
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 2998272~12288]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:35.153813 osd.37 [WRN] 1 slow requests, 1 included
> > below; oldest blocked for > 30.307864 secs 2014-09-04 11:56:35.153820
> > osd.37 [WRN] slow request 30.307864 seconds old, received at 2014-09-04
> > 11:56:04.845917: osd_op(client.10785413.0:129409910
> > rbd_data.96803f2eb141f2.00000000000042b0 [stat,write 1712128~4096]
> > 3.6c5ac950 e13901) v4 currently waiting for subops from [22,41]
> > 2014-09-04 11:56:35.511450 osd.22 [WRN] 42 slow requests, 6 included
> > below; oldest blocked for > 50.746638 secs 2014-09-04 11:56:35.511455
> > osd.22 [WRN] slow request 35.034219 seconds old, received at 2014-09-04
> > 11:56:00.477178: osd_op(client.9266625.0:135900615
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 3022848~8192]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:35.511459 osd.22 [WRN] slow request 35.034143 seconds
> > old, received at 2014-09-04 11:56:00.477254:
> > osd_op(client.9266625.0:135900616
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 3035136~12288]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:35.511462 osd.22 [WRN] slow request 35.034082 seconds
> > old, received at 2014-09-04 11:56:00.477315:
> > osd_op(client.9266625.0:135900617
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 3055616~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:35.511465 osd.22 [WRN] slow request 35.033965 seconds
> > old, received at 2014-09-04 11:56:00.477432:
> > osd_op(client.9266625.0:135900618
> > rbd_data.42d6792eb141f2.000000000000bc00 [stat,write 3063808~4096]
> > 3.a2894ebe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:35.511469 osd.22 [WRN] slow request 34.162275 seconds
> > old, received at 2014-09-04 11:56:01.349122:
> > osd_op(client.12485646.0:110385143
> > rbd_data.4ccab12eb141f2.0000000000015b66 [stat,write 909312~4096]
> > 3.9c416abe e13901) v4 currently waiting for subops from [37,13]
> > 2014-09-04 11:56:36.511603 osd.22 [WRN] 34 slow requests, 6 included
> > below; oldest blocked for > 51.746797 secs 2014-09-04 11:56:36.511609
> > osd.22 [WRN] slow request 34.401476 seconds old, received at 2014-09-04
> > 11:56:02.110080: osd_op(client.12485643.0:117955489
> > rbd_data.4aad332eb141f2.0000000000002c00 [stat,write 2531328~4096]
> > 3.b985ec92 e13901) v4 currently waiting for subops from [33,1]
> > 2014-09-04 11:56:36.511612 osd.22 [WRN] slow request 34.149511 seconds
> > old, received at 2014-09-04 11:56:02.362045:
> > osd_op(client.12485793.0:121786505
> > rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write 3342336~4096]
> > 3.26907492 e13901) v4 currently waiting for subops from [33,1]
> > 2014-09-04 11:56:36.511615 osd.22 [WRN] slow request 34.329528 seconds
> > old, received at 2014-09-04 11:56:02.182028:
> > osd_op(client.12646411.0:31118266
> > rbd_data.81aaf52eb141f2.0000000000000e4f [stat,write 0~4194304]
> > 3.168c978d e13901) v4 currently waiting for subops from [49,29]
> > 2014-09-04 11:56:36.511618 osd.22 [WRN] slow request 31.871976 seconds
> > old, received at 2014-09-04 11:56:04.639580:
> > osd_op(client.12485793.0:121786506
> > rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write 0~4096] 3.26907492
> > e13901) v4 currently waiting for subops from [33,1] 2014-09-04
> > 11:56:36.511621 osd.22 [WRN] slow request 31.008867 seconds old,
> > received at 2014-09-04 11:56:05.502689:
> > osd_op(client.9266625.0:135901516
> > rbd_data.42d6792eb141f2.0000000000017436 [stat,write 2682880~4096]
> > 3.32067b97 e13901) v4 currently waiting for subops from [12,29]
> > 2014-09-04 11:56:36.723882 osd.17 [WRN] 5 slow requests, 1 included
> > below; oldest blocked for > 34.606349 secs 2014-09-04 11:56:36.723887
> > osd.17 [WRN] slow request 30.603057 seconds old, received at 2014-09-04
> > 11:56:06.120787: osd_op(client.12483520.0:11473870
> > rbd_data.4b8e9b3d1b58ba.0000000000004600 [stat,write 2293760~4096]
> > 3.49398689 e13901) v4 currently waiting for subops from [31,18]
> > 2014-09-04 11:56:37.511787 osd.22 [WRN] 24 slow requests, 6 included
> > below; oldest blocked for > 52.746984 secs 2014-09-04 11:56:37.511793
> > osd.22 [WRN] slow request 32.008986 seconds old, received at 2014-09-04
> > 11:56:05.502757: osd_op(client.9266625.0:135901517
> > rbd_data.42d6792eb141f2.0000000000017436 [stat,write 2695168~4096]
> > 3.32067b97 e13901) v4 currently waiting for subops from [12,29]
> > 2014-09-04 11:56:37.511798 osd.22 [WRN] slow request 31.749483 seconds
> > old, received at 2014-09-04 11:56:05.762260:
> > osd_op(client.9281282.0:55728036
> > rbd_data.2d99f62eb141f2.0000000000003753 [stat,write 1630208~8192]
> > 3.6625df97 e13901) v4 currently waiting for subops from [12,29]
> > 2014-09-04 11:56:37.511801 osd.22 [WRN] slow request 31.745558 seconds
> > old, received at 2014-09-04 11:56:05.766185:
> > osd_op(client.9281282.0:55728038
> > rbd_data.2d99f62eb141f2.0000000000003753 [stat,write 1941504~4096]
> > 3.6625df97 e13901) v4 currently waiting for subops from [12,29]
> > 2014-09-04 11:56:37.511803 osd.22 [WRN] slow request 31.390945 seconds
> > old, received at 2014-09-04 11:56:06.120798:
> > osd_op(client.12483520.0:11473871
> > rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write 2424832~8192]
> > 3.47dbbb97 e13901) v4 currently waiting for subops from [12,29]
> > 2014-09-04 11:56:37.511806 osd.22 [WRN] slow request 31.390357 seconds
> > old, received at 2014-09-04 11:56:06.121386:
> > osd_op(client.12483520.0:11473884
> > rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write 2457600~16384]
> > 3.47dbbb97 e13901) v4 currently waiting for subops from [12,29]
> >
> > Kind Regards,
> >
> > David
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20140905/b2c71dd4/attachment.htm>


[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux