Huge issues with slow requests

chibi@xxxxxxx (Christian Balzer) · Sun, 7 Sep 2014 01:05:42 +0900



Hello,

On Sat, 6 Sep 2014 17:52:59 +0200 Josef Johansson wrote:

> Hi,
> 
> Just realised that it could also be with a popularity bug as well and
> lots a small traffic. And seeing that it?s fast it gets popular until it
> hits the curb.
> 
I don't think I ever heard the term "popularity bug" before, care to
elaborate? 

> I?m seeing this in the stats I think.
> 
> Linux 3.13-0.bpo.1-amd64 (osd1) 	09/06/2014
> _x86_64_	(24 CPU)
Any particular reason you're not running 3.14?

> 
> 09/06/2014 05:48:41 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.21    0.00    1.00    2.86    0.00   93.93
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdm               0.02     1.47    7.05   42.72     0.67     1.07
> 71.43     0.41    8.17    6.41    8.46   3.44  17.13 sdn
> 0.03     1.42    6.17   37.08     0.57     0.92    70.51     0.08
> 1.76    6.47    0.98   3.46  14.98 sdg               0.03     1.44
> 6.27   36.62     0.56     0.94    71.40     0.34    8.00    6.83
> 8.20   3.45  14.78 sde               0.03     1.23    6.47   39.07
> 0.59     0.98    70.29     0.43    9.47    6.57    9.95   3.37  15.33
> sdf               0.02     1.26    6.47   33.77     0.61     0.87
> 75.30     0.22    5.39    6.00    5.27   3.52  14.17 sdl
> 0.03     1.44    6.44   40.54     0.59     1.08    72.68     0.21
> 4.49    6.56    4.16   3.40  15.95 sdk               0.03     1.41
> 5.62   35.92     0.52     0.90    70.10     0.15    3.58    6.17
> 3.17   3.45  14.32 sdj               0.03     1.26    6.30   34.23
> 0.57     0.83    70.84     0.31    7.65    6.56    7.85   3.48  14.10
> 
> Seeing that the drives are in pretty good shape but not giving lotsa
> read, I would assume that I need to tweak the cache to swallow more IO.
>
That looks indeed fine, as in, none of these disks looks suspicious to me.
 
> When I tweaked it before production I did not see any performance gains
> what so ever, so they are pretty low. And it?s odd because we just saw
> these problems a little while ago. So probably that we hit a limit where
> the disks are getting lot of IO.
> 
> I know that there?s some threads about this that I will read again.
>
URL?
 
Christian

> Thanks for the hints in looking at bad drives.
> 
> Regards,
> Josef
> 
> On 06 Sep 2014, at 17:41, Josef Johansson <josef at oderland.se> wrote:
> 
> > Hi,
> > 
> > On 06 Sep 2014, at 17:27, Christian Balzer <chibi at gol.com> wrote:
> > 
> >> 
> >> Hello,
> >> 
> >> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
> >> 
> >>> We manage to go through the restore, but the performance degradation
> >>> is still there.
> >>> 
> >> Manifesting itself how?
> >> 
> > Awful slow io on the VMs, and iowait, it?s about 2MB/s or so.
> > But mostly a lot of iowait.
> > 
> >>> Looking through the OSDs to pinpoint a source of the degradation and
> >>> hoping the current load will be lowered.
> >>> 
> >> 
> >> You're the one looking at your cluster, the iostat, atop, iotop and
> >> whatnot data.
> >> If one particular OSD/disk stands out, investigate it, as per the
> >> "Good way to monitor detailed latency/throughput" thread. 
> >> 
> > Will read it through.
> >> If you have a spare and idle machine that is identical to your storage
> >> nodes, you could run a fio benchmark on a disk there and then compare
> >> the results to that of your suspect disk after setting your cluster
> >> to noout and stopping that particular OSD.
> > No spare though, but I have a rough idea what it should be, what?s I?m
> > going at right now. Right, so the cluster should be fine after I stop
> > the OSD right? I though of stopping it a little bit to see if the IO
> > was better afterwards from within the VMs. Not sure how good effect it
> > makes though since it may be waiting for the IO to complete what not.
> >> 
> >>> I?m a bit afraid of doing the 0 to weight of an OSD, wouldn?t it be
> >>> tough if the degradation is still there afterwards? i.e. if I set
> >>> back the weight would it move back all the PGs?
> >>> 
> >> Of course.
> >> 
> >> Until you can determine that a specific OSD/disk is the culprit,
> >> don't do that. 
> >> If you have the evidence, go ahead.
> >> 
> > Great, that?s what I though as well.
> >> Regards,
> >> 
> >> Christian
> >> 
> >>> Regards,
> >>> Josef
> >>> 
> >>> On 06 Sep 2014, at 15:52, Josef Johansson <josef at oderland.se> wrote:
> >>> 
> >>>> FWI I did restart the OSDs until I saw a server that made impact.
> >>>> Until that server stopped doing impact, I didn?t get lower in the
> >>>> number objects being degraded. After a while it was done with
> >>>> recovering that OSD and happily started with others. I guess I will
> >>>> be seeing the same behaviour when it gets to replicating the same
> >>>> PGs that were causing troubles the first time.
> >>>> 
> >>>> On 06 Sep 2014, at 15:04, Josef Johansson <josef at oderland.se> wrote:
> >>>> 
> >>>>> Actually, it only worked with restarting  for a period of time to
> >>>>> get the recovering process going. Can?t get passed the 21k object
> >>>>> mark.
> >>>>> 
> >>>>> I?m uncertain if the disk really is messing this up right now as
> >>>>> well. So I?m not glad to start moving 300k objects around.
> >>>>> 
> >>>>> Regards,
> >>>>> Josef
> >>>>> 
> >>>>> On 06 Sep 2014, at 14:33, Josef Johansson <josef at oderland.se>
> >>>>> wrote:
> >>>>> 
> >>>>>> Hi,
> >>>>>> 
> >>>>>> On 06 Sep 2014, at 13:53, Christian Balzer <chibi at gol.com> wrote:
> >>>>>> 
> >>>>>>> 
> >>>>>>> Hello,
> >>>>>>> 
> >>>>>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
> >>>>>>> 
> >>>>>>>> Also putting this on the list.
> >>>>>>>> 
> >>>>>>>> On 06 Sep 2014, at 13:36, Josef Johansson <josef at oderland.se>
> >>>>>>>> wrote:
> >>>>>>>> 
> >>>>>>>>> Hi,
> >>>>>>>>> 
> >>>>>>>>> Same issues again, but I think we found the drive that causes
> >>>>>>>>> the problems.
> >>>>>>>>> 
> >>>>>>>>> But this is causing problems as it?s trying to do a recover to
> >>>>>>>>> that osd at the moment.
> >>>>>>>>> 
> >>>>>>>>> So we?re left with the status message 
> >>>>>>>>> 
> >>>>>>>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860
> >>>>>>>>> pgs: 6841 active+clean, 19 active+remapped+backfilling; 12299
> >>>>>>>>> GB data, 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd,
> >>>>>>>>> 192KB/s wr, 74op/s; 41424/15131923 degraded (0.274%);
> >>>>>>>>> recovering 0 o/s, 2035KB/s
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> It?s improving, but way too slowly. If I restart the recovery
> >>>>>>>>> (ceph osd set no recovery /unset) it doesn?t change the osd
> >>>>>>>>> what I can see.
> >>>>>>>>> 
> >>>>>>>>> Any ideas?
> >>>>>>>>> 
> >>>>>>> I don't know the state of your cluster, i.e. what caused the
> >>>>>>> recovery to start (how many OSDs went down?).
> >>>>>> Performance degradation, databases are the worst impacted. It?s
> >>>>>> actually a OSD that we put in that?s causing it (removed it again
> >>>>>> though). So the cluster in itself is healthy.
> >>>>>> 
> >>>>>>> If you have a replication of 3 and only one OSD was involved,
> >>>>>>> what is stopping you from taking that wonky drive/OSD out?
> >>>>>>> 
> >>>>>> There?s data that goes missing if I do that, I guess I have to
> >>>>>> wait for the recovery process to complete before I can go any
> >>>>>> further, this is with rep 3.
> >>>>>>> If you don't know that or want to play it safe, how about setting
> >>>>>>> the weight of that OSD to 0? 
> >>>>>>> While that will AFAICT still result in all primary PGs to be
> >>>>>>> evacuated off it, no more writes will happen to it and reads
> >>>>>>> might be faster. In either case, it shouldn't slow down the rest
> >>>>>>> of your cluster anymore.
> >>>>>>> 
> >>>>>> That?s actually one idea I haven?t thought off, I wan?t to play it
> >>>>>> safe right now and hope that it goes up again, I actually found
> >>>>>> one wonky way of getting the recovery process from not stalling
> >>>>>> to a grind, and that was restarting OSDs. One at the time.
> >>>>>> 
> >>>>>> Regards,
> >>>>>> Josef
> >>>>>>> Regards,
> >>>>>>> 
> >>>>>>> Christian
> >>>>>>>>> Cheers,
> >>>>>>>>> Josef
> >>>>>>>>> 
> >>>>>>>>> On 05 Sep 2014, at 11:26, Luis Periquito
> >>>>>>>>> <luis.periquito at ocado.com> wrote:
> >>>>>>>>> 
> >>>>>>>>>> Only time I saw such behaviour was when I was deleting a big
> >>>>>>>>>> chunk of data from the cluster: all the client activity was
> >>>>>>>>>> reduced, the op/s were almost non-existent and there was
> >>>>>>>>>> unjustified delays all over the cluster. But all the disks
> >>>>>>>>>> were somewhat busy in atop/iotstat.
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> On 5 September 2014 09:51, David <david at visions.se> wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>> 
> >>>>>>>>>> Indeed strange.
> >>>>>>>>>> 
> >>>>>>>>>> That output was when we had issues, seems that most operations
> >>>>>>>>>> were blocked / slow requests.
> >>>>>>>>>> 
> >>>>>>>>>> A ?baseline? output is more like today:
> >>>>>>>>>> 
> >>>>>>>>>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9273KB/s rd, 24650KB/s wr, 2755op/s
> >>>>>>>>>> 2014-09-05 10:44:30.125637 mon.0 [INF] pgmap v12582760: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9500KB/s rd, 20430KB/s wr, 2294op/s
> >>>>>>>>>> 2014-09-05 10:44:31.139427 mon.0 [INF] pgmap v12582761: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9216KB/s rd, 20062KB/s wr, 2488op/s
> >>>>>>>>>> 2014-09-05 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 12511KB/s rd, 15739KB/s wr, 2488op/s
> >>>>>>>>>> 2014-09-05 10:44:33.161210 mon.0 [INF] pgmap v12582763: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 18593KB/s rd, 14880KB/s wr, 2609op/s
> >>>>>>>>>> 2014-09-05 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 17720KB/s rd, 22964KB/s wr, 3257op/s
> >>>>>>>>>> 2014-09-05 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 19230KB/s rd, 18901KB/s wr, 3199op/s
> >>>>>>>>>> 2014-09-05 10:44:36.213535 mon.0 [INF] pgmap v12582766: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 17630KB/s rd, 18855KB/s wr, 3131op/s
> >>>>>>>>>> 2014-09-05 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 12262KB/s rd, 18627KB/s wr, 2595op/s
> >>>>>>>>>> 2014-09-05 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 17697KB/s rd, 17572KB/s wr, 2156op/s
> >>>>>>>>>> 2014-09-05 10:44:39.239409 mon.0 [INF] pgmap v12582769: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 20300KB/s rd, 19735KB/s wr, 2197op/s
> >>>>>>>>>> 2014-09-05 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 14656KB/s rd, 15460KB/s wr, 2199op/s
> >>>>>>>>>> 2014-09-05 10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 8969KB/s rd, 11918KB/s wr, 1951op/s
> >>>>>>>>>> 2014-09-05 10:44:42.276192 mon.0 [INF] pgmap v12582772: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 7272KB/s rd, 10644KB/s wr, 1832op/s
> >>>>>>>>>> 2014-09-05 10:44:43.291817 mon.0 [INF] pgmap v12582773: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9316KB/s rd, 16610KB/s wr, 2412op/s
> >>>>>>>>>> 2014-09-05 10:44:44.295469 mon.0 [INF] pgmap v12582774: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9257KB/s rd, 19953KB/s wr, 2633op/s
> >>>>>>>>>> 2014-09-05 10:44:45.315774 mon.0 [INF] pgmap v12582775: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9718KB/s rd, 14298KB/s wr, 2101op/s
> >>>>>>>>>> 2014-09-05 10:44:46.326783 mon.0 [INF] pgmap v12582776: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 20877KB/s rd, 12822KB/s wr, 2447op/s
> >>>>>>>>>> 2014-09-05 10:44:47.327537 mon.0 [INF] pgmap v12582777: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 18447KB/s rd, 12945KB/s wr, 2226op/s
> >>>>>>>>>> 2014-09-05 10:44:48.348725 mon.0 [INF] pgmap v12582778: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 15514KB/s rd, 19366KB/s wr, 1892op/s
> >>>>>>>>>> 2014-09-05 10:44:49.357537 mon.0 [INF] pgmap v12582779: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 16640KB/s rd, 19694KB/s wr, 1887op/s
> >>>>>>>>>> 2014-09-05 10:44:50.365016 mon.0 [INF] pgmap v12582780: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9139KB/s rd, 10734KB/s wr, 1566op/s
> >>>>>>>>>> 2014-09-05 10:44:51.385553 mon.0 [INF] pgmap v12582781: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 11493KB/s rd, 13827KB/s wr, 1928op/s
> >>>>>>>>>> 2014-09-05 10:44:52.389189 mon.0 [INF] pgmap v12582782: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 12922KB/s rd, 12814KB/s wr, 1670op/s
> >>>>>>>>>> 2014-09-05 10:44:53.403735 mon.0 [INF] pgmap v12582783: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 14880KB/s rd, 17532KB/s wr, 2145op/s
> >>>>>>>>>> 2014-09-05 10:44:54.410581 mon.0 [INF] pgmap v12582784: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 15419KB/s rd, 19696KB/s wr, 2259op/s
> >>>>>>>>>> 2014-09-05 10:44:55.420308 mon.0 [INF] pgmap v12582785: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 10964KB/s rd, 15739KB/s wr, 1701op/s
> >>>>>>>>>> 2014-09-05 10:44:56.445017 mon.0 [INF] pgmap v12582786: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 16195KB/s rd, 15554KB/s wr, 2444op/s
> >>>>>>>>>> 2014-09-05 10:44:57.452136 mon.0 [INF] pgmap v12582787: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 15016KB/s rd, 16356KB/s wr, 2358op/s
> >>>>>>>>>> 2014-09-05 10:44:58.465958 mon.0 [INF] pgmap v12582788: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 11668KB/s rd, 18443KB/s wr, 2029op/s
> >>>>>>>>>> 2014-09-05 10:44:59.483462 mon.0 [INF] pgmap v12582789: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 12253KB/s rd, 10846KB/s wr, 1529op/s
> >>>>>>>>>> 2014-09-05 10:45:00.492322 mon.0 [INF] pgmap v12582790: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 12247KB/s rd, 7084KB/s wr, 1464op/s
> >>>>>>>>>> 2014-09-05 10:45:01.516581 mon.0 [INF] pgmap v12582791: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 16460KB/s rd, 12089KB/s wr, 2537op/s
> >>>>>>>>>> 2014-09-05 10:45:02.527110 mon.0 [INF] pgmap v12582792: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 13382KB/s rd, 15080KB/s wr, 2563op/s
> >>>>>>>>>> 2014-09-05 10:45:03.538090 mon.0 [INF] pgmap v12582793: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 10902KB/s rd, 18745KB/s wr, 2863op/s
> >>>>>>>>>> 2014-09-05 10:45:04.558261 mon.0 [INF] pgmap v12582794: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 10850KB/s rd, 15995KB/s wr, 2695op/s
> >>>>>>>>>> 2014-09-05 10:45:05.565750 mon.0 [INF] pgmap v12582795: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 9636KB/s rd, 13262KB/s wr, 2372op/s
> >>>>>>>>>> 2014-09-05 10:45:06.593984 mon.0 [INF] pgmap v12582796: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 18808KB/s rd, 19329KB/s wr, 3819op/s
> >>>>>>>>>> 2014-09-05 10:45:07.595866 mon.0 [INF] pgmap v12582797: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12253 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 21265KB/s rd, 20743KB/s wr, 3861op/s
> >>>>>>>>>> 2014-09-05 10:45:08.624949 mon.0 [INF] pgmap v12582798: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12254 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 20114KB/s rd, 18543KB/s wr, 3248op/s
> >>>>>>>>>> 2014-09-05 10:45:09.627901 mon.0 [INF] pgmap v12582799: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12254 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 14717KB/s rd, 15141KB/s wr, 2302op/s
> >>>>>>>>>> 2014-09-05 10:45:10.643234 mon.0 [INF] pgmap v12582800: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12254 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 8328KB/s rd, 13950KB/s wr, 1919op/s
> >>>>>>>>>> 2014-09-05 10:45:11.651602 mon.0 [INF] pgmap v12582801: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12254 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 16978KB/s rd, 15921KB/s wr, 3377op/s
> >>>>>>>>>> 2014-09-05 10:45:12.674819 mon.0 [INF] pgmap v12582802: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12254 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 16471KB/s rd, 14034KB/s wr, 3379op/s
> >>>>>>>>>> 2014-09-05 10:45:13.688080 mon.0 [INF] pgmap v12582803: 6860
> >>>>>>>>>> pgs: 6860 active+clean; 12254 GB data, 36574 GB used, 142
> >>>>>>>>>> TB / 178 TB avail; 16149KB/s rd, 12657KB/s wr, 2734op/s
> >>>>>>>>>> 
> >>>>>>>>>> Aye, we actually saw latency on the disks go up a bit when we
> >>>>>>>>>> had 128GB of RAM on the OSDs and decided to beef them up to
> >>>>>>>>>> 256GB which helped. They?re running different workloads
> >>>>>>>>>> (shared hosting) but we?ve never encountered the issue we had
> >>>>>>>>>> yesterday even during our testing/benchmarking.
> >>>>>>>>>> 
> >>>>>>>>>> Kind Regards,
> >>>>>>>>>> David
> >>>>>>>>>> 
> >>>>>>>>>> 5 sep 2014 kl. 09:05 skrev Christian Balzer <chibi at gol.com>:
> >>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> Hello,
> >>>>>>>>>>> 
> >>>>>>>>>>> On Fri, 5 Sep 2014 08:26:47 +0200 David wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Sorry for the lack of information yesterday, this was
> >>>>>>>>>>>> "solved" after some 30 minutes, after having
> >>>>>>>>>>>> reloaded/restarted all osd daemons. Unfortunately we
> >>>>>>>>>>>> couldn?t pin point it to a single OSD or drive, all drives
> >>>>>>>>>>>> seemed ok, some had a bit higher latency and we tried to
> >>>>>>>>>>>> out / in them to see if it had any effect which it did not.
> >>>>>>>>>>>> 
> >>>>>>>>>>> This is odd.
> >>>>>>>>>>> Having it "fixed" by restarting all OSDs would suggest
> >>>>>>>>>>> either a software problem (bug) with Ceph or some resource
> >>>>>>>>>>> other than the storage system being starved. But memory
> >>>>>>>>>>> seems unlikely, even with bloated, leaking OSD daemon. And
> >>>>>>>>>>> CPU seems even less likely.
> >>>>>>>>>>> 
> >>>>>>>>>>>> The cluster consists of 3 mon servrers, 5 OSD servers with
> >>>>>>>>>>>> 10 enterprise HDDs backed with 2 S3700 SSDs for journals
> >>>>>>>>>>>> each. OSD servers have 256GB of RAM, 2x E5-2630 v2 @
> >>>>>>>>>>>> 2.60GHz CPUs.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The log that I posted yesterday was just a small taste of
> >>>>>>>>>>>> the full one ;) They were all pointing to different osd?s
> >>>>>>>>>>>> that they were waiting for. We?re also monitoring all of
> >>>>>>>>>>>> the VMs running on KVM, and we didn?t see any exceptional
> >>>>>>>>>>>> throughput or iops usage before or during this event. We
> >>>>>>>>>>>> were checking iostat etc and nothing was out of the
> >>>>>>>>>>>> ordinary..
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Going to double check SMART and also see if we can off load
> >>>>>>>>>>>> some of the cluster in any way. If you have any other advice
> >>>>>>>>>>>> that?d be appreciated :)
> >>>>>>>>>>>> 
> >>>>>>>>>>> If the mon output below is typical for your cluster (634KB/s
> >>>>>>>>>>> rd, 487KB/s wr, 90op/s) it should be able to handle that w/o
> >>>>>>>>>>> breaking a sweat. I guestimate it at about 13K write IOPS and
> >>>>>>>>>>> your reads should (after a while) come mostly from the nice
> >>>>>>>>>>> big pagecache you have on those 256GB RAM machines.
> >>>>>>>>>>> 
> >>>>>>>>>>> Christian
> >>>>>>>>>>> 
> >>>>>>>>>>>> Thanks for your help!
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>> David
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 5 sep 2014 kl. 07:30 skrev Martin B Nielsen
> >>>>>>>>>>>> <martin at unity3d.com>:
> >>>>>>>>>>>> 
> >>>>>>>>>>>>> Just echoing what Christian said.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Also, iirc the "currently waiting for subobs on [" could
> >>>>>>>>>>>>> also mean a problem on those as it waits for ack from them
> >>>>>>>>>>>>> (I might remember wrong).
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> If that is the case you might want to check in on osd 13 &
> >>>>>>>>>>>>> 37 as well.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> With the cluster load and size you should not have this
> >>>>>>>>>>>>> problem; I'm pretty sure you're dealing with a rogue/faulty
> >>>>>>>>>>>>> osd/node somewhere.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>> Martin
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> On Fri, Sep 5, 2014 at 2:28 AM, Christian Balzer
> >>>>>>>>>>>>> <chibi at gol.com> wrote: On Thu, 4 Sep 2014 12:02:13 +0200
> >>>>>>>>>>>>> David wrote:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> We?re running a ceph cluster with version:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 0.67.7-1~bpo70+1
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> All of a sudden we?re having issues with the cluster
> >>>>>>>>>>>>>> (running RBD images for kvm) with slow requests on all of
> >>>>>>>>>>>>>> the OSD servers. Any idea why and how to fix it?
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>> You give us a Ceph version at least, but for anybody to
> >>>>>>>>>>>>> make guesses we need much more information than a log spew.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> How many nodes/OSDs, OS, hardware, OSD details (FS,
> >>>>>>>>>>>>> journals on SSDs), etc.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Run atop (in a sufficiently large terminal) on all your
> >>>>>>>>>>>>> nodes, see if you can spot a bottleneck, like a disk being
> >>>>>>>>>>>>> at 100% all the time with a much higher avio than the
> >>>>>>>>>>>>> others. Looking at your logs, I'd pay particular attention
> >>>>>>>>>>>>> to the disk holding osd.22.
> >>>>>>>>>>>>> A single slow disk can bring a whole large cluster to a
> >>>>>>>>>>>>> crawl. If you're using a hardware controller with a
> >>>>>>>>>>>>> battery backed up cache, check if that is fine, loss of
> >>>>>>>>>>>>> the battery would switch from writeback to writethrough
> >>>>>>>>>>>>> and massively slow down IOPS.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Christian
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 2014-09-04 11:56:35.868521 mon.0 [INF] pgmap v12504451:
> >>>>>>>>>>>>>> 6860 pgs: 6860 active+clean; 12163 GB data, 36308 GB
> >>>>>>>>>>>>>> used, 142 TB / 178 TB avail; 634KB/s rd, 487KB/s wr,
> >>>>>>>>>>>>>> 90op/s 2014-09-04 11:56:29.510270 osd.22 [WRN] 15 slow
> >>>>>>>>>>>>>> requests, 1 included below; oldest blocked for
> >>>>>>>>>>>>>>> 44.745754 secs 2014-09-04 11:56:29.510276 osd.22 [WRN]
> >>>>>>>>>>>>>>> slow request 30.999821 seconds
> >>>>>>>>>>>>>> old, received at 2014-09-04 11:55:58.510424:
> >>>>>>>>>>>>>> osd_op(client.10731617.0:81868956
> >>>>>>>>>>>>>> rbd_data.967e022eb141f2.0000000000000e72 [write 0~4194304]
> >>>>>>>>>>>>>> 3.c585cebe e13901) v4 currently waiting for subops from
> >>>>>>>>>>>>>> [37,13] 2014-09-04 11:56:30.510528 osd.22 [WRN] 21 slow
> >>>>>>>>>>>>>> requests, 6 included below; oldest blocked for > 45.745989
> >>>>>>>>>>>>>> secs 2014-09-04 11:56:30.510534 osd.22 [WRN] slow request
> >>>>>>>>>>>>>> 30.122555 seconds old, received at 2014-09-04
> >>>>>>>>>>>>>> 11:56:00.387925: osd_op(client.13425082.0:11962345
> >>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
> >>>>>>>>>>>>>> 1114112~8192] 3.c9e49140 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [13,42] 2014-09-04 11:56:30.510537 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 30.122362 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.388118:
> >>>>>>>>>>>>>> osd_op(client.13425082.0:11962352
> >>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
> >>>>>>>>>>>>>> 1134592~4096] 3.c9e49140 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [13,42] 2014-09-04 11:56:30.510541 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 30.122298 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.388182:
> >>>>>>>>>>>>>> osd_op(client.13425082.0:11962353
> >>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
> >>>>>>>>>>>>>> 4046848~8192] 3.c9e49140 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [13,42] 2014-09-04 11:56:30.510544 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 30.121577 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.388903:
> >>>>>>>>>>>>>> osd_op(client.13425082.0:11962374
> >>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write
> >>>>>>>>>>>>>> 2527232~4096] 3.cd9a9015 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [45,1] 2014-09-04 11:56:30.510548 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 30.121518 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.388962:
> >>>>>>>>>>>>>> osd_op(client.13425082.0:11962375
> >>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write
> >>>>>>>>>>>>>> 3133440~4096] 3.cd9a9015 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [45,1] 2014-09-04 11:56:31.510706 osd.22
> >>>>>>>>>>>>>> [WRN] 26 slow requests, 6 included below; oldest blocked
> >>>>>>>>>>>>>> for > 46.746163 secs 2014-09-04 11:56:31.510711 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.035418 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475236:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900595
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2097152~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:31.510715 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.035335 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475319:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900596
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2162688~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:31.510718 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.035270 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475384:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900597
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2400256~16384] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:31.510721 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.035093 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475561:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900598
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2420736~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:31.510724 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.034990 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475664:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900599
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2428928~12288] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.510864 osd.22
> >>>>>>>>>>>>>> [WRN] 28 slow requests, 6 included below; oldest blocked
> >>>>>>>>>>>>>> for > 47.746324 secs 2014-09-04 11:56:32.510871 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 32.035026 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475789:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900600
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2453504~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510876 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 32.034982 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475833:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900601
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2502656~16384] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.510881 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 32.034917 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.475898:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900602
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2600960~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510885 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 32.034807 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476008:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900603
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2617344~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510890 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 32.034756 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476059:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900604
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2637824~69632] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:32.723329 osd.17
> >>>>>>>>>>>>>> [WRN] 4 slow requests, 4 included below; oldest blocked
> >>>>>>>>>>>>>> for > 30.605794 secs 2014-09-04 11:56:32.723335 osd.17
> >>>>>>>>>>>>>> [WRN] slow request 30.605794 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:02.117495:
> >>>>>>>>>>>>>> osd_op(client.12646411.0:31118135
> >>>>>>>>>>>>>> rbd_data.81aaf52eb141f2.0000000000000e4b [stat,write
> >>>>>>>>>>>>>> 0~4194304] 3.716b373b e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [31,1] 2014-09-04 11:56:32.723338 osd.17
> >>>>>>>>>>>>>> [WRN] slow request 30.362042 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:02.361247:
> >>>>>>>>>>>>>> osd_op(client.12485793.0:121786489
> >>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.000000000000e200 [stat,write
> >>>>>>>>>>>>>> 2834432~4096] 3.6b43de72 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [5,31] 2014-09-04 11:56:32.723341 osd.17
> >>>>>>>>>>>>>> [WRN] slow request 30.179681 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:02.543608:
> >>>>>>>>>>>>>> osd_op(client.13425079.0:7014393
> >>>>>>>>>>>>>> rbd_data.54f2522eb141f2.0000000000014d01 [stat,write
> >>>>>>>>>>>>>> 2502656~4096] 3.d8ceea89 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [31,18] 2014-09-04 11:56:32.723344 osd.17
> >>>>>>>>>>>>>> [WRN] slow request 30.000568 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:02.722721:
> >>>>>>>>>>>>>> osd_op(client.9281279.0:144230325
> >>>>>>>>>>>>>> rbd_data.2d99fc2eb141f2.000000000000e80a [stat,write
> >>>>>>>>>>>>>> 1589248~4096] 3.16dc0e89 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [31,18] 2014-09-04 11:56:33.511043 osd.22
> >>>>>>>>>>>>>> [WRN] 33 slow requests, 6 included below; oldest blocked
> >>>>>>>>>>>>>> for > 48.746486 secs 2014-09-04 11:56:33.511049 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 33.034756 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476221:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900605
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2715648~8192] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511053 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 33.034620 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476357:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900606
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2756608~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511056 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 33.034446 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476531:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900607
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2809856~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511060 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 33.034387 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476590:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900608
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2883584~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511063 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 33.034341 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476636:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900609
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2891776~8192] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511206 osd.22
> >>>>>>>>>>>>>> [WRN] 38 slow requests, 6 included below; oldest blocked
> >>>>>>>>>>>>>> for > 49.746664 secs 2014-09-04 11:56:34.511215 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.034426 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476729:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900610
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2908160~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511219 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.034298 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476857:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900611
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2932736~28672] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:34.511222 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.034234 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.476921:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900612
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2969600~8192] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511225 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.034147 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.477008:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900613
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2981888~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511228 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.034055 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.477100:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900614
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 2998272~12288] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:35.153813 osd.37
> >>>>>>>>>>>>>> [WRN] 1 slow requests, 1 included below; oldest blocked
> >>>>>>>>>>>>>> for > 30.307864 secs 2014-09-04 11:56:35.153820 osd.37
> >>>>>>>>>>>>>> [WRN] slow request 30.307864 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:04.845917:
> >>>>>>>>>>>>>> osd_op(client.10785413.0:129409910
> >>>>>>>>>>>>>> rbd_data.96803f2eb141f2.00000000000042b0 [stat,write
> >>>>>>>>>>>>>> 1712128~4096] 3.6c5ac950 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [22,41] 2014-09-04 11:56:35.511450 osd.22
> >>>>>>>>>>>>>> [WRN] 42 slow requests, 6 included below; oldest blocked
> >>>>>>>>>>>>>> for > 50.746638 secs 2014-09-04 11:56:35.511455 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 35.034219 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.477178:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900615
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 3022848~8192] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.511459 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 35.034143 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.477254:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900616
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 3035136~12288] 3.a2894ebe e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [37,13] 2014-09-04 11:56:35.511462 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 35.034082 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.477315:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900617
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 3055616~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.511465 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 35.033965 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:00.477432:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135900618
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
> >>>>>>>>>>>>>> 3063808~4096] 3.a2894ebe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.511469 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.162275 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:01.349122:
> >>>>>>>>>>>>>> osd_op(client.12485646.0:110385143
> >>>>>>>>>>>>>> rbd_data.4ccab12eb141f2.0000000000015b66 [stat,write
> >>>>>>>>>>>>>> 909312~4096] 3.9c416abe e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:36.511603 osd.22
> >>>>>>>>>>>>>> [WRN] 34 slow requests, 6 included below; oldest blocked
> >>>>>>>>>>>>>> for > 51.746797 secs 2014-09-04 11:56:36.511609 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.401476 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:02.110080:
> >>>>>>>>>>>>>> osd_op(client.12485643.0:117955489
> >>>>>>>>>>>>>> rbd_data.4aad332eb141f2.0000000000002c00 [stat,write
> >>>>>>>>>>>>>> 2531328~4096] 3.b985ec92 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [33,1] 2014-09-04 11:56:36.511612 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.149511 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:02.362045:
> >>>>>>>>>>>>>> osd_op(client.12485793.0:121786505
> >>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write
> >>>>>>>>>>>>>> 3342336~4096] 3.26907492 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [33,1] 2014-09-04 11:56:36.511615 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 34.329528 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:02.182028:
> >>>>>>>>>>>>>> osd_op(client.12646411.0:31118266
> >>>>>>>>>>>>>> rbd_data.81aaf52eb141f2.0000000000000e4f [stat,write
> >>>>>>>>>>>>>> 0~4194304] 3.168c978d e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [49,29] 2014-09-04 11:56:36.511618 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.871976 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:04.639580:
> >>>>>>>>>>>>>> osd_op(client.12485793.0:121786506
> >>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write
> >>>>>>>>>>>>>> 0~4096] 3.26907492 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [33,1] 2014-09-04 11:56:36.511621 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.008867 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:05.502689:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135901516
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.0000000000017436 [stat,write
> >>>>>>>>>>>>>> 2682880~4096] 3.32067b97 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:36.723882 osd.17
> >>>>>>>>>>>>>> [WRN] 5 slow requests, 1 included below; oldest blocked
> >>>>>>>>>>>>>> for > 34.606349 secs 2014-09-04 11:56:36.723887 osd.17
> >>>>>>>>>>>>>> [WRN] slow request 30.603057 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:06.120787:
> >>>>>>>>>>>>>> osd_op(client.12483520.0:11473870
> >>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000004600 [stat,write
> >>>>>>>>>>>>>> 2293760~4096] 3.49398689 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [31,18] 2014-09-04 11:56:37.511787 osd.22
> >>>>>>>>>>>>>> [WRN] 24 slow requests, 6 included below; oldest blocked
> >>>>>>>>>>>>>> for > 52.746984 secs 2014-09-04 11:56:37.511793 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 32.008986 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:05.502757:
> >>>>>>>>>>>>>> osd_op(client.9266625.0:135901517
> >>>>>>>>>>>>>> rbd_data.42d6792eb141f2.0000000000017436 [stat,write
> >>>>>>>>>>>>>> 2695168~4096] 3.32067b97 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511798 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.749483 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:05.762260:
> >>>>>>>>>>>>>> osd_op(client.9281282.0:55728036
> >>>>>>>>>>>>>> rbd_data.2d99f62eb141f2.0000000000003753 [stat,write
> >>>>>>>>>>>>>> 1630208~8192] 3.6625df97 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511801 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.745558 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:05.766185:
> >>>>>>>>>>>>>> osd_op(client.9281282.0:55728038
> >>>>>>>>>>>>>> rbd_data.2d99f62eb141f2.0000000000003753 [stat,write
> >>>>>>>>>>>>>> 1941504~4096] 3.6625df97 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511803 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.390945 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:06.120798:
> >>>>>>>>>>>>>> osd_op(client.12483520.0:11473871
> >>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write
> >>>>>>>>>>>>>> 2424832~8192] 3.47dbbb97 e13901) v4 currently waiting for
> >>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511806 osd.22
> >>>>>>>>>>>>>> [WRN] slow request 31.390357 seconds old, received at
> >>>>>>>>>>>>>> 2014-09-04 11:56:06.121386:
> >>>>>>>>>>>>>> osd_op(client.12483520.0:11473884
> >>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write
> >>>>>>>>>>>>>> 2457600~16384] 3.47dbbb97 e13901) v4 currently waiting
> >>>>>>>>>>>>>> for subops from [12,29]
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> David
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>>>>> ceph-users at lists.ceph.com
> >>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Christian Balzer        Network/Systems Engineer
> >>>>>>>>>>>>> chibi at gol.com           Global OnLine Japan/Fusion
> >>>>>>>>>>>>> Communications http://www.gol.com/
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>>>> ceph-users at lists.ceph.com
> >>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>>>> ceph-users at lists.ceph.com
> >>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> --
> >>>>>>>>>>> Christian Balzer        Network/Systems Engineer
> >>>>>>>>>>> chibi at gol.com         Global OnLine Japan/Fusion
> >>>>>>>>>>> Communications http://www.gol.com/
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>> ceph-users at lists.ceph.com
> >>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>>> 
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> ceph-users mailing list
> >>>>>>>>>> ceph-users at lists.ceph.com
> >>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> -- 
> >>>>>>>>>> Luis Periquito
> >>>>>>>>>> Unix Engineer
> >>>>>>>>>> 
> >>>>>>>>>> Ocado.com
> >>>>>>>>>> 
> >>>>>>>>>> Head Office, Titan Court, 3 Bishop Square, Hatfield Business
> >>>>>>>>>> Park, Hatfield, Herts AL10 9NE
> >>>>>>>>>> 
> >>>>>>>>>> Notice:  This email is confidential and may contain copyright
> >>>>>>>>>> material of members of the Ocado Group. Opinions and views
> >>>>>>>>>> expressed in this message may not necessarily reflect the
> >>>>>>>>>> opinions and views of the members of the Ocado Group.
> >>>>>>>>>> 
> >>>>>>>>>> If you are not the intended recipient, please notify us
> >>>>>>>>>> immediately and delete all copies of this message. Please note
> >>>>>>>>>> that it is your responsibility to scan this message for
> >>>>>>>>>> viruses. 
> >>>>>>>>>> 
> >>>>>>>>>> References to the ?Ocado Group? are to Ocado Group plc
> >>>>>>>>>> (registered in England and Wales with number 7098618) and its
> >>>>>>>>>> subsidiary undertakings (as that expression is defined in the
> >>>>>>>>>> Companies Act 2006) from time to time.  The registered office
> >>>>>>>>>> of Ocado Group plc is Titan Court, 3 Bishops Square, Hatfield
> >>>>>>>>>> Business Park, Hatfield, Herts. AL10 9NE.
> >>>>>>>>>> 
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> ceph-users mailing list
> >>>>>>>>>> ceph-users at lists.ceph.com
> >>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> Christian Balzer        Network/Systems Engineer                
> >>>>>>> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> >>>>>>> http://www.gol.com/
> >>>>>> 
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-users at lists.ceph.com
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>> 
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-users at lists.ceph.com
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>> 
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users at lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> 
> >>> 
> >> 
> >> 
> >> -- 
> >> Christian Balzer        Network/Systems Engineer                
> >> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> >> http://www.gol.com/
> > 
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/