Huge issues with slow requests

josef@xxxxxxxxxxx (Josef Johansson) · Sat, 6 Sep 2014 17:52:59 +0200

Hi,

Just realised that it could also be with a popularity bug as well and lots a small traffic. And seeing that it?s fast it gets popular until it hits the curb.

I?m seeing this in the stats I think.

Linux 3.13-0.bpo.1-amd64 (osd1) 	09/06/2014 	_x86_64_	(24 CPU)

09/06/2014 05:48:41 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.21    0.00    1.00    2.86    0.00   93.93

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdm               0.02     1.47    7.05   42.72     0.67     1.07    71.43     0.41    8.17    6.41    8.46   3.44  17.13
sdn               0.03     1.42    6.17   37.08     0.57     0.92    70.51     0.08    1.76    6.47    0.98   3.46  14.98
sdg               0.03     1.44    6.27   36.62     0.56     0.94    71.40     0.34    8.00    6.83    8.20   3.45  14.78
sde               0.03     1.23    6.47   39.07     0.59     0.98    70.29     0.43    9.47    6.57    9.95   3.37  15.33
sdf               0.02     1.26    6.47   33.77     0.61     0.87    75.30     0.22    5.39    6.00    5.27   3.52  14.17
sdl               0.03     1.44    6.44   40.54     0.59     1.08    72.68     0.21    4.49    6.56    4.16   3.40  15.95
sdk               0.03     1.41    5.62   35.92     0.52     0.90    70.10     0.15    3.58    6.17    3.17   3.45  14.32
sdj               0.03     1.26    6.30   34.23     0.57     0.83    70.84     0.31    7.65    6.56    7.85   3.48  14.10

Seeing that the drives are in pretty good shape but not giving lotsa read, I would assume that I need to tweak the cache to swallow more IO.

When I tweaked it before production I did not see any performance gains what so ever, so they are pretty low. And it?s odd because we just saw these problems a little while ago. So probably that we hit a limit where the disks are getting lot of IO.

I know that there?s some threads about this that I will read again.

Thanks for the hints in looking at bad drives.

Regards,
Josef

On 06 Sep 2014, at 17:41, Josef Johansson <josef at oderland.se> wrote:

> Hi,
> 
> On 06 Sep 2014, at 17:27, Christian Balzer <chibi at gol.com> wrote:
> 
>> 
>> Hello,
>> 
>> On Sat, 6 Sep 2014 17:10:11 +0200 Josef Johansson wrote:
>> 
>>> We manage to go through the restore, but the performance degradation is
>>> still there.
>>> 
>> Manifesting itself how?
>> 
> Awful slow io on the VMs, and iowait, it?s about 2MB/s or so.
> But mostly a lot of iowait.
> 
>>> Looking through the OSDs to pinpoint a source of the degradation and
>>> hoping the current load will be lowered.
>>> 
>> 
>> You're the one looking at your cluster, the iostat, atop, iotop and
>> whatnot data.
>> If one particular OSD/disk stands out, investigate it, as per the "Good
>> way to monitor detailed latency/throughput" thread. 
>> 
> Will read it through.
>> If you have a spare and idle machine that is identical to your storage
>> nodes, you could run a fio benchmark on a disk there and then compare the
>> results to that of your suspect disk after setting your cluster to noout
>> and stopping that particular OSD.
> No spare though, but I have a rough idea what it should be, what?s I?m going at right now.
> Right, so the cluster should be fine after I stop the OSD right? I though of stopping it a little bit to see if the IO was better afterwards from within the VMs. Not sure how good effect it makes though since it may be waiting for the IO to complete what not.
>> 
>>> I?m a bit afraid of doing the 0 to weight of an OSD, wouldn?t it be
>>> tough if the degradation is still there afterwards? i.e. if I set back
>>> the weight would it move back all the PGs?
>>> 
>> Of course.
>> 
>> Until you can determine that a specific OSD/disk is the culprit, don't do
>> that. 
>> If you have the evidence, go ahead.
>> 
> Great, that?s what I though as well.
>> Regards,
>> 
>> Christian
>> 
>>> Regards,
>>> Josef
>>> 
>>> On 06 Sep 2014, at 15:52, Josef Johansson <josef at oderland.se> wrote:
>>> 
>>>> FWI I did restart the OSDs until I saw a server that made impact.
>>>> Until that server stopped doing impact, I didn?t get lower in the
>>>> number objects being degraded. After a while it was done with
>>>> recovering that OSD and happily started with others. I guess I will be
>>>> seeing the same behaviour when it gets to replicating the same PGs
>>>> that were causing troubles the first time.
>>>> 
>>>> On 06 Sep 2014, at 15:04, Josef Johansson <josef at oderland.se> wrote:
>>>> 
>>>>> Actually, it only worked with restarting  for a period of time to get
>>>>> the recovering process going. Can?t get passed the 21k object mark.
>>>>> 
>>>>> I?m uncertain if the disk really is messing this up right now as
>>>>> well. So I?m not glad to start moving 300k objects around.
>>>>> 
>>>>> Regards,
>>>>> Josef
>>>>> 
>>>>> On 06 Sep 2014, at 14:33, Josef Johansson <josef at oderland.se> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 06 Sep 2014, at 13:53, Christian Balzer <chibi at gol.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> On Sat, 6 Sep 2014 13:37:25 +0200 Josef Johansson wrote:
>>>>>>> 
>>>>>>>> Also putting this on the list.
>>>>>>>> 
>>>>>>>> On 06 Sep 2014, at 13:36, Josef Johansson <josef at oderland.se>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Same issues again, but I think we found the drive that causes the
>>>>>>>>> problems.
>>>>>>>>> 
>>>>>>>>> But this is causing problems as it?s trying to do a recover to
>>>>>>>>> that osd at the moment.
>>>>>>>>> 
>>>>>>>>> So we?re left with the status message 
>>>>>>>>> 
>>>>>>>>> 2014-09-06 13:35:07.580007 mon.0 [INF] pgmap v12678802: 6860 pgs:
>>>>>>>>> 6841 active+clean, 19 active+remapped+backfilling; 12299 GB data,
>>>>>>>>> 36882 GB used, 142 TB / 178 TB avail; 1921KB/s rd, 192KB/s wr,
>>>>>>>>> 74op/s; 41424/15131923 degraded (0.274%);  recovering 0 o/s,
>>>>>>>>> 2035KB/s
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> It?s improving, but way too slowly. If I restart the recovery
>>>>>>>>> (ceph osd set no recovery /unset) it doesn?t change the osd what
>>>>>>>>> I can see.
>>>>>>>>> 
>>>>>>>>> Any ideas?
>>>>>>>>> 
>>>>>>> I don't know the state of your cluster, i.e. what caused the
>>>>>>> recovery to start (how many OSDs went down?).
>>>>>> Performance degradation, databases are the worst impacted. It?s
>>>>>> actually a OSD that we put in that?s causing it (removed it again
>>>>>> though). So the cluster in itself is healthy.
>>>>>> 
>>>>>>> If you have a replication of 3 and only one OSD was involved, what
>>>>>>> is stopping you from taking that wonky drive/OSD out?
>>>>>>> 
>>>>>> There?s data that goes missing if I do that, I guess I have to wait
>>>>>> for the recovery process to complete before I can go any further,
>>>>>> this is with rep 3.
>>>>>>> If you don't know that or want to play it safe, how about setting
>>>>>>> the weight of that OSD to 0? 
>>>>>>> While that will AFAICT still result in all primary PGs to be
>>>>>>> evacuated off it, no more writes will happen to it and reads might
>>>>>>> be faster. In either case, it shouldn't slow down the rest of your
>>>>>>> cluster anymore.
>>>>>>> 
>>>>>> That?s actually one idea I haven?t thought off, I wan?t to play it
>>>>>> safe right now and hope that it goes up again, I actually found one
>>>>>> wonky way of getting the recovery process from not stalling to a
>>>>>> grind, and that was restarting OSDs. One at the time.
>>>>>> 
>>>>>> Regards,
>>>>>> Josef
>>>>>>> Regards,
>>>>>>> 
>>>>>>> Christian
>>>>>>>>> Cheers,
>>>>>>>>> Josef
>>>>>>>>> 
>>>>>>>>> On 05 Sep 2014, at 11:26, Luis Periquito
>>>>>>>>> <luis.periquito at ocado.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Only time I saw such behaviour was when I was deleting a big
>>>>>>>>>> chunk of data from the cluster: all the client activity was
>>>>>>>>>> reduced, the op/s were almost non-existent and there was
>>>>>>>>>> unjustified delays all over the cluster. But all the disks were
>>>>>>>>>> somewhat busy in atop/iotstat.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 5 September 2014 09:51, David <david at visions.se> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> Indeed strange.
>>>>>>>>>> 
>>>>>>>>>> That output was when we had issues, seems that most operations
>>>>>>>>>> were blocked / slow requests.
>>>>>>>>>> 
>>>>>>>>>> A ?baseline? output is more like today:
>>>>>>>>>> 
>>>>>>>>>> 2014-09-05 10:44:29.123681 mon.0 [INF] pgmap v12582759: 6860 pgs:
>>>>>>>>>> 6860 active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9273KB/s rd, 24650KB/s wr, 2755op/s 2014-09-05
>>>>>>>>>> 10:44:30.125637 mon.0 [INF] pgmap v12582760: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9500KB/s rd, 20430KB/s wr, 2294op/s 2014-09-05
>>>>>>>>>> 10:44:31.139427 mon.0 [INF] pgmap v12582761: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9216KB/s rd, 20062KB/s wr, 2488op/s 2014-09-05
>>>>>>>>>> 10:44:32.144945 mon.0 [INF] pgmap v12582762: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 12511KB/s rd, 15739KB/s wr, 2488op/s 2014-09-05
>>>>>>>>>> 10:44:33.161210 mon.0 [INF] pgmap v12582763: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 18593KB/s rd, 14880KB/s wr, 2609op/s 2014-09-05
>>>>>>>>>> 10:44:34.187294 mon.0 [INF] pgmap v12582764: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 17720KB/s rd, 22964KB/s wr, 3257op/s 2014-09-05
>>>>>>>>>> 10:44:35.190785 mon.0 [INF] pgmap v12582765: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 19230KB/s rd, 18901KB/s wr, 3199op/s 2014-09-05
>>>>>>>>>> 10:44:36.213535 mon.0 [INF] pgmap v12582766: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 17630KB/s rd, 18855KB/s wr, 3131op/s 2014-09-05
>>>>>>>>>> 10:44:37.220052 mon.0 [INF] pgmap v12582767: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 12262KB/s rd, 18627KB/s wr, 2595op/s 2014-09-05
>>>>>>>>>> 10:44:38.233357 mon.0 [INF] pgmap v12582768: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 17697KB/s rd, 17572KB/s wr, 2156op/s 2014-09-05
>>>>>>>>>> 10:44:39.239409 mon.0 [INF] pgmap v12582769: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 20300KB/s rd, 19735KB/s wr, 2197op/s 2014-09-05
>>>>>>>>>> 10:44:40.260423 mon.0 [INF] pgmap v12582770: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 14656KB/s rd, 15460KB/s wr, 2199op/s 2014-09-05
>>>>>>>>>> 10:44:41.269736 mon.0 [INF] pgmap v12582771: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 8969KB/s rd, 11918KB/s wr, 1951op/s 2014-09-05
>>>>>>>>>> 10:44:42.276192 mon.0 [INF] pgmap v12582772: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 7272KB/s rd, 10644KB/s wr, 1832op/s 2014-09-05
>>>>>>>>>> 10:44:43.291817 mon.0 [INF] pgmap v12582773: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9316KB/s rd, 16610KB/s wr, 2412op/s 2014-09-05
>>>>>>>>>> 10:44:44.295469 mon.0 [INF] pgmap v12582774: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9257KB/s rd, 19953KB/s wr, 2633op/s 2014-09-05
>>>>>>>>>> 10:44:45.315774 mon.0 [INF] pgmap v12582775: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9718KB/s rd, 14298KB/s wr, 2101op/s 2014-09-05
>>>>>>>>>> 10:44:46.326783 mon.0 [INF] pgmap v12582776: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 20877KB/s rd, 12822KB/s wr, 2447op/s 2014-09-05
>>>>>>>>>> 10:44:47.327537 mon.0 [INF] pgmap v12582777: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 18447KB/s rd, 12945KB/s wr, 2226op/s 2014-09-05
>>>>>>>>>> 10:44:48.348725 mon.0 [INF] pgmap v12582778: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 15514KB/s rd, 19366KB/s wr, 1892op/s 2014-09-05
>>>>>>>>>> 10:44:49.357537 mon.0 [INF] pgmap v12582779: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 16640KB/s rd, 19694KB/s wr, 1887op/s 2014-09-05
>>>>>>>>>> 10:44:50.365016 mon.0 [INF] pgmap v12582780: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9139KB/s rd, 10734KB/s wr, 1566op/s 2014-09-05
>>>>>>>>>> 10:44:51.385553 mon.0 [INF] pgmap v12582781: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 11493KB/s rd, 13827KB/s wr, 1928op/s 2014-09-05
>>>>>>>>>> 10:44:52.389189 mon.0 [INF] pgmap v12582782: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 12922KB/s rd, 12814KB/s wr, 1670op/s 2014-09-05
>>>>>>>>>> 10:44:53.403735 mon.0 [INF] pgmap v12582783: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 14880KB/s rd, 17532KB/s wr, 2145op/s 2014-09-05
>>>>>>>>>> 10:44:54.410581 mon.0 [INF] pgmap v12582784: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 15419KB/s rd, 19696KB/s wr, 2259op/s 2014-09-05
>>>>>>>>>> 10:44:55.420308 mon.0 [INF] pgmap v12582785: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 10964KB/s rd, 15739KB/s wr, 1701op/s 2014-09-05
>>>>>>>>>> 10:44:56.445017 mon.0 [INF] pgmap v12582786: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 16195KB/s rd, 15554KB/s wr, 2444op/s 2014-09-05
>>>>>>>>>> 10:44:57.452136 mon.0 [INF] pgmap v12582787: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 15016KB/s rd, 16356KB/s wr, 2358op/s 2014-09-05
>>>>>>>>>> 10:44:58.465958 mon.0 [INF] pgmap v12582788: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 11668KB/s rd, 18443KB/s wr, 2029op/s 2014-09-05
>>>>>>>>>> 10:44:59.483462 mon.0 [INF] pgmap v12582789: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 12253KB/s rd, 10846KB/s wr, 1529op/s 2014-09-05
>>>>>>>>>> 10:45:00.492322 mon.0 [INF] pgmap v12582790: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 12247KB/s rd, 7084KB/s wr, 1464op/s 2014-09-05
>>>>>>>>>> 10:45:01.516581 mon.0 [INF] pgmap v12582791: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 16460KB/s rd, 12089KB/s wr, 2537op/s 2014-09-05
>>>>>>>>>> 10:45:02.527110 mon.0 [INF] pgmap v12582792: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 13382KB/s rd, 15080KB/s wr, 2563op/s 2014-09-05
>>>>>>>>>> 10:45:03.538090 mon.0 [INF] pgmap v12582793: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 10902KB/s rd, 18745KB/s wr, 2863op/s 2014-09-05
>>>>>>>>>> 10:45:04.558261 mon.0 [INF] pgmap v12582794: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 10850KB/s rd, 15995KB/s wr, 2695op/s 2014-09-05
>>>>>>>>>> 10:45:05.565750 mon.0 [INF] pgmap v12582795: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 9636KB/s rd, 13262KB/s wr, 2372op/s 2014-09-05
>>>>>>>>>> 10:45:06.593984 mon.0 [INF] pgmap v12582796: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 18808KB/s rd, 19329KB/s wr, 3819op/s 2014-09-05
>>>>>>>>>> 10:45:07.595866 mon.0 [INF] pgmap v12582797: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12253 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 21265KB/s rd, 20743KB/s wr, 3861op/s 2014-09-05
>>>>>>>>>> 10:45:08.624949 mon.0 [INF] pgmap v12582798: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 20114KB/s rd, 18543KB/s wr, 3248op/s 2014-09-05
>>>>>>>>>> 10:45:09.627901 mon.0 [INF] pgmap v12582799: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 14717KB/s rd, 15141KB/s wr, 2302op/s 2014-09-05
>>>>>>>>>> 10:45:10.643234 mon.0 [INF] pgmap v12582800: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 8328KB/s rd, 13950KB/s wr, 1919op/s 2014-09-05
>>>>>>>>>> 10:45:11.651602 mon.0 [INF] pgmap v12582801: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 16978KB/s rd, 15921KB/s wr, 3377op/s 2014-09-05
>>>>>>>>>> 10:45:12.674819 mon.0 [INF] pgmap v12582802: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 16471KB/s rd, 14034KB/s wr, 3379op/s 2014-09-05
>>>>>>>>>> 10:45:13.688080 mon.0 [INF] pgmap v12582803: 6860 pgs: 6860
>>>>>>>>>> active+clean; 12254 GB data, 36574 GB used, 142 TB / 178 TB
>>>>>>>>>> avail; 16149KB/s rd, 12657KB/s wr, 2734op/s
>>>>>>>>>> 
>>>>>>>>>> Aye, we actually saw latency on the disks go up a bit when we had
>>>>>>>>>> 128GB of RAM on the OSDs and decided to beef them up to 256GB
>>>>>>>>>> which helped. They?re running different workloads (shared
>>>>>>>>>> hosting) but we?ve never encountered the issue we had yesterday
>>>>>>>>>> even during our testing/benchmarking.
>>>>>>>>>> 
>>>>>>>>>> Kind Regards,
>>>>>>>>>> David
>>>>>>>>>> 
>>>>>>>>>> 5 sep 2014 kl. 09:05 skrev Christian Balzer <chibi at gol.com>:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hello,
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, 5 Sep 2014 08:26:47 +0200 David wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry for the lack of information yesterday, this was "solved"
>>>>>>>>>>>> after some 30 minutes, after having reloaded/restarted all osd
>>>>>>>>>>>> daemons. Unfortunately we couldn?t pin point it to a single
>>>>>>>>>>>> OSD or drive, all drives seemed ok, some had a bit higher
>>>>>>>>>>>> latency and we tried to out / in them to see if it had any
>>>>>>>>>>>> effect which it did not.
>>>>>>>>>>>> 
>>>>>>>>>>> This is odd.
>>>>>>>>>>> Having it "fixed" by restarting all OSDs would suggest either a
>>>>>>>>>>> software problem (bug) with Ceph or some resource other than the
>>>>>>>>>>> storage system being starved. But memory seems unlikely, even
>>>>>>>>>>> with bloated, leaking OSD daemon. And CPU seems even less
>>>>>>>>>>> likely.
>>>>>>>>>>> 
>>>>>>>>>>>> The cluster consists of 3 mon servrers, 5 OSD servers with 10
>>>>>>>>>>>> enterprise HDDs backed with 2 S3700 SSDs for journals each. OSD
>>>>>>>>>>>> servers have 256GB of RAM, 2x E5-2630 v2 @ 2.60GHz CPUs.
>>>>>>>>>>>> 
>>>>>>>>>>>> The log that I posted yesterday was just a small taste of the
>>>>>>>>>>>> full one ;) They were all pointing to different osd?s that
>>>>>>>>>>>> they were waiting for. We?re also monitoring all of the VMs
>>>>>>>>>>>> running on KVM, and we didn?t see any exceptional throughput
>>>>>>>>>>>> or iops usage before or during this event. We were checking
>>>>>>>>>>>> iostat etc and nothing was out of the ordinary..
>>>>>>>>>>>> 
>>>>>>>>>>>> Going to double check SMART and also see if we can off load
>>>>>>>>>>>> some of the cluster in any way. If you have any other advice
>>>>>>>>>>>> that?d be appreciated :)
>>>>>>>>>>>> 
>>>>>>>>>>> If the mon output below is typical for your cluster (634KB/s rd,
>>>>>>>>>>> 487KB/s wr, 90op/s) it should be able to handle that w/o
>>>>>>>>>>> breaking a sweat. I guestimate it at about 13K write IOPS and
>>>>>>>>>>> your reads should (after a while) come mostly from the nice big
>>>>>>>>>>> pagecache you have on those 256GB RAM machines.
>>>>>>>>>>> 
>>>>>>>>>>> Christian
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for your help!
>>>>>>>>>>>> 
>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>> David
>>>>>>>>>>>> 
>>>>>>>>>>>> 5 sep 2014 kl. 07:30 skrev Martin B Nielsen
>>>>>>>>>>>> <martin at unity3d.com>:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Just echoing what Christian said.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Also, iirc the "currently waiting for subobs on [" could also
>>>>>>>>>>>>> mean a problem on those as it waits for ack from them (I might
>>>>>>>>>>>>> remember wrong).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If that is the case you might want to check in on osd 13 & 37
>>>>>>>>>>>>> as well.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> With the cluster load and size you should not have this
>>>>>>>>>>>>> problem; I'm pretty sure you're dealing with a rogue/faulty
>>>>>>>>>>>>> osd/node somewhere.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Martin
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 2:28 AM, Christian Balzer
>>>>>>>>>>>>> <chibi at gol.com> wrote: On Thu, 4 Sep 2014 12:02:13 +0200
>>>>>>>>>>>>> David wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We?re running a ceph cluster with version:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0.67.7-1~bpo70+1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> All of a sudden we?re having issues with the cluster
>>>>>>>>>>>>>> (running RBD images for kvm) with slow requests on all of
>>>>>>>>>>>>>> the OSD servers. Any idea why and how to fix it?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> You give us a Ceph version at least, but for anybody to make
>>>>>>>>>>>>> guesses we need much more information than a log spew.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> How many nodes/OSDs, OS, hardware, OSD details (FS, journals
>>>>>>>>>>>>> on SSDs), etc.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Run atop (in a sufficiently large terminal) on all your nodes,
>>>>>>>>>>>>> see if you can spot a bottleneck, like a disk being at 100%
>>>>>>>>>>>>> all the time with a much higher avio than the others.
>>>>>>>>>>>>> Looking at your logs, I'd pay particular attention to the disk
>>>>>>>>>>>>> holding osd.22.
>>>>>>>>>>>>> A single slow disk can bring a whole large cluster to a crawl.
>>>>>>>>>>>>> If you're using a hardware controller with a battery backed up
>>>>>>>>>>>>> cache, check if that is fine, loss of the battery would switch
>>>>>>>>>>>>> from writeback to writethrough and massively slow down IOPS.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Christian
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2014-09-04 11:56:35.868521 mon.0 [INF] pgmap v12504451: 6860
>>>>>>>>>>>>>> pgs: 6860 active+clean; 12163 GB data, 36308 GB used, 142
>>>>>>>>>>>>>> TB / 178 TB avail; 634KB/s rd, 487KB/s wr, 90op/s 2014-09-04
>>>>>>>>>>>>>> 11:56:29.510270 osd.22 [WRN] 15 slow requests, 1 included
>>>>>>>>>>>>>> below; oldest blocked for
>>>>>>>>>>>>>>> 44.745754 secs 2014-09-04 11:56:29.510276 osd.22 [WRN] slow
>>>>>>>>>>>>>>> request 30.999821 seconds
>>>>>>>>>>>>>> old, received at 2014-09-04 11:55:58.510424:
>>>>>>>>>>>>>> osd_op(client.10731617.0:81868956
>>>>>>>>>>>>>> rbd_data.967e022eb141f2.0000000000000e72 [write 0~4194304]
>>>>>>>>>>>>>> 3.c585cebe e13901) v4 currently waiting for subops from
>>>>>>>>>>>>>> [37,13] 2014-09-04 11:56:30.510528 osd.22 [WRN] 21 slow
>>>>>>>>>>>>>> requests, 6 included below; oldest blocked for > 45.745989
>>>>>>>>>>>>>> secs 2014-09-04 11:56:30.510534 osd.22 [WRN] slow request
>>>>>>>>>>>>>> 30.122555 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.387925: osd_op(client.13425082.0:11962345
>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
>>>>>>>>>>>>>> 1114112~8192] 3.c9e49140 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [13,42] 2014-09-04 11:56:30.510537 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 30.122362 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.388118: osd_op(client.13425082.0:11962352
>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
>>>>>>>>>>>>>> 1134592~4096] 3.c9e49140 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [13,42] 2014-09-04 11:56:30.510541 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 30.122298 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.388182: osd_op(client.13425082.0:11962353
>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.0000000000003753 [stat,write
>>>>>>>>>>>>>> 4046848~8192] 3.c9e49140 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [13,42] 2014-09-04 11:56:30.510544 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 30.121577 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.388903: osd_op(client.13425082.0:11962374
>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write
>>>>>>>>>>>>>> 2527232~4096] 3.cd9a9015 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [45,1] 2014-09-04 11:56:30.510548 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 30.121518 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.388962: osd_op(client.13425082.0:11962375
>>>>>>>>>>>>>> rbd_data.54f24c3d1b58ba.00000000000047f2 [stat,write
>>>>>>>>>>>>>> 3133440~4096] 3.cd9a9015 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [45,1] 2014-09-04 11:56:31.510706 osd.22 [WRN]
>>>>>>>>>>>>>> 26 slow requests, 6 included below; oldest blocked for >
>>>>>>>>>>>>>> 46.746163 secs 2014-09-04 11:56:31.510711 osd.22 [WRN] slow
>>>>>>>>>>>>>> request 31.035418 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475236: osd_op(client.9266625.0:135900595
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2097152~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:31.510715 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.035335 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475319: osd_op(client.9266625.0:135900596
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2162688~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:31.510718 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.035270 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475384: osd_op(client.9266625.0:135900597
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2400256~16384] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:31.510721 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.035093 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475561: osd_op(client.9266625.0:135900598
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2420736~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:31.510724 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.034990 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475664: osd_op(client.9266625.0:135900599
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2428928~12288] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510864 osd.22 [WRN]
>>>>>>>>>>>>>> 28 slow requests, 6 included below; oldest blocked for >
>>>>>>>>>>>>>> 47.746324 secs 2014-09-04 11:56:32.510871 osd.22 [WRN] slow
>>>>>>>>>>>>>> request 32.035026 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475789: osd_op(client.9266625.0:135900600
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2453504~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510876 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 32.034982 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475833: osd_op(client.9266625.0:135900601
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2502656~16384] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510881 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 32.034917 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.475898: osd_op(client.9266625.0:135900602
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2600960~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510885 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 32.034807 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476008: osd_op(client.9266625.0:135900603
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2617344~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.510890 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 32.034756 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476059: osd_op(client.9266625.0:135900604
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2637824~69632] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:32.723329 osd.17 [WRN] 4
>>>>>>>>>>>>>> slow requests, 4 included below; oldest blocked for >
>>>>>>>>>>>>>> 30.605794 secs 2014-09-04 11:56:32.723335 osd.17 [WRN] slow
>>>>>>>>>>>>>> request 30.605794 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:02.117495: osd_op(client.12646411.0:31118135
>>>>>>>>>>>>>> rbd_data.81aaf52eb141f2.0000000000000e4b [stat,write
>>>>>>>>>>>>>> 0~4194304] 3.716b373b e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [31,1] 2014-09-04 11:56:32.723338 osd.17 [WRN]
>>>>>>>>>>>>>> slow request 30.362042 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:02.361247: osd_op(client.12485793.0:121786489
>>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.000000000000e200 [stat,write
>>>>>>>>>>>>>> 2834432~4096] 3.6b43de72 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [5,31] 2014-09-04 11:56:32.723341 osd.17 [WRN]
>>>>>>>>>>>>>> slow request 30.179681 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:02.543608: osd_op(client.13425079.0:7014393
>>>>>>>>>>>>>> rbd_data.54f2522eb141f2.0000000000014d01 [stat,write
>>>>>>>>>>>>>> 2502656~4096] 3.d8ceea89 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [31,18] 2014-09-04 11:56:32.723344 osd.17 [WRN]
>>>>>>>>>>>>>> slow request 30.000568 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:02.722721: osd_op(client.9281279.0:144230325
>>>>>>>>>>>>>> rbd_data.2d99fc2eb141f2.000000000000e80a [stat,write
>>>>>>>>>>>>>> 1589248~4096] 3.16dc0e89 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [31,18] 2014-09-04 11:56:33.511043 osd.22 [WRN]
>>>>>>>>>>>>>> 33 slow requests, 6 included below; oldest blocked for >
>>>>>>>>>>>>>> 48.746486 secs 2014-09-04 11:56:33.511049 osd.22 [WRN] slow
>>>>>>>>>>>>>> request 33.034756 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476221: osd_op(client.9266625.0:135900605
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2715648~8192] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511053 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 33.034620 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476357: osd_op(client.9266625.0:135900606
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2756608~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511056 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 33.034446 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476531: osd_op(client.9266625.0:135900607
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2809856~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511060 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 33.034387 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476590: osd_op(client.9266625.0:135900608
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2883584~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:33.511063 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 33.034341 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476636: osd_op(client.9266625.0:135900609
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2891776~8192] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511206 osd.22 [WRN]
>>>>>>>>>>>>>> 38 slow requests, 6 included below; oldest blocked for >
>>>>>>>>>>>>>> 49.746664 secs 2014-09-04 11:56:34.511215 osd.22 [WRN] slow
>>>>>>>>>>>>>> request 34.034426 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476729: osd_op(client.9266625.0:135900610
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2908160~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511219 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 34.034298 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476857: osd_op(client.9266625.0:135900611
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2932736~28672] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511222 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 34.034234 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.476921: osd_op(client.9266625.0:135900612
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2969600~8192] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511225 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 34.034147 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.477008: osd_op(client.9266625.0:135900613
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2981888~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:34.511228 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 34.034055 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.477100: osd_op(client.9266625.0:135900614
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 2998272~12288] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.153813 osd.37 [WRN] 1
>>>>>>>>>>>>>> slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>> 30.307864 secs 2014-09-04 11:56:35.153820 osd.37 [WRN] slow
>>>>>>>>>>>>>> request 30.307864 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:04.845917: osd_op(client.10785413.0:129409910
>>>>>>>>>>>>>> rbd_data.96803f2eb141f2.00000000000042b0 [stat,write
>>>>>>>>>>>>>> 1712128~4096] 3.6c5ac950 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [22,41] 2014-09-04 11:56:35.511450 osd.22 [WRN]
>>>>>>>>>>>>>> 42 slow requests, 6 included below; oldest blocked for >
>>>>>>>>>>>>>> 50.746638 secs 2014-09-04 11:56:35.511455 osd.22 [WRN] slow
>>>>>>>>>>>>>> request 35.034219 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.477178: osd_op(client.9266625.0:135900615
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 3022848~8192] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.511459 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 35.034143 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.477254: osd_op(client.9266625.0:135900616
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 3035136~12288] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.511462 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 35.034082 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.477315: osd_op(client.9266625.0:135900617
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 3055616~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.511465 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 35.033965 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:00.477432: osd_op(client.9266625.0:135900618
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.000000000000bc00 [stat,write
>>>>>>>>>>>>>> 3063808~4096] 3.a2894ebe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:35.511469 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 34.162275 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:01.349122: osd_op(client.12485646.0:110385143
>>>>>>>>>>>>>> rbd_data.4ccab12eb141f2.0000000000015b66 [stat,write
>>>>>>>>>>>>>> 909312~4096] 3.9c416abe e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [37,13] 2014-09-04 11:56:36.511603 osd.22 [WRN]
>>>>>>>>>>>>>> 34 slow requests, 6 included below; oldest blocked for >
>>>>>>>>>>>>>> 51.746797 secs 2014-09-04 11:56:36.511609 osd.22 [WRN] slow
>>>>>>>>>>>>>> request 34.401476 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:02.110080: osd_op(client.12485643.0:117955489
>>>>>>>>>>>>>> rbd_data.4aad332eb141f2.0000000000002c00 [stat,write
>>>>>>>>>>>>>> 2531328~4096] 3.b985ec92 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [33,1] 2014-09-04 11:56:36.511612 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 34.149511 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:02.362045: osd_op(client.12485793.0:121786505
>>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write
>>>>>>>>>>>>>> 3342336~4096] 3.26907492 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [33,1] 2014-09-04 11:56:36.511615 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 34.329528 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:02.182028: osd_op(client.12646411.0:31118266
>>>>>>>>>>>>>> rbd_data.81aaf52eb141f2.0000000000000e4f [stat,write
>>>>>>>>>>>>>> 0~4194304] 3.168c978d e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [49,29] 2014-09-04 11:56:36.511618 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.871976 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:04.639580: osd_op(client.12485793.0:121786506
>>>>>>>>>>>>>> rbd_data.4ae87f2eb141f2.0000000000018a0a [stat,write 0~4096]
>>>>>>>>>>>>>> 3.26907492 e13901) v4 currently waiting for subops from
>>>>>>>>>>>>>> [33,1] 2014-09-04 11:56:36.511621 osd.22 [WRN] slow request
>>>>>>>>>>>>>> 31.008867 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:05.502689: osd_op(client.9266625.0:135901516
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.0000000000017436 [stat,write
>>>>>>>>>>>>>> 2682880~4096] 3.32067b97 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:36.723882 osd.17 [WRN]
>>>>>>>>>>>>>> 5 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>> 34.606349 secs 2014-09-04 11:56:36.723887 osd.17 [WRN] slow
>>>>>>>>>>>>>> request 30.603057 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:06.120787: osd_op(client.12483520.0:11473870
>>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000004600 [stat,write
>>>>>>>>>>>>>> 2293760~4096] 3.49398689 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [31,18] 2014-09-04 11:56:37.511787 osd.22 [WRN]
>>>>>>>>>>>>>> 24 slow requests, 6 included below; oldest blocked for >
>>>>>>>>>>>>>> 52.746984 secs 2014-09-04 11:56:37.511793 osd.22 [WRN] slow
>>>>>>>>>>>>>> request 32.008986 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:05.502757: osd_op(client.9266625.0:135901517
>>>>>>>>>>>>>> rbd_data.42d6792eb141f2.0000000000017436 [stat,write
>>>>>>>>>>>>>> 2695168~4096] 3.32067b97 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511798 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.749483 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:05.762260: osd_op(client.9281282.0:55728036
>>>>>>>>>>>>>> rbd_data.2d99f62eb141f2.0000000000003753 [stat,write
>>>>>>>>>>>>>> 1630208~8192] 3.6625df97 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511801 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.745558 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:05.766185: osd_op(client.9281282.0:55728038
>>>>>>>>>>>>>> rbd_data.2d99f62eb141f2.0000000000003753 [stat,write
>>>>>>>>>>>>>> 1941504~4096] 3.6625df97 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511803 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.390945 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:06.120798: osd_op(client.12483520.0:11473871
>>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write
>>>>>>>>>>>>>> 2424832~8192] 3.47dbbb97 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [12,29] 2014-09-04 11:56:37.511806 osd.22 [WRN]
>>>>>>>>>>>>>> slow request 31.390357 seconds old, received at 2014-09-04
>>>>>>>>>>>>>> 11:56:06.121386: osd_op(client.12483520.0:11473884
>>>>>>>>>>>>>> rbd_data.4b8e9b3d1b58ba.0000000000005c00 [stat,write
>>>>>>>>>>>>>> 2457600~16384] 3.47dbbb97 e13901) v4 currently waiting for
>>>>>>>>>>>>>> subops from [12,29]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>> ceph-users at lists.ceph.com
>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>>>>>>>> chibi at gol.com           Global OnLine Japan/Fusion
>>>>>>>>>>>>> Communications http://www.gol.com/
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>> ceph-users at lists.ceph.com
>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>> ceph-users at lists.ceph.com
>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>>>>>> chibi at gol.com         Global OnLine Japan/Fusion Communications
>>>>>>>>>>> http://www.gol.com/
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>> ceph-users at lists.ceph.com
>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> ceph-users mailing list
>>>>>>>>>> ceph-users at lists.ceph.com
>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- 
>>>>>>>>>> Luis Periquito
>>>>>>>>>> Unix Engineer
>>>>>>>>>> 
>>>>>>>>>> Ocado.com
>>>>>>>>>> 
>>>>>>>>>> Head Office, Titan Court, 3 Bishop Square, Hatfield Business
>>>>>>>>>> Park, Hatfield, Herts AL10 9NE
>>>>>>>>>> 
>>>>>>>>>> Notice:  This email is confidential and may contain copyright
>>>>>>>>>> material of members of the Ocado Group. Opinions and views
>>>>>>>>>> expressed in this message may not necessarily reflect the
>>>>>>>>>> opinions and views of the members of the Ocado Group.
>>>>>>>>>> 
>>>>>>>>>> If you are not the intended recipient, please notify us
>>>>>>>>>> immediately and delete all copies of this message. Please note
>>>>>>>>>> that it is your responsibility to scan this message for viruses. 
>>>>>>>>>> 
>>>>>>>>>> References to the ?Ocado Group? are to Ocado Group plc
>>>>>>>>>> (registered in England and Wales with number 7098618) and its
>>>>>>>>>> subsidiary undertakings (as that expression is defined in the
>>>>>>>>>> Companies Act 2006) from time to time.  The registered office of
>>>>>>>>>> Ocado Group plc is Titan Court, 3 Bishops Square, Hatfield
>>>>>>>>>> Business Park, Hatfield, Herts. AL10 9NE.
>>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> ceph-users mailing list
>>>>>>>>>> ceph-users at lists.ceph.com
>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Christian Balzer        Network/Systems Engineer                
>>>>>>> chibi at gol.com   	Global OnLine Japan/Fusion Communications
>>>>>>> http://www.gol.com/
>>>>>> 
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users at lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> 
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users at lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>>> 
>> 
>> 
>> -- 
>> Christian Balzer        Network/Systems Engineer                
>> chibi at gol.com   	Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20140906/0cc7b23c/attachment.htm>