Hello,

Sorry for the thread necromancy, but this just happened again.

Still the exact same cluster as in the original thread (0.80.7). Same OSD,
same behavior: slow requests that never returned, and any new requests to
that OSD also went into that state until the OSD was restarted, causing of
course all clients (VMs) that dealt with that OSD to hang.
Since I was nowhere near a computer at the time I asked a coworker to do
the restart (as this is now a production cluster), thus no historic ops.

However, I _did_ install collectd and graphite, though nothing really
stands out there in the throttle sections. The cluster wasn't particularly
busy at the time, there was definitely no CPU or memory pressure, and
nothing is wrong with the HW.

What did catch my eye is the gauge-osd_numpg: after the restart of osd.8
it dropped from 122 to 121 and, most alarmingly (to me at least), none of
the other OSDs increased their numpg count... Where did that PG go?
The cluster is healthy and shows all 1024 PGs, and a scrub later that
night showed no problems either.

If anybody from the Ceph team would like access to my graphite server
while the data is not expired, give me a holler.

Given that it was the same OSD as last time and I'm not seeing anything in
the 0.80.8 changelog that looks like this bug, would a "Windows solution"
maybe be the way forward, as in deleting and re-creating that OSD?
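
Should this happen a third time, I intend to capture the OSD state via the
admin socket before anything gets restarted. Something along these lines
ought to do it (run on the node hosting the OSD, assuming the default
socket setup and that osd.8 is again the one blocking; I haven't had to
run this in anger yet):

---
ceph daemon osd.8 dump_ops_in_flight
ceph daemon osd.8 dump_historic_ops
ceph daemon osd.8 perf dump
---

And to chase the "lost" PG, a before/after snapshot of the PG-to-OSD
mapping should show whether a PG actually moved or the numpg gauge is
simply misreporting:

---
ceph pg dump pgs_brief > pg_map.before
# restart osd.8, wait for HEALTH_OK
ceph pg dump pgs_brief > pg_map.after
diff pg_map.before pg_map.after
ceph pg map 3.235     # spot-check one of the PGs from the log excerpts below
---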

Christian

On Tue, 9 Dec 2014 13:51:39 +0900 Christian Balzer wrote:

> On Mon, 8 Dec 2014 20:36:17 -0800 Gregory Farnum wrote:
>
> > They never fixed themselves?
> As I wrote, it took a restart of OSD 8 to resolve this on the next day.
>
> > Did the reported times ever increase?
> Indeed, the last before the reboot was:
> ---
> 2014-12-07 13:12:42.933396 7fceac82f700  0 log [WRN] : 14 slow requests,
> 5 included below; oldest blocked for > 64336.578995 secs
> ---
>
> All IOPS hitting that osd.8 (eventually the other VM did as well during a
> log write I suppose) were blocked.
>
> > If not I think that's just a reporting bug which is fixed in an
> > unreleased branch, but I'd have to check the tracker to be sure.
> >
> > On Mon, Dec 8, 2014 at 8:23 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > On Mon, 8 Dec 2014 19:51:00 -0800 Gregory Farnum wrote:
> > >
> > >> On Mon, Dec 8, 2014 at 6:39 PM, Christian Balzer <chibi@xxxxxxx>
> > >> wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > Debian Jessie cluster, thus kernel 3.16, ceph 0.80.7.
> > >> > 3 storage nodes with 8 OSDs (journals on 4 SSDs) each, 3 mons.
> > >> > 2 compute nodes, everything connected via Infiniband.
> > >> >
> > >> > This is pre-production, currently there are only 3 VMs and 2 of
> > >> > them were idle at the time. The non-idle one was having 600GB of
> > >> > maildirs copied onto it, which stresses things but not Ceph as
> > >> > those millions of small files coalesce nicely and result in
> > >> > rather few Ceph ops.
> > >> >
> > >> > A couple of hours into that copy marathon (the source FS and
> > >> > machine are slow and rsync isn't particularly speedy with this
> > >> > kind of operation either) this happened:
> > >> > ---
> > >> > 2014-12-06 19:20:57.023974 osd.23 10.0.8.23:6815/3552 77 : [WRN]
> > >> > slow request 30.673939 seconds old, received at 2014-12-06
> > >> > 19:20:26.346746: osd_op(client.33776.0:743596
> > >> > rb.0.819b.238e1f29.00000003f52f [set-alloc-hint object_size
> > >> > 4194304 write_size 4194304,write 1748992~4096] 3.efa97e35
> > >> > ack+ondisk+write e380) v4 currently waiting for subops from 4,8
> > >> > 2014-12-06 19:20:57.023991 osd.23 10.0.8.23:6815/3552 78 : [WRN]
> > >> > slow request 30.673886 seconds old, received at 2014-12-06
> > >> > 19:20:26.346799: osd_op(client.33776.0:743597
> > >> > rb.0.819b.238e1f29.00000003f52f [set-alloc-hint object_size
> > >> > 4194304 write_size 4194304,write 1945600~4096] 3.efa97e35
> > >> > ack+ondisk+write e380) v4 currently waiting for subops from 4,8
> > >> > 2014-12-06 19:20:57.323976 osd.1 10.0.8.21:6815/4868 123 : [WRN]
> > >> > slow request 30.910821 seconds old, received at 2014-12-06
> > >> > 19:20:26.413051: osd_op(client.33776.0:743604
> > >> > rb.0.819b.238e1f29.00000003e628 [set-alloc-hint object_size
> > >> > 4194304 write_size 4194304,write 1794048~1835008] 3.5e76b8ba
> > >> > ack+ondisk+write e380) v4 currently waiting for subops from 8,17
> > >> > ---
> > >> >
> > >> > There were a few more later, but they all involved OSD 8 as the
> > >> > common factor.
> > >> >
> > >> > Alas there's nothing in the osd-8.log indicating why:
> > >> > ---
> > >> > 2014-12-06 19:13:13.933636 7fce85552700  0 -- 10.0.8.22:6835/5389
> > >> > >> 10.0.8.6:0/716350435 pipe(0x7fcec3c25900 sd=23 :6835 s=0 pgs=0
> > >> > cs=0 l=0 c=0x7fcebfad03c0).accept peer addr is really
> > >> > 10.0.8.6:0/716350435 (socket is 10.0.8.6:50592/0)
> > >> > 2014-12-06 19:20:56.595773 7fceac82f700  0 log [WRN] : 3 slow
> > >> > requests, 3 included below; oldest blocked for > 30.241397 secs
> > >> > 2014-12-06 19:20:56.595796 7fceac82f700  0 log [WRN] : slow request
> > >> > 30.241397 seconds old, received at 2014-12-06 19:20:26.354247:
> > >> > osd_sub_op(client.33776.0:743596 3.235
> > >> > efa97e35/rb.0.819b.238e1f29.00000003f52f/head//3 [] v 380'3783
> > >> > snapset=0=[]:[] snapc=0=[]) v11 currently started
> > >> > 2014-12-06 19:20:56.595825 7fceac82f700  0 log [WRN] : slow request
> > >> > 30.240286 seconds old, received at 2014-12-06 19:20:26.355358:
> > >> > osd_sub_op(client.33776.0:743597 3.235
> > >> > efa97e35/rb.0.819b.238e1f29.00000003f52f/head//3 [] v 380'3784
> > >> > snapset=0=[]:[] snapc=0=[]) v11 currently started
> > >> > 2014-12-06 19:20:56.595837 7fceac82f700  0 log [WRN] : slow request
> > >> > 30.177186 seconds old, received at 2014-12-06 19:20:26.418458:
> > >> > osd_sub_op(client.33776.0:743604 3.ba
> > >> > 5e76b8ba/rb.0.819b.238e1f29.00000003e628/head//3 [] v 380'6439
> > >> > snapset=0=[]:[] snapc=0=[]) v11 currently started
> > >> > ---
> > >>
> > >> That these are started and nothing else suggests that they're
> > >> probably waiting for one of the throttles to let them in, rather
> > >> than themselves being particularly slow.
> > >>
> > >
> > > If this was indeed caused by one of the (rather numerous) throttles,
> > > wouldn't it be a good idea to log that fact?
> > > A slow disk is one thing, Ceph permanently seizing up because
> > > something exceeded a threshold sounds noteworthy to me.
> >
> > If it permanently seized up then this is not what happened;
> So am I looking at an unknown bug then?
>
> > if the
> > reporting just didn't go away then I'm not sure it's appropriate to
> > log every time a throttle gets hit (some of them are supposed to or at
> > least expected to be, since they keep the journal from running away
> > from the backing store).
> >
> Fair enough, but I'd suppose that anything causing a WRN log entry which
> in turn is caused by a throttle triggering should have that root cause
> included.
>
> Christian
>
> >
> > >
> > >> >
> > >> > The HDDs and SSDs are new, there's nothing in the pertinent logs
> > >> > or smart that indicates any problem with that HDD or its journal
> > >> > SSD, nor with the system in general.
> > >> > This problem persisted (and the VM remained stuck) until OSD 8 was
> > >> > restarted the next day when I discovered this.
> > >> >
> > >> > I suppose this is another "this can't/shouldn't happen" case, but
> > >> > I'd be delighted about any suggestions as to what happened here,
> > >> > potential prevention measures, and any insights on how to maybe
> > >> > coax more information out of Ceph if this happens again.
> > >>
> > >> Nah, there are a million reasons stuff can be slow.
> > > This wasn't just slow. Those requests never completed even after
> > > half a day had passed with the system and disks being basically idle.
> > >
> > >> It might just be a
> > >> transient overload of the disk compared to the others.
> > >
> > > Transient would be fine (though highly unlikely in this scenario),
> > > however it never recovered, see above.
> > >
> > >
> > >> If you see this
> > >> again while it's happening I'd check the perfcounters; if you're
> > >> keeping historical checks of them go look at the blocked-up times
> > >> and see if any of them are at or near their maximum values.
> > >> -Greg
> > >>
> > > Ah, I should have thought of that before re-starting that OSD.
> > > dump_historic_ops is at the default values, so that information is
> > > long gone.
> > >
> > > Regards,
> > >
> > > Christian
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > > http://www.gol.com/
> >
>
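
P.S.: Regarding Greg's suggestion above to look at the perfcounters: the
throttle sections I see in graphite presumably come from the same "perf
dump" data collectd pulls off the admin socket, and on a wedged OSD I'd
now eyeball them by hand with something like the below (assuming the
default socket location under /var/run/ceph and a python2 on the box;
untested as typed). Any throttle whose val sits pinned at its max while
requests are stuck would be the smoking gun.

---
for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $sock =="
    ceph --admin-daemon "$sock" perf dump | python -c '
import json, sys
perf = json.load(sys.stdin)
for section, counters in sorted(perf.items()):
    if not section.startswith("throttle-"):
        continue
    # a throttle sitting at its maximum is the interesting one
    print "%-45s val=%-12s max=%s" % (section, counters.get("val"), counters.get("max"))
'
done
---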

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com