Re: Flapping osd / continuously reported as failed

Hi,
We found something else.
After osd.72 flapped, one PG ('3.54d') was recovering for a long time.

--
ceph health detail
HEALTH_WARN 1 pgs recovering; recovery 1/39821745 degraded (0.000%)
pg 3.54d is active+recovering, acting [72,108,23]
recovery 1/39821745 degraded (0.000%)
--

The last down/up flap of osd.72 was at 00:45.
In the logs we found:
2013-07-24 00:45:02.736740 7f8ac1e04700  0 log [INF] : 3.54d deep-scrub ok
After this time everything is OK.

Is it possible that the reason for this OSD flapping was scrubbing?

We have default scrubbing settings (ceph version 0.56.6).
If scrubbing is the trouble-maker, can we make it a little lighter by changing the config?
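For example, would something like this in the [osd] section be the right direction? (Option names and values below are only our guess; we haven't verified them against the 0.56.6 defaults.)

--
[osd]
    ; allow only one scrub at a time on each OSD
    osd max scrubs = 1
    ; do not start new scrubs when the host load is above this
    osd scrub load threshold = 0.5
    ; scrub each PG at most once per day ...
    osd scrub min interval = 86400
    ; ... but force a scrub at least once per week
    osd scrub max interval = 604800
    ; deep scrub (full data read) only once per week
    osd deep scrub interval = 604800
--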

--
Regards
Dominik

-----Original Message-----
From: Studziński Krzysztof 
Sent: Wednesday, July 24, 2013 9:48 AM
To: Gregory Farnum; Yehuda Sadeh
Cc: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx; Mostowiec Dominik
Subject: RE:  Flapping osd / continuously reported as failed

> -----Original Message-----
> From: Studziński Krzysztof
> Sent: Wednesday, July 24, 2013 1:18 AM
> To: 'Gregory Farnum'; Yehuda Sadeh
> Cc: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx; Mostowiec 
> Dominik
> Subject: RE:  Flapping osd / continuously reported as 
> failed
> 
> > -----Original Message-----
> > From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
> > Sent: Wednesday, July 24, 2013 12:28 AM
> > To: Studziński Krzysztof; Yehuda Sadeh
> > Cc: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx; Mostowiec 
> > Dominik
> > Subject: Re:  Flapping osd / continuously reported as 
> > failed
> >
> > On Tue, Jul 23, 2013 at 3:20 PM, Studziński Krzysztof 
> > <krzysztof.studzinski@xxxxxxxxxxxx> wrote:
> > >> On Tue, Jul 23, 2013 at 2:50 PM, Studziński Krzysztof 
> > >> <krzysztof.studzinski@xxxxxxxxxxxx> wrote:
> > >> > Hi,
> > >> > We've got some problem with our cluster - it continuously 
> > >> > reports
> failed
> > >> one osd and after auto-rebooting everything seems to work fine 
> > >> for
> some
> > >> time (few minutes). CPU util of this osd is max 8%, iostat is 
> > >> very low. We
> > tried
> > >> to "ceph osd out" such flapping osd, but after recovering this 
> > >> behavior returned on different osd. This osd has also much more 
> > >> read operations
> > than
> > >> others (see file osd_reads.png linked at the bottom of the email; 
> > >> at
> about
> > >> 16:00 we switched off osd.57 and osd.72 started to misbehave. 
> > >> Osd.108 works while recovering).
> > >> >
> > >> > Extract from ceph.log:
> > >> >
> > >> > 2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : 
> > >> > [WRN]
> > map
> > >> e41730 wrongly marked me down
> > >> > 2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.9 
> > >> 10.177.64.4:6946/5124
> > >> > 2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.78 
> > >> 10.177.64.5:6854/5604
> > >> > 2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.10
> 10.177.64.4:6814/26192
> > >> > 2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 
> > >> 2013-07-23
> > >> 22:50:43.611939 >= grace 20.000000)
> > >> > 2013-07-23 22:50:30.367398 7f8adb837700  0 log [WRN] : 3 slow
> requests,
> > 3
> > >> included below; oldest blocked for > 30.688891 secs
> > >> > 2013-07-23 22:50:30.367408 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.688891 seconds old, received at 2013-07-23 22:49:59.678453:
> > >> osd_op(client.44290048.0:125899 .dir.4168.2 [call
> rgw.bucket_prepare_op]
> > >> 3.9447554d) v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367412 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.179044 seconds old, received at 2013-07-23 22:50:00.188300:
> > >> osd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list]
> > 3.9447554d)
> > >> v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367415 7f8adb837700  0 log [WRN] : slow 
> > >> > request
> > >> 30.171968 seconds old, received at 2013-07-23 22:50:00.195376:
> > >> osd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list]
> > 3.9447554d)
> > >> v4 currently no flag points reached
> > >> > 2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : 
> > >> > [WRN]
> > map
> > >> e41738 wrongly marked me down
> > >> > 2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.20 
> > >> 10.177.64.4:6913/4101
> > >> > 2013-07-23 22:52:05.587388 mon.0 10.177.64.4:6789/0 25192 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.9 
> > >> 10.177.64.4:6946/5124
> > >> > 2013-07-23 22:52:05.610925 mon.0 10.177.64.4:6789/0 25193 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.78 
> > >> 10.177.64.5:6854/5604
> > >> > 2013-07-23 22:52:05.610951 mon.0 10.177.64.4:6789/0 25194 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 
> > >> 2013-07-23
> > >> 22:52:20.610895 >= grace 20.000000)
> > >> > 2013-07-23 22:52:05.630821 mon.0 10.177.64.4:6789/0 25195 : 
> > >> > [DBG]
> > osd.72
> > >> 10.177.64.8:6803/22584 reported failed by osd.10
> 10.177.64.4:6814/26192
> > >> > 2013-07-23 22:53:47.203352 mon.0 10.177.64.4:6789/0 25300 : 
> > >> > [INF]
> > osd.72
> > >> 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:53:46.417106 osd.72 10.177.64.8:6803/22584 474 : 
> > >> > [WRN]
> > map
> > >> e41742 wrongly marked me down
> > >> >
> > >> > Could you please take a look at our config and suggest some
> > >> improvements?
> > >> > See attached "ceph pg <pg_id> query" for two groups during 
> > >> > recovery
> > and
> > >> parts of our config file.
> > >> > Our cluster's size: 6 hosts, 26 HDD each, 156 osds, 6488 pgs, 
> > >> > mostly in
> > one
> > >> bucket having 9M objects, 3342 GB data, 11173 GB used, 31690 GB /
> 42864
> > GB
> > >> avail.
> > >>
> > >> I'm surprised you're running into it at 9m objects but this is 
> > >> almost certainly the problem. Right now the index for each RGW 
> > >> bucket lives on a single OSD; you're probably having issues with 
> > >> whichever OSD is receiving the bucket index reads. Is it feasible 
> > >> for you to shard the contents into multiple buckets and see if things calm down?
> > >> -Greg
> > >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> > >
> > > I was afraid of that answer. :-/
> > > We'll try to delete some objects for the beginning (as they are
> transformed
> > other objects and can be re-created when needed) and then try to 
> > shard it as you suggest.
> > > What is your opinion about max number of objects in one bucket ?
> >
> > It depends more on how much activity you're throwing at the bucket, 
> > I think. We haven't done any large-scale tests and perhaps we need to.
> > Leveldb (which we're using under the covers to store this stuff) 
> > should not have any trouble with the amount of data that's in there, 
> > but if you're trying to do frequent enough object lookups or puts 
> > then you might just saturate the disk/node's ability to keep up. I 
> > should mention that Yehuda just started a discussion on handling 
> > this in the thread " rgw bucket index".
> >
> > When did you start noticing this trouble?
> 
> Most of our objects are images: originals (mostly 5-10MB) and its 
> transformations (compressed, crops, filters, etc, usually <<1MB). We 
> also have plenty of small text files. We mostly read them (objects are 
> cached by nginx outside the cluster), but yesterday (15h ago) we had 
> pretty large deployment of new service that could create plenty of 
> transformed images with different sizes. Unfortunately I cannot say 
> right now what is the difference in bucket size between now and 
> yesterday, I will try to check it out.
> 
> From my stats I can see that in peaks we had about 1.5K PUT and 8K GET 
> operations per minute. Most common use-case while PUTting an object in 
> our apps is this sequence:
> - HEAD (checking if transformed object already exist),
> - GET  the original
> - PUT transformed object and its meta-data We also have some amount of 
> PUTs of new objects, but I think there won't be many by now  (it is 
> 1am in Poland).
> Right now there are 80 PUT and 1.5K GET operations per minute and the 
> osd is still flapping.

Additional info about our index growth: the deployment I mentioned in the previous mail didn't influence the speed of index growth. Currently we get about 60K new objects daily, and that rate has been almost constant for the last 7 days.
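
As for sharding the big bucket as Greg suggested, on the application side we are thinking of something very simple, e.g. hashing the object name into a fixed set of buckets. A rough sketch of the idea only (the bucket prefix and the shard count of 16 are just examples):

--
import hashlib

NUM_SHARDS = 16

def shard_bucket(key, prefix="images"):
    # pick a bucket deterministically from the object name,
    # so every read/write of the same key goes to the same shard
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return "%s-%02d" % (prefix, h % NUM_SHARDS)

# shard_bucket("originals/2013/07/img_1234.jpg") -> one of images-00 .. images-15
--

Does splitting the index this way sound reasonable?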
 --
 Best regards,
 Krzysztof Studzinski

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com