> -----Original Message-----
> From: Studziński Krzysztof
> Sent: Wednesday, July 24, 2013 1:18 AM
> To: 'Gregory Farnum'; Yehuda Sadeh
> Cc: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx; Mostowiec Dominik
> Subject: RE: Flapping osd / continuously reported as failed
>
> > -----Original Message-----
> > From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
> > Sent: Wednesday, July 24, 2013 12:28 AM
> > To: Studziński Krzysztof; Yehuda Sadeh
> > Cc: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx; Mostowiec Dominik
> > Subject: Re: Flapping osd / continuously reported as failed
> >
> > On Tue, Jul 23, 2013 at 3:20 PM, Studziński Krzysztof <krzysztof.studzinski@xxxxxxxxxxxx> wrote:
> > >> On Tue, Jul 23, 2013 at 2:50 PM, Studziński Krzysztof <krzysztof.studzinski@xxxxxxxxxxxx> wrote:
> > >> > Hi,
> > >> > We've got a problem with our cluster - it continuously reports one osd as failed, and after auto-rebooting everything seems to work fine for some time (a few minutes). CPU util of this osd is max 8%, iostat is very low. We tried to "ceph osd out" such a flapping osd, but after recovery the behavior returned on a different osd. This osd also has many more read operations than the others (see file osd_reads.png linked at the bottom of the email; at about 16:00 we switched off osd.57 and osd.72 started to misbehave. Osd.108 works while recovering).
> > >> >
> > >> > Extract from ceph.log:
> > >> >
> > >> > 2013-07-23 22:43:57.425839 mon.0 10.177.64.4:6789/0 24690 : [INF] osd.72 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:43:56.298467 osd.72 10.177.64.8:6803/22584 415 : [WRN] map e41730 wrongly marked me down
> > >> > 2013-07-23 22:50:27.572110 mon.0 10.177.64.4:6789/0 25081 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
> > >> > 2013-07-23 22:50:27.595044 mon.0 10.177.64.4:6789/0 25082 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
> > >> > 2013-07-23 22:50:27.611964 mon.0 10.177.64.4:6789/0 25083 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
> > >> > 2013-07-23 22:50:27.612009 mon.0 10.177.64.4:6789/0 25084 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:50:43.611939 >= grace 20.000000)
> > >> > 2013-07-23 22:50:30.367398 7f8adb837700 0 log [WRN] : 3 slow requests, 3 included below; oldest blocked for > 30.688891 secs
> > >> > 2013-07-23 22:50:30.367408 7f8adb837700 0 log [WRN] : slow request 30.688891 seconds old, received at 2013-07-23 22:49:59.678453: osd_op(client.44290048.0:125899 .dir.4168.2 [call rgw.bucket_prepare_op] 3.9447554d) v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367412 7f8adb837700 0 log [WRN] : slow request 30.179044 seconds old, received at 2013-07-23 22:50:00.188300: osd_op(client.44205530.0:189270 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
> > >> > 2013-07-23 22:50:30.367415 7f8adb837700 0 log [WRN] : slow request 30.171968 seconds old, received at 2013-07-23 22:50:00.195376: osd_op(client.44203484.0:192902 .dir.4168.2 [call rgw.bucket_list] 3.9447554d) v4 currently no flag points reached
> > >> > 2013-07-23 22:51:36.082303 mon.0 10.177.64.4:6789/0 25159 : [INF] osd.72 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:51:35.238164 osd.72 10.177.64.8:6803/22584 420 : [WRN] map e41738 wrongly marked me down
> > >> > 2013-07-23 22:52:05.582969 mon.0 10.177.64.4:6789/0 25191 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.20 10.177.64.4:6913/4101
> > >> > 2013-07-23 22:52:05.587388 mon.0 10.177.64.4:6789/0 25192 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.9 10.177.64.4:6946/5124
> > >> > 2013-07-23 22:52:05.610925 mon.0 10.177.64.4:6789/0 25193 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.78 10.177.64.5:6854/5604
> > >> > 2013-07-23 22:52:05.610951 mon.0 10.177.64.4:6789/0 25194 : [INF] osd.72 10.177.64.8:6803/22584 failed (3 reports from 3 peers after 2013-07-23 22:52:20.610895 >= grace 20.000000)
> > >> > 2013-07-23 22:52:05.630821 mon.0 10.177.64.4:6789/0 25195 : [DBG] osd.72 10.177.64.8:6803/22584 reported failed by osd.10 10.177.64.4:6814/26192
> > >> > 2013-07-23 22:53:47.203352 mon.0 10.177.64.4:6789/0 25300 : [INF] osd.72 10.177.64.8:6803/22584 boot
> > >> > 2013-07-23 22:53:46.417106 osd.72 10.177.64.8:6803/22584 474 : [WRN] map e41742 wrongly marked me down
> > >> >
> > >> > Could you please take a look at our config and suggest some improvements?
> > >> > See attached "ceph pg <pg_id> query" for two groups during recovery and parts of our config file.
> > >> > Our cluster's size: 6 hosts, 26 HDDs each, 156 osds, 6488 pgs, mostly in one bucket holding 9M objects, 3342 GB data, 11173 GB used, 31690 GB / 42864 GB avail.
> > >>
> > >> I'm surprised you're running into it at 9M objects, but this is almost certainly the problem. Right now the index for each RGW bucket lives on a single OSD; you're probably having issues with whichever OSD is receiving the bucket index reads. Is it feasible for you to shard the contents into multiple buckets and see if things calm down?
> > >> -Greg
> > >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> > >
> > > I was afraid of that answer. :-/
> > > We'll try to delete some objects to begin with (as they are transformations of other objects and can be re-created when needed) and then try to shard it as you suggest.
> > > What is your opinion about the max number of objects in one bucket?
> >
> > It depends more on how much activity you're throwing at the bucket, I think. We haven't done any large-scale tests and perhaps we need to. Leveldb (which we're using under the covers to store this stuff) should not have any trouble with the amount of data that's in there, but if you're trying to do frequent enough object lookups or puts then you might just saturate the disk/node's ability to keep up. I should mention that Yehuda just started a discussion on handling this in the thread "rgw bucket index".
> >
> > When did you start noticing this trouble?
>
> Most of our objects are images: originals (mostly 5-10MB) and their transformations (compressed, crops, filters, etc., usually <<1MB). We also have plenty of small text files. We mostly read them (objects are cached by nginx outside the cluster), but yesterday (15h ago) we had a pretty large deployment of a new service that could create plenty of transformed images of different sizes. Unfortunately I cannot say right now what the difference in bucket size is between now and yesterday; I will try to check it out.
>
> From my stats I can see that in peaks we had about 1.5K PUT and 8K GET operations per minute. The most common use-case when PUTting an object in our apps is this sequence:
> - HEAD (checking if the transformed object already exists),
> - GET the original,
> - PUT the transformed object and its meta-data.
> We also have some amount of PUTs of new objects, but I think there won't be many by now (it is 1am in Poland).
> Right now there are 80 PUT and 1.5K GET operations per minute and the osd is still flapping.

Additional info about our index growth: the deployment I mentioned in the previous mail didn't influence the speed of index growth. Currently we have 60K new objects daily, and that rate has been almost constant for the last 7 days.

--
Best regards,
Krzysztof Studzinski
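For readers landing on this thread: the HEAD/GET/PUT sequence described in the quoted message maps directly onto the S3 API that RGW exposes. The sketch below shows roughly what that sequence looks like with boto 2.x; the endpoint, credentials, bucket and key names are placeholders, and the transform callable is assumed to exist.

    # Sketch only: the HEAD / GET / PUT sequence from the message above,
    # written against an S3-compatible RGW endpoint with boto 2.x.
    # Endpoint, credentials, bucket and key names are placeholders.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id="ACCESS_KEY",        # placeholder
        aws_secret_access_key="SECRET_KEY",    # placeholder
        host="rgw.example.com",                # placeholder RGW endpoint
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    bucket = conn.get_bucket("images")         # placeholder bucket name

    def get_or_create_transform(original_key, transformed_key, transform):
        # 1. HEAD: check whether the transformed object already exists.
        existing = bucket.get_key(transformed_key)   # boto issues a HEAD here
        if existing is not None:
            return existing
        # 2. GET the original object.
        original = bucket.get_key(original_key)
        data = original.get_contents_as_string()
        # 3. PUT the transformed object together with its metadata.
        result = bucket.new_key(transformed_key)
        result.set_metadata("source", original_key)
        result.set_contents_from_string(transform(data))
        return result

Every PUT in step 3 also updates the bucket index object (named .dir.<bucket_id>; here .dir.4168.2 in the slow-request lines), which is exactly the object the blocked rgw.bucket_prepare_op and rgw.bucket_list calls are queued on, so a write-heavy burst concentrates load on the single OSD holding that index.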
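As for the sharding Greg suggests above: RGW at the time of this thread had no built-in bucket index sharding (that arrived in later releases), so it has to be done client-side by spreading keys over several buckets. A minimal sketch, with a made-up bucket prefix and shard count:

    # Sketch only: deterministic client-side sharding of object keys across
    # several buckets, so that no single bucket index carries all the traffic.
    # The bucket prefix and shard count are made up for illustration.
    import hashlib

    NUM_SHARDS = 16
    BUCKET_PREFIX = "images-"      # shard buckets images-00 .. images-15

    def bucket_for_key(key_name, num_shards=NUM_SHARDS, prefix=BUCKET_PREFIX):
        """Map an object key to the same shard bucket on every call."""
        digest = hashlib.md5(key_name.encode("utf-8")).hexdigest()
        return "%s%02d" % (prefix, int(digest, 16) % num_shards)

    # The same key always resolves to the same bucket, so readers and writers
    # only need to share the hash function; no extra lookup table is required.
    print(bucket_for_key("originals/12345.jpg"))

The trade-offs are the usual ones: existing objects still have to be copied from the old bucket into their shard buckets, and listing "all" objects now means listing every shard.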