Yup, all running 0.93-76-gc35f422 (from gitbuilder just after Sage
merged the latest straw2 fix...). I just uploaded the ceph.log to help
understand the issue. Let me know if I can help further :)

Thanks!
Dan
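(For reference, the re-peer test discussed in the quoted thread below
boils down to something like the following. The pg id 75.45, primary
osd.6689 and debug level 20 are the ones from this incident; the
injectargs call is just one way to raise the logging level and is shown
here as an illustration, not necessarily how it was done:)

  # raise OSD debug logging on the acting primary so peering shows up in its log
  ceph tell osd.6689 injectargs '--debug_osd 20'

  # force the pg to re-peer by pinning an explicit pg_temp mapping
  ceph osd pg-temp 75.45 6689

  # inspect the state and past_intervals the pg settles into
  ceph pg 75.45 query

  # remember to lower debug_osd again once the log has been captured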
On Fri, Mar 13, 2015 at 9:53 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Also, are you certain that all were running the same version?
> -Sam
>
> On 03/13/2015 01:42 PM, Samuel Just wrote:
>>
>> I've opened a bug for this (http://tracker.ceph.com/issues/11110), I bet
>> it's related to the new logic for allowing recovery below min_size.
>> Exactly what sha1 was running on the osds during this time period?
>> -Sam
>>
>> On 03/13/2015 08:36 AM, Dan van der Ster wrote:
>>>
>>> On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi Sage,
>>>>
>>>> Losing a message would have been plausible given the network issue we
>>>> had today.
>>>>
>>>> I tried:
>>>>
>>>> # ceph osd pg-temp 75.45 6689
>>>> set 75.45 pg_temp mapping to [6689]
>>>>
>>>> then waited a bit. It's still incomplete -- the only difference is now
>>>> I see two more past_intervals in the pg. Full query here:
>>>> http://pastebin.com/TU7vVLpj
>>>>
>>>> I didn't have debug_osd above zero when I did that. Should I try again
>>>> with debug_osd 20?
>>>
>>> I tried again with logging. The pg goes like this:
>>>
>>> incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
>>> inactive -> peering -> incomplete
>>>
>>> The killer seems to be:
>>>
>>> 2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
>>> pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
>>> ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
>>> r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
>>> remapped+peering] choose_acting no suitable info found (incomplete
>>> backfills?), reverting to up
>>>
>>> Full log is here: http://pastebin.com/hZUBD9NT
>>>
>>> Do you have an idea what went wrong here? BTW, our firefly "prod"
>>> cluster suffered from the same network problem today, but all of that
>>> cluster's PGs recovered nicely. Does the hammer RC have different
>>> peering logic that might apply here?
>>>
>>> Thanks! Dan
>>>
>>>> Thanks :)
>>>>
>>>> Dan
>>>>
>>>> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> This looks a bit like the osds may have lost a message, actually.
>>>>> You can kick an individual pg to repeer with something like
>>>>>
>>>>> ceph osd pg-temp 75.45 6689
>>>>>
>>>>> See if that makes it go?
>>>>>
>>>>> sage
>>>>>
>>>>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi Sage,
>>>>>>>>
>>>>>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>>>>
>>>>>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>>>>>
>>>>>>>>>>> So, memory usage of an OSD is usually linear in the number of PGs it
>>>>>>>>>>> hosts. However, that memory can also grow based on at least one other
>>>>>>>>>>> thing: the number of OSD Maps required to go through peering. It
>>>>>>>>>>> *looks* to me like this is what you're running into, not growth in
>>>>>>>>>>> the number of state machines. In particular, those past_intervals you
>>>>>>>>>>> mentioned. ;)
>>>>>>>>>>
>>>>>>>>>> Hi Greg,
>>>>>>>>>>
>>>>>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>>>>>
>>>>>>>>>> In practice, that means I'll need to be careful to avoid this situation
>>>>>>>>>> occurring in production -- but given that's unlikely to occur except in
>>>>>>>>>> the case of non-trivial neglect, I don't think I need be particularly
>>>>>>>>>> concerned.
>>>>>>>>>>
>>>>>>>>>> (Happily, I'm in the situation that my existing cluster is purely for
>>>>>>>>>> testing purposes; the data is expendable.)
>>>>>>>>>>
>>>>>>>>>> That said, for my own peace of mind, it would be valuable to have a
>>>>>>>>>> procedure that can be used to recover from this state, even if it's
>>>>>>>>>> unlikely to occur in practice.
>>>>>>>>>
>>>>>>>>> The best luck I've had recovering from situations like this is something
>>>>>>>>> like:
>>>>>>>>>
>>>>>>>>> - stop all osds
>>>>>>>>> - osd set nodown
>>>>>>>>> - osd set nobackfill
>>>>>>>>> - osd set noup
>>>>>>>>> - set map cache size smaller to reduce memory footprint:
>>>>>>>>>
>>>>>>>>>    osd map cache size = 50
>>>>>>>>>    osd map max advance = 25
>>>>>>>>>    osd map share max epochs = 25
>>>>>>>>>    osd pg epoch persisted max stale = 25
>>>>>>>
>>>>>>> It can cause extreme slowness if you get into a failure situation and
>>>>>>> your OSDs need to calculate past intervals across more maps than will
>>>>>>> fit in the cache. :(
>>>>>>
>>>>>> ... extreme slowness, or is it also possible to get into a situation
>>>>>> where the PGs are stuck incomplete forever?
>>>>>>
>>>>>> The reason I ask is because we actually had a network issue this
>>>>>> morning that left OSDs flapping and a lot of osdmap epoch churn. Now
>>>>>> our network has stabilized but 10 PGs are incomplete, even though all
>>>>>> the OSDs are up. One PG looks like this, for example:
>>>>>>
>>>>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>>>>> last acting [6689,1919,2329]
>>>>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>>>>> last acting [6689,1919,2329]
>>>>>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>>>>>
>>>>>> 1919  3.62000  osd.1919  up  1.00000  1.00000
>>>>>> 2329  3.62000  osd.2329  up  1.00000  1.00000
>>>>>> 6689  3.62000  osd.6689  up  1.00000  1.00000
>>>>>>
>>>>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>>>>
>>>>>> Is that a result of these short map caches or could it be something
>>>>>> else? (we're running 0.93-76-gc35f422)
>>>>>> WWGD (what would Greg do?) to activate these PGs?
>>>>>>
>>>>>> Thanks!
>>>>>> Dan
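(For reference, the recovery sequence Sage describes in the older
thread quoted above amounts to roughly the following. The flag names
and config values are the ones quoted there; the restart and unset
steps at the end are an assumption about how one would finish, not
something spelled out in the thread:)

  # with all osds stopped, stop the mons from reacting while you work
  ceph osd set nodown
  ceph osd set nobackfill
  ceph osd set noup

  # in ceph.conf on the osd hosts, shrink the osdmap cache to cap memory use
  [osd]
      osd map cache size = 50
      osd map max advance = 25
      osd map share max epochs = 25
      osd pg epoch persisted max stale = 25

  # then start the osds; unset noup so they can come up, and clear the
  # remaining flags once peering settles
  ceph osd unset noup
  ceph osd unset nodown
  ceph osd unset nobackfill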
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html