On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> Hi Sage,
>>
>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>
>>>> > So, memory usage of an OSD is usually linear in the number of PGs it
>>>> > hosts. However, that memory can also grow based on at least one other
>>>> > thing: the number of OSD Maps required to go through peering. It
>>>> > *looks* to me like this is what you're running in to, not growth on
>>>> > the number of state machines. In particular, those past_intervals you
>>>> > mentioned. ;)
>>>>
>>>> Hi Greg,
>>>>
>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>
>>>> In practice, that means I'll need to be careful to avoid this situation
>>>> occurring in production - but given that's unlikely to occur except in the
>>>> case of non-trivial neglect, I don't think I need be particularly concerned.
>>>>
>>>> (Happily, I'm in the situation that my existing cluster is purely for
>>>> testing purposes; the data is expendable.)
>>>>
>>>> That said, for my own peace of mind, it would be valuable to have a
>>>> procedure that can be used to recover from this state, even if it's
>>>> unlikely to occur in practice.
>>>
>>> The best luck I've had recovering from situations is something like:
>>>
>>>  - stop all osds
>>>  - osd set nodown
>>>  - osd set nobackfill
>>>  - osd set noup
>>>  - set map cache size smaller to reduce memory footprint.
>>>
>>>    osd map cache size = 50
>>>    osd map max advance = 25
>>>    osd map share max epochs = 25
>>>    osd pg epoch persisted max stale = 25
>
> It can cause extreme slowness if you get into a failure situation and
> your OSDs need to calculate past intervals across more maps than will
> fit in the cache. :(

.. extreme slowness or is it also possible to get into a situation
where the PGs are stuck incomplete forever?

The reason I ask is because we actually had a network issue this
morning that left OSDs flapping and a lot of osdmap epoch churn. Now
our network has stabilized but 10 PGs are incomplete, even though all
the OSDs are up. One PG looks like this, for example:

pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]

1919  3.62000  osd.1919  up  1.00000  1.00000
2329  3.62000  osd.2329  up  1.00000  1.00000
6689  3.62000  osd.6689  up  1.00000  1.00000

The pg query output here: http://pastebin.com/WyTAU69W

Is that a result of these short map caches or could it be something
else? (we're running 0.93-76-gc35f422)

WWGD (what would Greg do?) to activate these PGs?

Thanks!

Dan
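
P.S. For the archives: as far as I understand it, Sage's recovery recipe
above translates into roughly the commands below. This is an untested
sketch: the stop step depends on your init system, the flags are set via
the monitors, and the four map-cache values are just the ones Sage quoted,
dropped into the [osd] section of ceph.conf before the OSDs come back up.

    # on each OSD host, stop the OSD daemons
    # (sysvinit shown; adjust for your init system)
    service ceph stop osd

    # pause state changes cluster-wide while everything is down
    ceph osd set nodown
    ceph osd set noup
    ceph osd set nobackfill

    # in ceph.conf, [osd] section, before restarting the OSDs:
    #   osd map cache size = 50
    #   osd map max advance = 25
    #   osd map share max epochs = 25
    #   osd pg epoch persisted max stale = 25

    # then start the OSDs again and, once peering settles, clear the flags
    ceph osd unset noup
    ceph osd unset nodown
    ceph osd unset nobackfill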