Also, are you certain that all of the OSDs were running the same version?
-Sam
On 03/13/2015 01:42 PM, Samuel Just wrote:
I've opened a bug for this (http://tracker.ceph.com/issues/11110); I
bet it's related to the new logic for allowing recovery below
min_size. Exactly which sha1 was running on the OSDs during this time
period?
-Sam
On 03/13/2015 08:36 AM, Dan van der Ster wrote:
On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
Hi Sage,
Losing a message would have been plausible given the network issue
we had today.
I tried:
# ceph osd pg-temp 75.45 6689
set 75.45 pg_temp mapping to [6689]
Then I waited a bit. It's still incomplete -- the only difference is that now
I see two more past_intervals in the pg. Full query here:
http://pastebin.com/TU7vVLpj
I didn't have debug_osd above zero when I did that. Should I try again
with debug_osd 20?
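(For the record, bumping the debug level on a single OSD without
restarting it can be done on the fly with something like

  ceph tell osd.6689 injectargs '--debug-osd 20'

and set back to '--debug-osd 0' afterwards; that's the stock injectargs
mechanism, nothing specific to this cluster.)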
I tried again with logging. The pg goes like this:
incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
inactive -> peering -> incomplete
The killer seems to be:
2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
remapped+peering] choose_acting no suitable info found (incomplete
backfills?), reverting to up
Full log is here: http://pastebin.com/hZUBD9NT
Do you have an idea what went wrong here? BTW, our firefly "prod"
cluster suffered from the same network problem today, but all of that
cluster's PGs recovered nicely.
Does the hammer RC have different peering logic that might apply here?
Thanks! Dan
Thanks :)
Dan
On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
This looks a bit like the OSDs may have lost a message, actually. You can
kick an individual pg to repeer with something like
ceph osd pg-temp 75.45 6689
See if that makes it go?
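If you want to watch it, something like

  ceph health detail | grep 75.45
  ceph pg 75.45 query

(just the usual status commands) will show whether the pg repeers or
falls back to incomplete.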
sage
On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
Hi Sage,
On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Mon, 9 Feb 2015, David McBride wrote:
On 09/02/15 15:31, Gregory Farnum wrote:
So, memory usage of an OSD is usually linear in the number of PGs it
hosts. However, that memory can also grow based on at least one other
thing: the number of OSD Maps required to go through peering. It
*looks* to me like this is what you're running into, not growth in the
number of state machines. In particular, those past_intervals you
mentioned. ;)
Hi Greg,
Right, that sounds entirely plausible, and is very helpful.
In practice, that means I'll need to be careful to avoid this situation
occurring in production, but given that's unlikely to occur except in
the case of non-trivial neglect, I don't think I need be particularly
concerned. (Happily, I'm in the situation that my existing cluster is
purely for testing purposes; the data is expendable.)
That said, for my own peace of mind, it would be valuable to have a
procedure that can be used to recover from this state, even if it's
unlikely to occur in practice.
The best luck I've had recovering from situations like this is something like:
- stop all osds
- osd set nodown
- osd set nobackfill
- osd set noup
- set map cache size smaller to reduce memory footprint, e.g.:
    osd map cache size = 50
    osd map max advance = 25
    osd map share max epochs = 25
    osd pg epoch persisted max stale = 25
It can cause extreme slowness if you get into a failure situation and
your OSDs need to calculate past intervals across more maps than will
fit in the cache. :(
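Spelled out as commands, the above would look roughly like this (a
sketch, assuming the flags are set from a node with the admin keyring
and the cache settings go into ceph.conf on the OSD hosts; clearing the
flags afterwards is implied rather than stated above, so treat that
part as an assumption):

  # with all ceph-osd daemons stopped, set the flags cluster-wide
  ceph osd set nodown
  ceph osd set nobackfill
  ceph osd set noup

  # shrink the map caches in ceph.conf under [osd] before restarting
  [osd]
      osd map cache size = 50
      osd map max advance = 25
      osd map share max epochs = 25
      osd pg epoch persisted max stale = 25

  # once the OSDs are back up and peering has settled, clear the flags
  # (assumed follow-up, not part of the list above)
  ceph osd unset noup
  ceph osd unset nodown
  ceph osd unset nobackfill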
... extreme slowness, or is it also possible to get into a situation
where the PGs are stuck incomplete forever?
The reason I ask is that we actually had a network issue this morning
that left OSDs flapping and caused a lot of osdmap epoch churn. Our
network has now stabilized, but 10 PGs are incomplete even though all
the OSDs are up. One PG looks like this, for example:
pg 75.45 is stuck inactive for 87351.077529, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is stuck unclean for 87351.096198, current state incomplete, last acting [6689,1919,2329]
pg 75.45 is incomplete, acting [6689,1919,2329]

1919 3.62000 osd.1919 up 1.00000 1.00000
2329 3.62000 osd.2329 up 1.00000 1.00000
6689 3.62000 osd.6689 up 1.00000 1.00000
The pg query output here: http://pastebin.com/WyTAU69W
Is that a result of these short map caches or could it be something
else? (we're running 0.93-76-gc35f422)
WWGD (what would Greg do?) to activate these PGs?
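For completeness, the other nine show up the same way; something like

  ceph pg dump_stuck inactive
  ceph pg dump_stuck unclean

(standard status commands) lists all of the stuck PGs, and
ceph pg <pgid> query dumps the peering details for any one of them.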
Thanks! Dan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html