Yep, you have hit bug 11429. At some point, you removed a pool and then
restarted these osds. Due to the original bug, 10617, those osds never
actually removed the pgs in that pool. I'm working on a fix, or you can
manually remove pgs corresponding to pools which no longer exist from the
crashing osds using the ceph-objectstore-tool.
-Sam
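A minimal sketch of that cleanup, with osd.36, the stock data/journal
paths, and a deleted pool id of 7 as placeholders (untested; the tool
needs the OSD stopped, and flag details can vary by release, so check
`ceph-objectstore-tool --help` on your version first):

    # with osd.36 stopped, list the PGs it holds; PG ids have the form
    # <pool>.<hash>, so PGs from deleted pool 7 all start with "7."
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
        --journal-path /var/lib/ceph/osd/ceph-36/journal --op list-pgs

    # remove one stale PG; repeat for every PG from the deleted pool
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
        --journal-path /var/lib/ceph/osd/ceph-36/journal \
        --pgid 7.1a --op remove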
----- Original Message -----
From: "Scott Laird" <scott@xxxxxxxxxxx>
To: "Samuel Just" <sjust@xxxxxxxxxx>
Cc: "Robert LeBlanc" <robert@xxxxxxxxxxxxx>, "'ceph-users@xxxxxxxxxxxxxx' (ceph-users@xxxxxxxxxxxxxx)" <ceph-users@xxxxxxxxxxxxxx>
Sent: Monday, April 20, 2015 6:13:06 AM
Subject: Re: OSDs failing on upgrade from Giant to Hammer

They're kind of big; here are links:

https://dl.dropboxusercontent.com/u/104949139/osdmap
https://dl.dropboxusercontent.com/u/104949139/ceph-osd.36.log

On Sun, Apr 19, 2015 at 8:42 PM Samuel Just <sjust@xxxxxxxxxx> wrote:

> I have a suspicion about what caused this. Can you restart one of the
> problem osds with
>
> debug osd = 20
> debug filestore = 20
> debug ms = 1
>
> and attach the resulting log from startup to crash along with the osdmap
> binary (ceph osd getmap -o <mapfile>).
> -Sam
>
> ----- Original Message -----
> From: "Scott Laird" <scott@xxxxxxxxxxx>
> To: "Robert LeBlanc" <robert@xxxxxxxxxxxxx>
> Cc: "'ceph-users@xxxxxxxxxxxxxx' (ceph-users@xxxxxxxxxxxxxx)" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Sunday, April 19, 2015 6:13:55 PM
> Subject: Re: OSDs failing on upgrade from Giant to Hammer
>
> Nope. Straight from 0.87 to 0.94.1. FWIW, at someone's suggestion, I just
> upgraded the kernel on one of the boxes from 3.14 to 3.18; no improvement.
> Rebooting didn't help, either. Still failing with the same error in the
> logs.
>
> On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc <robert@xxxxxxxxxxxxx>
> wrote:
>
> Did you upgrade from 0.92? If you did, did you flush the logs before
> upgrading?
>
> On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird <scott@xxxxxxxxxxx> wrote:
>
> I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs
> die (and stay dead) with this error in the logs:
>
> 2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function
> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time
> 2015-04-19 11:53:36.794951
> osd/OSD.h: 716: FAILED assert(ret)
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xbc271b]
> 2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f]
> 3: (OSD::load_pgs()+0x1769) [0x6c35d9]
> 4: (OSD::init()+0x71f) [0x6c4c7f]
> 5: (main()+0x2860) [0x651fc0]
> 6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5]
> 7: /usr/bin/ceph-osd() [0x66aff7]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu
> 14.04. So far, every single server that I've upgraded has had at least one
> disk that has failed to restart with this error, and one has had several
> disks in this state.
>
> Restarting the OSD after it dies with this doesn't help.
>
> I haven't lost any data through this due to my slow rollout, but it's
> really annoying.
>
> Here are two full logs from OSDs on two different machines:
>
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log
>
> Any suggestions?
>
> Scott

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
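A minimal sketch of collecting the debug data Sam asked for in the thread
above, assuming a stock Ubuntu layout (the osd id 25 and paths are
placeholders; sysvinit deployments start daemons with
`service ceph start osd.N` instead of the upstart form shown):

    # in /etc/ceph/ceph.conf on the affected host
    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # start the failing OSD again and capture its log up to the crash
    start ceph-osd id=25
    less /var/log/ceph/ceph-osd.25.log

    # dump the current osdmap binary for attaching
    ceph osd getmap -o /tmp/osdmap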