Yep, you have hit bug 11429. At some point, you removed a pool and then
restarted these osds. Due to the original bug, 10617, those osds never
actually removed the pgs in that pool. I'm working on a fix, or you can
manually remove pgs corresponding to pools which no longer exist from the
crashing osds using the ceph-objectstore-tool.
-Sam
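A minimal sketch of that cleanup, with osd.36, the stock data/journal
paths, and a deleted pool id of 7 as placeholders (untested; the tool
needs the OSD stopped, and flag details can vary by release, so check
`ceph-objectstore-tool --help` on your version first):

    # with osd.36 stopped, list the PGs it holds; PG ids have the form
    # <pool>.<hash>, so PGs from deleted pool 7 all start with "7."
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
        --journal-path /var/lib/ceph/osd/ceph-36/journal --op list-pgs

    # remove one stale PG; repeat for every PG from the deleted pool
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
        --journal-path /var/lib/ceph/osd/ceph-36/journal \
        --pgid 7.1a --op remove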
----- Original Message -----
From: "Scott Laird" <scott@xxxxxxxxxxx>
To: "Samuel Just" <sjust@xxxxxxxxxx>
Cc: "Robert LeBlanc" <robert@xxxxxxxxxxxxx>, "'ceph-users@xxxxxxxxxxxxxx' (ceph-users@xxxxxxxxxxxxxx)" <ceph-users@xxxxxxxxxxxxxx>
Sent: Monday, April 20, 2015 6:13:06 AM
Subject: Re: OSDs failing on upgrade from Giant to Hammer

They're kind of big; here are links:

https://dl.dropboxusercontent.com/u/104949139/osdmap
https://dl.dropboxusercontent.com/u/104949139/ceph-osd.36.log

On Sun, Apr 19, 2015 at 8:42 PM Samuel Just <sjust@xxxxxxxxxx> wrote:

> I have a suspicion about what caused this. Can you restart one of the
> problem osds with
>
> debug osd = 20
> debug filestore = 20
> debug ms = 1
>
> and attach the resulting log from startup to crash along with the osdmap
> binary (ceph osd getmap -o <mapfile>).
> -Sam
>
> ----- Original Message -----
> From: "Scott Laird" <scott@xxxxxxxxxxx>
> To: "Robert LeBlanc" <robert@xxxxxxxxxxxxx>
> Cc: "'ceph-users@xxxxxxxxxxxxxx' (ceph-users@xxxxxxxxxxxxxx)" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Sunday, April 19, 2015 6:13:55 PM
> Subject: Re: OSDs failing on upgrade from Giant to Hammer
>
> Nope. Straight from 0.87 to 0.94.1. FWIW, at someone's suggestion, I just
> upgraded the kernel on one of the boxes from 3.14 to 3.18; no improvement.
> Rebooting didn't help, either. Still failing with the same error in the
> logs.
>
> On Sun, Apr 19, 2015 at 2:06 PM Robert LeBlanc <robert@xxxxxxxxxxxxx>
> wrote:
>
> Did you upgrade from 0.92? If you did, did you flush the logs before
> upgrading?
>
> On Sun, Apr 19, 2015 at 1:02 PM, Scott Laird <scott@xxxxxxxxxxx> wrote:
>
> I'm upgrading from Giant to Hammer (0.94.1), and I'm seeing a ton of OSDs
> die (and stay dead) with this error in the logs:
>
> 2015-04-19 11:53:36.796847 7f61fa900900 -1 osd/OSD.h: In function
> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f61fa900900 time
> 2015-04-19 11:53:36.794951
> osd/OSD.h: 716: FAILED assert(ret)
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xbc271b]
> 2: (OSDService::get_map(unsigned int)+0x3f) [0x70923f]
> 3: (OSD::load_pgs()+0x1769) [0x6c35d9]
> 4: (OSD::init()+0x71f) [0x6c4c7f]
> 5: (main()+0x2860) [0x651fc0]
> 6: (__libc_start_main()+0xf5) [0x7f61f7a3fec5]
> 7: /usr/bin/ceph-osd() [0x66aff7]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> This is on a small cluster, with ~40 OSDs on 5 servers running Ubuntu
> 14.04. So far, every single server that I've upgraded has had at least one
> disk that has failed to restart with this error, and one has had several
> disks in this state.
>
> Restarting the OSD after it dies with this doesn't help.
>
> I haven't lost any data through this due to my slow rollout, but it's
> really annoying.
>
> Here are two full logs from OSDs on two different machines:
>
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.25.log
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd.34.log
>
> Any suggestions?
>
> Scott

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
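A minimal sketch of collecting the debug data Sam asked for in the thread
above, assuming a stock Ubuntu layout (the osd id 25 and paths are
placeholders; sysvinit deployments start daemons with
`service ceph start osd.N` instead of the upstart form shown):

    # in /etc/ceph/ceph.conf on the affected host
    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # start the failing OSD again and capture its log up to the crash
    start ceph-osd id=25
    less /var/log/ceph/ceph-osd.25.log

    # dump the current osdmap binary for attaching
    ceph osd getmap -o /tmp/osdmap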