Re: PG stuck peering after host reboot

> On 23 February 2017 at 19:09, george.vasilakakos@xxxxxxxxxx wrote:
> 
> 
> Since we need this pool to work again, we decided to take the data loss and try to move on.
> 
> So far, no luck. We tried a force create but, as expected, with a PG that is not peering this did absolutely nothing.

True, that only works for a stale PG.

> We also tried the rm-past-intervals and remove operations of ceph-objectstore-tool, as well as manually deleting the data directories on the disks. The PG remains down+remapped with two OSDs failing to join the acting set. These have been restarted multiple times to no avail.

So you removed the PG from all the OSDs? 595,1391,240,127,937,362,267,320,986,634,716?
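
For reference, removing it per OSD would look roughly like this (a sketch only, with osd.595 / shard s0 as the example; for an EC pool the shard suffix in the pgid has to match that OSD's rank in the PG):

$ systemctl stop ceph-osd@595
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info --pgid 1.323s0
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op remove --pgid 1.323s0
$ systemctl start ceph-osd@595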

> 
> # ceph pg map 1.323
> osdmap e23122 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [595,1391,240,127,937,362,267,320,986,2147483647,2147483647]
> 
> We have also seen some very odd behaviour. 
> # ceph pg map 1.323
> osdmap e22909 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [595,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647]
> 
> Straight after a restart of all OSDs in the PG and after everything else has settled down. From that state restarting 595 results in:
> 
> # ceph pg map 1.323
> osdmap e22921 pg 1.323 (1.323) -> up [595,1391,240,127,937,362,267,320,986,634,716] acting [2147483647,1391,240,127,937,362,267,320,986,634,716]
> 
> Restarting 595 again doesn't change this. Another restart of all OSDs in the PG results in the state seen above with the last two replaced by ITEM_NONE.
> 
> Another strange thing is that on osd.7 (the one originally at rank 8 that was restarted and caused this problem) the objectstore tool fails to remove the PG and crashes out:
> 
> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op remove --pgid 1.323s8
>  marking collection for removal
> setting '_remove' omap key
> finish_remove_pgs 1.323s8_head removing 1.323s8
>  *** Caught signal (Aborted) **
>  in thread 7fa713782700 thread_name:tp_fstore_op
>  ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
>  1: (()+0x97463a) [0x7fa71c47563a]
>  2: (()+0xf370) [0x7fa71935a370]
>  3: (snappy::RawUncompress(snappy::Source*, char*)+0x374) [0x7fa71abd0cd4]
>  4: (snappy::RawUncompress(char const*, unsigned long, char*)+0x3d) [0x7fa71abd0e2d]
>  5: (leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x35e) [0x7fa71b08007e]
>  6: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x276) [0x7fa71b081196]
>  7: (()+0x3c820) [0x7fa71b083820]
>  8: (()+0x3c9cd) [0x7fa71b0839cd]
>  9: (()+0x3ca3e) [0x7fa71b083a3e]
>  10: (()+0x39c75) [0x7fa71b080c75]
>  11: (()+0x21e20) [0x7fa71b068e20]
>  12: (()+0x223c5) [0x7fa71b0693c5]
>  13: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::seek_to_first(std::string const&)+0x3d) [0x7fa71c3ecb1d]
>  14: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string const&)+0x138) [0x7fa71c3ec028]
>  15: (DBObjectMap::clear_header(std::shared_ptr<DBObjectMap::_Header>, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x1d0) [0x7fa71c400a40]
>  16: (DBObjectMap::_clear(std::shared_ptr<DBObjectMap::_Header>, std::shared_ptr<KeyValueDB::TransactionImpl>)+0xa1) [0x7fa71c401171]
>  17: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x1ff) [0x7fa71c4075bf]
>  18: (FileStore::lfn_unlink(coll_t const&, ghobject_t const&, SequencerPosition const&, bool)+0x241) [0x7fa71c2c0d41]
>  19: (FileStore::_remove(coll_t const&, ghobject_t const&, SequencerPosition const&)+0x8e) [0x7fa71c2c171e]
>  20: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x433e) [0x7fa71c2d8c6e]
>  21: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x7fa71c2db75b]
>  22: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2cd) [0x7fa71c2dba5d]
>  23: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7fa71c63e189]
>  24: (ThreadPool::WorkThread::entry()+0x10) [0x7fa71c63f160]
>  25: (()+0x7dc5) [0x7fa719352dc5]
>  26: (clone()+0x6d) [0x7fa71843e73d]
> Aborted
> 
> At this point all we want to achieve is for the PG to peer again (and soon) without us having to delete the pool.
> 
> Any help would be appreciated...

First off, my EC experience is too limited for me to tell you exactly what is happening here.

What you could do:

- Remove these OSDs from CRUSH
- Wait for recovery to complete
- Stop the OSDs
- Remove their cephx key
- Mark them as lost

At this point PG 1.323 should go into an incomplete or stale state. You should then be able to force re-create it.
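
Something like this, just as a sketch (osd.634 taken as an example, repeat for each OSD you take out):

$ ceph osd crush remove osd.634
(wait for recovery to finish, then on the OSD's host)
$ systemctl stop ceph-osd@634
$ ceph auth del osd.634
$ ceph osd lost 634 --yes-i-really-mean-it

And once the PG is reported as stale/incomplete:

$ ceph pg force_create_pg 1.323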

This worked for me with a replicated pool; I have never tried it with EC.

Afterwards you can re-create these OSDs.
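
On Kraken, re-creating them would be along the lines of (again only a sketch, the device name is an example):

$ ceph osd rm osd.634
$ ceph-disk prepare /dev/sdX
$ ceph-disk activate /dev/sdX1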

Wido

> ________________________________________
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of george.vasilakakos@xxxxxxxxxx [george.vasilakakos@xxxxxxxxxx]
> Sent: 22 February 2017 14:35
> To: wido@xxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  PG stuck peering after host reboot
> 
> So what I see there is this for osd.307:
> 
>     "empty": 1,
>     "dne": 0,
>     "incomplete": 0,
>     "last_epoch_started": 0,
>     "hit_set_history": {
>         "current_last_update": "0'0",
>         "history": []
>     }
> }
> 
> last_epoch_started is 0 and empty is 1. The other OSDs are reporting last_epoch_started 16806 and empty 0.
> 
> I noticed that too and was wondering why it never completed recovery and joined the acting set.
> 
> > If you stop osd.307 and maybe mark it as out, does that help?
> 
> No, I see the same thing I saw when I took 595 out:
> 
> [root@ceph-mon1 ~]# ceph pg map 1.323
> osdmap e22392 pg 1.323 (1.323) -> up [985,1391,240,127,937,362,267,320,7,634,716] acting [2147483647,1391,240,127,937,362,267,320,7,634,716]
> 
> Another OSD gets chosen as the primary but never becomes acting on its own.
> 
> Another 11 PGs are reporting being undersized and having ITEM_NONE in their acting sets as well.
> 
> > ________________________________________
> > From: Wido den Hollander [wido@xxxxxxxx]
> > Sent: 22 February 2017 12:18
> > To: Vasilakakos, George (STFC,RAL,SC); ceph-users@xxxxxxxxxxxxxx
> > Subject: RE:  PG stuck peering after host reboot
> >
> > > On 21 February 2017 at 15:35, george.vasilakakos@xxxxxxxxxx wrote:
> > >
> > >
> > > I have noticed something odd with the ceph-objectstore-tool command:
> > >
> > > It always reports PG X not found even on healthy OSDs/PGs. The 'list' op works on both healthy and unhealthy PGs.
> > >
> >
> > Are you sure you are supplying the correct PG ID?
> >
> > I just tested with (Jewel 10.2.5):
> >
> > $ ceph pg ls-by-osd 5
> > $ systemctl stop ceph-osd@5
> > $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 --op info --pgid 10.d0
> > $ systemctl start ceph-osd@5
> >
> > Can you double-check that?
> >
> > It's weird that the PG can't be found on those OSDs by the tool.
> >
> > Wido
> >
> >
> > > ________________________________________
> > > From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of george.vasilakakos@xxxxxxxxxx [george.vasilakakos@xxxxxxxxxx]
> > > Sent: 21 February 2017 10:17
> > > To: wido@xxxxxxxx; ceph-users@xxxxxxxxxxxxxx; bhubbard@xxxxxxxxxx
> > > Subject: Re:  PG stuck peering after host reboot
> > >
> > > > Can you for the sake of redundancy post your sequence of commands you executed and their output?
> > >
> > > [root@ceph-sn852 ~]# systemctl stop ceph-osd@307
> > > [root@ceph-sn852 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-307 --op info --pgid 1.323
> > > PG '1.323' not found
> > > [root@ceph-sn852 ~]# systemctl start ceph-osd@307
> > >
> > > I did the same thing for 307 (new up but not acting primary) and all the OSDs in the original set (including 595). The output was the exact same. I don't have the whole session log handy from all those sessions but here's a sample from one that's easy to pick out:
> > >
> > > [root@ceph-sn832 ~]# systemctl stop ceph-osd@7
> > > [root@ceph-sn832 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 --op info --pgid 1.323
> > > PG '1.323' not found
> > > [root@ceph-sn832 ~]# systemctl start ceph-osd@7
> > > [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/
> > > 0.18_head/      11.1c8s5_TEMP/  13.3b_head/     1.74s1_TEMP/    2.256s6_head/   2.c3s10_TEMP/   3.b9s4_head/
> > > 0.18_TEMP/      1.16s1_head/    13.3b_TEMP/     1.8bs9_head/    2.256s6_TEMP/   2.c4s3_head/    3.b9s4_TEMP/
> > > 1.106s10_head/  1.16s1_TEMP/    1.3a6s0_head/   1.8bs9_TEMP/    2.2d5s2_head/   2.c4s3_TEMP/    4.34s10_head/
> > > 1.106s10_TEMP/  1.274s5_head/   1.3a6s0_TEMP/   2.174s10_head/  2.2d5s2_TEMP/   2.dbs7_head/    4.34s10_TEMP/
> > > 11.12as10_head/ 1.274s5_TEMP/   1.3e4s9_head/   2.174s10_TEMP/  2.340s8_head/   2.dbs7_TEMP/    commit_op_seq
> > > 11.12as10_TEMP/ 1.2ds8_head/    1.3e4s9_TEMP/   2.1c1s10_head/  2.340s8_TEMP/   3.159s3_head/   meta/
> > > 11.148s2_head/  1.2ds8_TEMP/    14.1a_head/     2.1c1s10_TEMP/  2.36es10_head/  3.159s3_TEMP/   nosnap
> > > 11.148s2_TEMP/  1.323s8_head/   14.1a_TEMP/     2.1d0s6_head/   2.36es10_TEMP/  3.170s1_head/   omap/
> > > 11.165s6_head/  1.323s8_TEMP/   1.6fs9_head/    2.1d0s6_TEMP/   2.3d3s10_head/  3.170s1_TEMP/
> > > 11.165s6_TEMP/  13.32_head/     1.6fs9_TEMP/    2.1efs2_head/   2.3d3s10_TEMP/  3.1aas5_head/
> > > 11.1c8s5_head/  13.32_TEMP/     1.74s1_head/    2.1efs2_TEMP/   2.c3s10_head/   3.1aas5_TEMP/
> > > [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_
> > > 1.323s8_head/ 1.323s8_TEMP/
> > > [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_
> > > DIR_3/ DIR_7/ DIR_B/ DIR_F/
> > > [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_
> > > DIR_0/ DIR_1/ DIR_2/ DIR_3/ DIR_4/ DIR_5/ DIR_6/ DIR_7/ DIR_8/ DIR_9/ DIR_A/ DIR_B/ DIR_C/ DIR_D/ DIR_E/ DIR_F/
> > > [root@ceph-sn832 ~]# ll /var/lib/ceph/osd/ceph-7/current/1.323s8_head/DIR_3/DIR_2/DIR_3/DIR_1/
> > > total 271276
> > > -rw-r--r--. 1 ceph ceph 8388608 Feb  3 22:07 datadisk\srucio\sdata16\u13TeV\s11\sad\sDAOD\uTOPQ4.09383728.\u000436.pool.root.1.0000000000000001__head_2BA91323__1_ffffffffffffffff_8
> > >
> > > > If you run a find in the data directory of the OSD, does that PG show up?
> > >
> > > OSDs 595 (used to be 0), 1391(1), 240(2), 7(7, the one that started this) have a 1.323sX_head directory. OSD 307 does not.
> > > I have not checked the other OSDs in the PG yet.
> > >
> > > Wido
> > >
> > > >
> > > > Best regards,
> > > >
> > > > George
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


