Hi folks,

I have just made a tracker for this issue: http://tracker.ceph.com/issues/18960

I used ceph-post-file to upload some logs from the primary OSD for the troubled PG.

Any help would be appreciated. If we can't get the PG to peer, we'd like to at least get it unstuck, even if that means data loss. What's the proper way to go about doing that?

Best regards,

George
________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of george.vasilakakos@xxxxxxxxxx [george.vasilakakos@xxxxxxxxxx]
Sent: 14 February 2017 10:27
To: bhubbard@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
Subject: Re: PG stuck peering after host reboot

Hi Brad,

I'll be doing so later in the day.

Thanks,

George
________________________________________
From: Brad Hubbard [bhubbard@xxxxxxxxxx]
Sent: 13 February 2017 22:03
To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
Subject: Re: PG stuck peering after host reboot

I'd suggest creating a tracker and uploading a full debug log from the primary so we can look at this in more detail.

On Mon, Feb 13, 2017 at 9:11 PM, <george.vasilakakos@xxxxxxxxxx> wrote:
> Hi Brad,
>
> I could not tell you that, as `ceph pg 1.323 query` never completes; it just hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard" <bhubbard@xxxxxxxxxx> wrote:
>
> On Thu, Feb 9, 2017 at 3:36 AM, <george.vasilakakos@xxxxxxxxxx> wrote:
> > Hi Corentin,
> >
> > I've tried that. The primary hangs when trying to injectargs, so I set the option in the config file and restarted all the OSDs in the PG. It came up with:
> >
> > pg 1.323 is remapped+peering, acting [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >
> > I still can't query the PG, and there are no error messages in the logs of osd.240.
> > The logs on osd.595 and osd.7 still fill up with the same messages.
>
> So what does "peering_blocked_by_detail" show in that case, since it
> can no longer show "peering_blocked_by_history_les_bound"?
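[For readers puzzled by the 2147483647 entry in the acting set above: it is 0x7fffffff, the placeholder Ceph prints when CRUSH has mapped no OSD to that shard (CRUSH_ITEM_NONE). A minimal sketch of how one might spot such holes in an acting set; the helper is hypothetical, not part of Ceph's tooling:]

```python
# Spot "no OSD mapped" placeholders in an acting set (hypothetical helper).
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647, printed when no OSD is mapped to a shard

def missing_shards(acting):
    """Return the shard indices whose slot has no OSD mapped."""
    return [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

# The acting set from the message above: shard 2 (formerly osd.240) is unmapped.
acting = [595, 1391, 2147483647, 127, 937, 362, 267, 320, 7, 634, 716]
print(missing_shards(acting))  # [2]
```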
> >
> > Regards,
> >
> > George
> > ________________________________
> > From: Corentin Bonneton [list@xxxxxxxx]
> > Sent: 08 February 2017 16:31
> > To: Vasilakakos, George (STFC,RAL,SC)
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: PG stuck peering after host reboot
> >
> > Hello,
> >
> > I have had this case before; I applied the parameter (osd_find_best_info_ignore_history_les) to all the OSDs that had reported blocked queries.
> >
> > --
> > Regards,
> > CEO FEELB | Corentin BONNETON
> > contact@xxxxxxxx
> >
> > On 8 Feb 2017, at 17:17, george.vasilakakos@xxxxxxxxxx wrote:
> >
> > Hi Ceph folks,
> >
> > I have a cluster running Jewel 10.2.5 using a mix of EC and replicated pools.
> >
> > After rebooting a host last night, one PG refuses to complete peering:
> >
> > pg 1.323 is stuck inactive for 73352.498493, current state peering, last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >
> > Restarting OSDs or hosts does nothing to help, and sometimes results in things like this:
> >
> > pg 1.323 is remapped+peering, acting [2147483647,1391,240,127,937,362,267,320,7,634,716]
> >
> > The host that was rebooted is home to osd.7 (rank 8 in the stuck PG). If I go onto it to look at the logs for osd.7, this is what I see:
> >
> > $ tail -f /var/log/ceph/ceph-osd.7.log
> > 2017-02-08 15:41:00.445247 7f5fcc2bd700 0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect
> >
> > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2, the >> indicates the direction of communication. I've traced these to osd.7 (rank 8 in the stuck PG) reaching out to osd.595 (the primary in the stuck PG).
> >
> > Meanwhile, looking at the logs of osd.595, I see this:
> >
> > $ tail -f /var/log/ceph/ceph-osd.595.log
> > 2017-02-08 15:41:15.760708 7f1765673700 0 -- XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
> > 2017-02-08 15:41:20.768844 7f1765673700 0 bad crc in front 1941070384 != exp 3786596716
> >
> > which again shows osd.595 reaching out to osd.7; from what I can gather, the CRC problem is about messaging.
> >
> > Google searching has yielded nothing particularly useful on how to get this unstuck.
> >
> > ceph pg 1.323 query seems to hang forever, but it completed once last night and I noticed this:
> >
> > "peering_blocked_by_detail": [
> >     {
> >         "detail": "peering_blocked_by_history_les_bound"
> >     }
> > ]
> >
> > We have seen this before, and it was cleared by setting osd_find_best_info_ignore_history_les to true for the first two OSDs on the stuck PGs (that was on a 3-replica pool). It hasn't worked in this case, and I suspect the option needs to be set on either a majority of the OSDs, or on at least k OSDs, enough to be able to use their data and ignore history.
> >
> > We would really appreciate any guidance and/or help the community can offer!
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Cheers,
> Brad

--
Cheers,
Brad
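[A closing note on the `ceph pg <pgid> query` output quoted above: when the query does complete, the blocking reason can be pulled out programmatically rather than by eyeballing the JSON. A sketch, assuming the `peering_blocked_by_detail` array appears inside one of the `recovery_state` entries, as the fragment George quoted suggests; the sample document below is constructed for illustration, not real cluster output:]

```python
import json

def blocked_by(query_json):
    """Collect peering-blocked reasons from `ceph pg <pgid> query` JSON output."""
    doc = json.loads(query_json)
    reasons = []
    for state in doc.get("recovery_state", []):
        for entry in state.get("peering_blocked_by_detail", []):
            reasons.append(entry["detail"])
    return reasons

# Illustrative sample shaped like the fragment quoted in the thread.
sample = json.dumps({
    "recovery_state": [
        {
            "name": "Started/Primary/Peering",
            "peering_blocked_by_detail": [
                {"detail": "peering_blocked_by_history_les_bound"}
            ],
        }
    ]
})
print(blocked_by(sample))  # ['peering_blocked_by_history_les_bound']
```

In practice one would feed this the output of `ceph pg 1.323 query -f json`, assuming the query returns at all.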