PG stuck down after OSD failures and recovery

Hi,

We’ve observed the following chain of events on our luminous (12.2.8) cluster:

To summarize: we're running this pool with min_size = 1, size = 2. On Friday we "temporarily" lost osd.52 (its disk changed device letter and the host had to be restarted, which we had planned to do after the weekend); osd.52 was the primary for PG `3.1795`. On Sunday we lost osd.501, which had become primary for that PG in the meantime. Today, after we brought osd.52 back, it tried peering with osd.501 to get the PG log(?), got stuck, and the PG is now marked down:

——8<——8<——
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Down",
            "enter_time": "2019-05-20 09:24:23.907107",
            "comment": "not enough up instances of this PG to go active"
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2019-05-20 09:24:23.907055",
            "past_intervals": [
                {
                    "first": "196378",
                    "last": "196975",
                    "all_participants": [
                        {
                            "osd": 52
                        },
                        {
                            "osd": 448
                        },
                        {
                            "osd": 501
                        },
                        {
                            "osd": 635
                        }
                    ],
                    "intervals": [
                        {
                            "first": "196833",
                            "last": "196834",
                            "acting": "501"
                        },
                        {
                            "first": "196868",
                            "last": "196870",
                            "acting": "448"
                        },
                        {
                            "first": "196871",
                            "last": "196975",
                            "acting": "448,635"
                        }
                    ]
                }
            ],
            "probing_osds": [
                "52",
                "448",
                "635"
            ],
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                501
            ],
            "peering_blocked_by": [
                {
                    "osd": 501,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed"
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2019-05-20 09:24:23.907017"
        }
    ],
——8<——8<——
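
(For reference, the recovery_state dump above is what `ceph pg <pgid> query` prints for the stuck PG; <poolname> below is just a placeholder:)

——8<——8<——
# dump peering/recovery state of the stuck PG
ceph pg 3.1795 query

# double-check the pool's replication settings
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size
——8<——8<——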

For a detailed description, see below:

1. We start with an available PG `3.1795`:

ceph-osd.501.log.2.gz:2019-05-17 15:18:09.518358 7f9a09125700  1 osd.501 pg_epoch: 196830 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196830/196830/196830) [501] r=0 lpr=196830 pi=[196378,196830)/1 luod=0'0 crt=196821'122308 lcod 196821'122307 mlcod 0'0 active] start_peering_interval up [52,501] -> [501], acting [52,501] -> [501], acting_primary 52 -> 501, up_primary 52 -> 501, role 1 -> 0, features acting 4611087853745930235 upacting 4611087853745930235

2. One of the SSDs backing the OSD disappears from the system, bringing down osd.52:

(log from its peer, osd.501)
ceph-osd.501.log.2.gz:2019-05-17 15:18:09.518358 7f9a09125700  1 osd.501 pg_epoch: 196830 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196830/196830/196830) [501] r=0 lpr=196830 pi=[196378,196830)/1 luod=0'0 crt=196821'122308 lcod 196821'122307 mlcod 0'0 active] start_peering_interval up [52,501] -> [501], acting [52,501] -> [501], acting_primary 52 -> 501, up_primary 52 -> 501, role 1 -> 0, features acting 4611087853745930235 upacting 4611087853745930235

3. Ten minutes later osd.52 is marked out and osd.448 is chosen as the new second replica for the PG:

ceph-osd.501.log.2.gz:2019-05-17 15:28:10.308940 7f9a08924700  1 osd.501 pg_epoch: 196832 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196830/196831 n=17 ec=100709/39 lis/c 196830/196378 les/c/f 196831/196379/0 196832/196832/196830) [501,448] r=0 lpr=196832 pi=[196378,196832)/1 luod=0'0 crt=196821'122308 lcod 196821'122307 mlcod 0'0 active] start_peering_interval up [501] -> [501,448], acting [501] -> [501,448], acting_primary 501 -> 501, up_primary 501 -> 501, role 0 -> 0, features acting 4611087853745930235 upacting 4611087853745930235

4. Two days later we lose osd.501, and osd.448 becomes the sole OSD for that PG:

ceph-osd.448.log.1.gz:2019-05-19 13:08:36.018264 7fae789ee700  1 osd.448 pg_epoch: 196852 pg[3.1795( v 196821'122308 (179253'120808,196821'122308] local-lis/les=196835/196836 n=17 ec=100709/39 lis/c 196835/196835 les/c/f 196836/196836/0 196852/196852/196852) [448] r=0 lpr=196852 pi=[196835,196852)/1 luod=0'0 crt=196821'122308 mlcod 0'0 active] start_peering_interval up [501,448] -> [448], acting [501,448] -> [448], acting_primary 501 -> 448, up_primary 501 -> 448, role 1 -> 0, features acting 4611087853745930235 upacting 4611087853745930235

5. Ten minutes later osd.501 is marked out and osd.635 is chosen as the new second replica:

ceph-osd.448.log.1.gz:2019-05-19 13:18:47.587437 7fae781ed700  1 osd.448 pg_epoch: 196867 pg[3.1795( v 196821'122308 (179253'120808,196821'122308] local-lis/les=196852/196853 n=17 ec=100709/39 lis/c 196852/196835 les/c/f 196853/196836/0 196867/196867/196852) [448,635] r=0 lpr=196867 pi=[196835,196867)/1 luod=0'0 crt=196821'122308 mlcod 0'0 active] start_peering_interval up [448] -> [448,635], acting [448] -> [448,635], acting_primary 448 -> 448, up_primary 448 -> 448, role 0 -> 0, features acting 4611087853745930235 upacting 4611087853745930235

6. (I don’t have access to the host running osd.635 right now but I assume it’s the same set of events there)

7. On Monday, osd.52 is brought back online and initially transitions to Primary for that PG:

2019-05-20 09:24:14.639568 7f8be9aa1700  1 osd.52 pg_epoch: 196829 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196378/196378/196378) [52,501] r=0 lpr=196821 crt=196821'122308 lcod 0'0 mlcod 0'0 unknown] state<Start>: transitioning to Primary

but this fails; as it catches up on the map epochs it missed, it ends up as a Stray:

2019-05-20 09:24:14.981968 7f8bea2a2700  1 osd.52 pg_epoch: 196830 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196830/196830/196830) [501] r=-1 lpr=196830 pi=[196378,196830)/1 crt=196821'122308 lcod 0'0 unknown] start_peering_interval up [52,501] -> [501], acting [52,501] -> [501], acting_primary 52 -> 501, up_primary 52 -> 501, role 0 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:14.982026 7f8bea2a2700  1 osd.52 pg_epoch: 196832 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196832/196832/196830) [501,448] r=-1 lpr=196832 pi=[196378,196832)/1 crt=196821'122308 lcod 0'0 unknown NOTIFY] start_peering_interval up [501] -> [501,448], acting [501] -> [501,448], acting_primary 501 -> 501, up_primary 501 -> 501, role -1 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:14.982063 7f8bea2a2700  1 osd.52 pg_epoch: 196833 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196832/196833/196830) [501,448]/[501] r=-1 lpr=196833 pi=[196378,196833)/1 crt=196821'122308 lcod 0'0 remapped NOTIFY] start_peering_interval up [501,448] -> [501,448], acting [501,448] -> [501], acting_primary 501 -> 501, up_primary 501 -> 501, role -1 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:14.982114 7f8bea2a2700  1 osd.52 pg_epoch: 196835 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196832/196835/196830) [501,448] r=-1 lpr=196835 pi=[196378,196835)/1 crt=196821'122308 lcod 0'0 unknown NOTIFY] start_peering_interval up [501,448] -> [501,448], acting [501] -> [501,448], acting_primary 501 -> 501, up_primary 501 -> 501, role -1 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:14.982371 7f8bea2a2700  1 osd.52 pg_epoch: 196852 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196852/196852/196852) [448] r=-1 lpr=196852 pi=[196378,196852)/2 crt=196821'122308 lcod 0'0 unknown NOTIFY] start_peering_interval up [501,448] -> [448], acting [501,448] -> [448], acting_primary 501 -> 448, up_primary 501 -> 448, role -1 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:14.982595 7f8bea2a2700  1 osd.52 pg_epoch: 196867 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196867/196867/196852) [448,635] r=-1 lpr=196867 pi=[196378,196867)/2 crt=196821'122308 lcod 0'0 unknown NOTIFY] start_peering_interval up [448] -> [448,635], acting [448] -> [448,635], acting_primary 448 -> 448, up_primary 448 -> 448, role -1 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:14.982636 7f8bea2a2700  1 osd.52 pg_epoch: 196868 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196867/196868/196852) [448,635]/[448] r=-1 lpr=196868 pi=[196378,196868)/2 crt=196821'122308 lcod 0'0 remapped NOTIFY] start_peering_interval up [448,635] -> [448,635], acting [448,635] -> [448], acting_primary 448 -> 448, up_primary 448 -> 448, role -1 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:14.982700 7f8bea2a2700  1 osd.52 pg_epoch: 196869 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196867/196868/196852) [448,635]/[448] r=-1 lpr=196868 pi=[196378,196868)/2 crt=196821'122308 lcod 0'0 remapped NOTIFY] state<Start>: transitioning to Stray
2019-05-20 09:24:15.340453 7f8be9aa1700  1 osd.52 pg_epoch: 196871 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196867/196871/196852) [448,635] r=-1 lpr=196871 pi=[196378,196871)/2 crt=196821'122308 lcod 0'0 unknown NOTIFY] start_peering_interval up [448,635] -> [448,635], acting [448] -> [448,635], acting_primary 448 -> 448, up_primary 448 -> 448, role -1 -> -1, features acting 4611087853745930235 upacting 4611087853745930235
2019-05-20 09:24:15.341313 7f8be9aa1700  1 osd.52 pg_epoch: 196909 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 196379/196379/0 196867/196871/196852) [448,635] r=-1 lpr=196871 pi=[196378,196871)/2 crt=196821'122308 lcod 0'0 unknown NOTIFY] state<Start>: transitioning to Stray

From the logs it seems that once osd.52 came back online it tried to peer with all OSDs that were part of the past intervals; I assume it does this to build a list of objects that changed while it was away, so it can then sync them from the currently acting OSDs.
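
As the "peering_blocked_by" hint in the query output suggests, the two ways forward appear to be getting osd.501 running again or declaring it lost. A rough sketch (assuming a systemd deployment, and accepting that marking the OSD lost may discard writes that only ever reached osd.501):

——8<——8<——
# preferred: bring osd.501 back so peering can complete on its own
systemctl start ceph-osd@501

# last resort: declare osd.501 permanently lost so peering can proceed
# using the surviving replicas (may lose the most recent writes)
ceph osd lost 501 --yes-i-really-mean-it
——8<——8<——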

I have two questions:

1. Is my understanding of the chain of events here correct? osd.52 already had some data for the PG, so it tried reaching out to all OSDs that held that data between its failure and recovery, to get a list of "objects" to download instead of downloading everything?
2. Is that a limitation of running a pool with min_size = 1? Why didn't osd.52 simply give up on peering, drop all its prior knowledge of the PG in question, and re-download everything from the currently acting OSDs? Or is this a bug rather than a limitation?
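
(For context, the obvious mitigation on our side would be to raise the replication settings so that a single surviving copy never accepts writes on its own; <poolname> is again a placeholder:)

——8<——8<——
ceph osd pool set <poolname> size 3
ceph osd pool set <poolname> min_size 2
——8<——8<——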

Best Regards,
  Chris

