Re: Troubleshooting incomplete PG's

Brad Hubbard <bhubbard@xxxxxxxxxx> · Thu, 30 Mar 2017 12:02:37 +1000

On Thu, Mar 30, 2017 at 4:53 AM, nokia ceph <nokiacephusers@xxxxxxxxx> wrote:
> Hello,
>
> Env:-
> 5 node, EC 4+1 bluestore kraken v11.2.0 , RHEL7.2
>
> As part of our resiliency testing with kraken bluestore, we found that several
> PGs were in the incomplete+remapped state. We tried to repair each PG using
> "ceph pg repair <pgid>", with no luck. We then planned to remove the incomplete
> PGs using the procedure below.
>
>
> #ceph health detail | grep  1.e4b    
> pg 1.e4b is remapped+incomplete, acting [2147483647,66,15,73,2147483647]
> (reducing pool cdvr_ec min_size from 4 may help; search ceph.com/docs for
> 'incomplete')

"Incomplete Ceph detects that a placement group is missing information about
writes that may have occurred, or does not have any healthy copies. If you see
this state, try to start any failed OSDs that may contain the needed
information."
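
To read the acting set in that health line: 2147483647 is 0x7fffffff, the
placeholder Ceph prints when no OSD is mapped to a shard. A minimal sketch of
the count follows. The helper name is hypothetical, and k and m are taken from
the "EC 4+1" description in the original post.

```python
# Placeholder Ceph prints when no OSD is mapped to a shard (0x7fffffff).
CRUSH_ITEM_NONE = 2**31 - 1


def missing_shards(acting):
    """Return the shard positions in an acting set that have no OSD."""
    return [pos for pos, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]


# Acting set of pg 1.e4b as reported by "ceph health detail" above.
acting = [2147483647, 66, 15, 73, 2147483647]
k, m = 4, 1  # EC 4+1 profile from the original post

present = len(acting) - len(missing_shards(acting))
print("shard positions without an OSD:", missing_shards(acting))
print("shards present:", present, "of", k + m, "- k =", k, "are needed")
```

With only three of the five shard positions filled, the PG has fewer than the
k=4 shards an EC 4+1 pool needs, which is consistent with it being incomplete.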

>
> Here we shut down OSDs 66, 15 and 73, then proceeded with the operations below.
>
> #ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-135 --op list-pgs
> #ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-135 --pgid 1.e4b
> --op remove
>
> Please confirm that we are following the correct procedure for removing
> PGs.

There are multiple recent threads about that on this very list; "pgs stuck
inactive", for example.

>
> #ceph pg stat
> v2724830: 4096 pgs: 1 active+clean+scrubbing+deep+repair, 1 down+remapped,
> 21 remapped+incomplete, 4073 active+clean; 268 TB data, 371 TB used, 267 TB
> / 638 TB avail
>
> # ceph -s
> 2017-03-29 18:23:44.288508 7f8c2b8e5700 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore,rocksdb
> 2017-03-29 18:23:44.304692 7f8c2b8e5700 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore,rocksdb
>     cluster bd8adcd0-c36d-4367-9efe-f48f5ab5f108
>      health HEALTH_ERR
>             22 pgs are stuck inactive for more than 300 seconds
>             1 pgs down
>             21 pgs incomplete
>             1 pgs repair
>             22 pgs stuck inactive
>             22 pgs stuck unclean
>      monmap e2: 5 mons at
> {au-adelaide=10.50.21.24:6789/0,au-brisbane=10.50.21.22:6789/0,au-canberra=10.50.21.23:6789/0,au-melbourne=10.50.21.21:6789/0,au-sydney=10.50.21.20:6789/0}
>             election epoch 172, quorum 0,1,2,3,4
> au-sydney,au-melbourne,au-brisbane,au-canberra,au-adelaide
>         mgr active: au-brisbane
>      osdmap e6284: 118 osds: 117 up, 117 in; 22 remapped pgs

What is the status of the down+out osd? What role did/does it play? Most
importantly, is it osd.6?
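
One way to answer that is to filter the JSON form of the OSD map for OSDs that
are not both up and in. This is only a sketch: the sample data is a
hypothetical excerpt, and the field names ("osds", "osd", "up", "in") are my
assumption about the shape of "ceph osd dump --format json" output, so verify
them against your own cluster.

```python
import json


def down_or_out_osds(osd_dump):
    """List (id, up, in) for every OSD that is not both up and in."""
    return [(o["osd"], o["up"], o["in"])
            for o in osd_dump["osds"]
            if not (o["up"] and o["in"])]


# Hypothetical excerpt, NOT taken from the cluster in this thread.
sample = json.loads(
    '{"osds": [{"osd": 5, "up": 1, "in": 1},'
    '          {"osd": 6, "up": 0, "in": 0}]}'
)
print(down_or_out_osds(sample))
```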

>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v2724830: 4096 pgs, 1 pools, 268 TB data, 197 Mobjects
>             371 TB used, 267 TB / 638 TB avail
>                 4073 active+clean
>                   21 remapped+incomplete
>                    1 down+remapped
>                    1 active+clean+scrubbing+deep+repair
>
>
> #ceph osd dump | grep pool
> pool 1 'cdvr_ec' erasure size 5 min_size 4 crush_ruleset 1 object_hash
> rjenkins pg_num 4096 pgp_num 4096 last_change 456 flags
> hashpspool,nodeep-scrub stripe_width 65536
>
>
>
> Can you please suggest whether there is any way to wipe out these incomplete PGs?

See the thread previously mentioned. Take note of the force_create_pg step.

> Why did ceph pg repair fail in this scenario?
> How can we recover incomplete PGs to the active state?
>
> The pg query for the affected PG ended with this error. Can you please explain
> what it means?
> ---
>                 "15(2)",
>                 "66(1)",
>                 "73(3)",
>                 "103(4)",
>                 "113(0)"
>             ],
>             "down_osds_we_would_probe": [
>                 6
>             ],
>             "peering_blocked_by": [],
>             "peering_blocked_by_detail": [
>                 {
>                     "detail": "peering_blocked_by_history_les_bound"
>                 }
> ----

During multiple intervals osd 6 was in the up/acting set, for example:

                {
                    "first": 1608,
                    "last": 1645,
                    "maybe_went_rw": 1,
                    "up": [
                        113,
                        6,
                        15,
                        73,
                        103
                    ],
                    "acting": [
                        113,
                        6,
                        15,
                        73,
                        103
                    ],
                    "primary": 113,
                    "up_primary": 113
                },

Because the PG may have gone rw (accepted writes) during that interval, peering
needs to query osd 6, and since that OSD is down it is blocking progress.

            "blocked_by": [
                6
            ],
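
The check described above can be sketched as a filter over the past intervals
in the pg query output: keep every interval that may have gone rw while a
currently down OSD was in the acting set. The function name is hypothetical,
and the input is just the interval quoted above plus the OSD listed under
"down_osds_we_would_probe".

```python
def blocking_intervals(past_intervals, down_osds):
    """Yield (first, last, osds) for intervals that may have accepted writes
    while one of the currently down OSDs was in the acting set."""
    down = set(down_osds)
    for iv in past_intervals:
        hit = sorted(down & set(iv["acting"]))
        if int(iv["maybe_went_rw"]) and hit:
            yield iv["first"], iv["last"], hit


# The interval quoted above from the pg query output.
intervals = [{
    "first": 1608, "last": 1645, "maybe_went_rw": 1,
    "up": [113, 6, 15, 73, 103],
    "acting": [113, 6, 15, 73, 103],
    "primary": 113, "up_primary": 113,
}]

result = list(blocking_intervals(intervals, [6]))
for first, last, osds in result:
    print(f"epochs {first}-{last}: need info from down osd(s) {osds}")
```

Running the same filter over the full list of intervals from your attached
query output should show every interval for which osd 6 is still needed.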

Setting osd_find_best_info_ignore_history_les to true may help, but then you may
need to mark the missing OSD lost or perform some other trickery. I suspect your
min_size is too low, especially for a cluster of this size, but EC is not an
area I know extensively, so I can't say definitively. Some of your questions may
be better suited to the ceph-devel mailing list, by the way.

>
> Attaching "ceph pg 1.e4b query > /tmp/1.e4b-pg.txt" file with this mail.
>
> Thanks
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Cheers,
Brad