Re: pg stuck in peering while power failure

Hi Sam,

Thank you for your thorough analysis.

I reviewed the log from that time and discovered that the cluster marked an OSD as failed just after I shut the first unit down. So, as you said, the pg couldn't finish peering because the second unit was then shut off suddenly.

I much appreciate your advice, but I aim to keep my cluster working even when 2 storage nodes are down. The OSD was unexpectedly marked failed with the following log entry just as I shut the first unit down:

2017-01-10 12:30:07.905562 mon.1 172.20.1.3:6789/0 28484 : cluster [INF] osd.153 172.20.3.2:6810/26796 failed (2 reporters from different host after 20.072026 >= grace 20.000000)

But that OSD was not actually dead; more likely it was just slow to respond to heartbeats. I think increasing osd_heartbeat_grace may help mitigate the issue.
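
For example, I am thinking of something like the following (the value 30 is only an illustration, I have not tested it on this cluster; as far as I know osd_heartbeat_grace is consulted by both the OSDs and the monitors, so setting it in [global] seems safer):

    # ceph.conf, illustrative value only
    [global]
    osd heartbeat grace = 30

    # or injected at runtime without restarting the daemons
    ceph tell osd.* injectargs '--osd_heartbeat_grace 30'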
 
Sincerely,
Craig Chi
 
On 2017-01-11 00:08, Samuel Just <sjust@xxxxxxxxxx> wrote:
        {
            "name": "Started\/Primary\/Peering",
            "enter_time": "2017-01-10 13:43:34.933074",
            "past_intervals": [
                {
                    "first": 75858,
                    "last": 75860,
                    "maybe_went_rw": 1,
                    "up": [
                        345,
                        622,
                        685,
                        183,
                        792,
                        2147483647,
                        2147483647,
                        401,
                        516
                    ],
                    "acting": [
                        345,
                        622,
                        685,
                        183,
                        792,
                        2147483647,
                        2147483647,
                        401,
                        516
                    ],
                    "primary": 345,
                    "up_primary": 345
                },

Between 75858 and 75860,

                        345,
                        622,
                        685,
                        183,
                        792,
                        2147483647,
                        2147483647,
                        401,
                        516

was the acting set.  The current acting set

                    345,
                    622,
                    685,
                    183,
                    2147483647,
                    2147483647,
                    153,
                    401,
                    516

needs *all 7* of the osds from epochs 75858 through 75860 to ensure
that it has any writes completed during that time.  You can make
transient situations like that less of a problem by setting min_size
to 8 (though it'll prevent writes with 2 failures until backfill
completes).  A possible enhancement for an EC pool would be to gather
the infos from those osds anyway and use that to rule out writes (if they
actually happened, you'd still be stuck).
-Sam
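
(A rough sketch of that min_size change, assuming a hypothetical pool name
"ecpool"; the real name can be listed with "ceph osd pool ls":

    ceph osd pool set ecpool min_size 8
    ceph osd pool get ecpool min_size

The idea is that with min_size 8 on a 7+2 pool, writes cannot complete while
two shards are down, so a later peering never depends on an interval that
only a bare 7 OSDs could have served.)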

On Tue, Jan 10, 2017 at 5:36 AM, Craig Chi <craigchi@xxxxxxxxxxxx> wrote:
> Hi List,
>
> I am testing the stability of my Ceph cluster with power failure.
>
> I brutally powered off 2 Ceph units with each 90 OSDs on it while the client
> I/O was continuing.
>
> Since then, some of the pgs of my cluster have been stuck in peering:
>
>       pgmap v3261136: 17408 pgs, 4 pools, 176 TB data, 5082 kobjects
>             236 TB used, 5652 TB / 5889 TB avail
>             8563455/38919024 objects degraded (22.003%)
>                13526 active+undersized+degraded
>                 3769 active+clean
>                  104 down+remapped+peering
>                    9 down+peering
>
> I queried the peering pgs (all on an EC pool with 7+2) and got the following
> blocked information (full query: http://pastebin.com/pRkaMG2h )
>
>             "probing_osds": [
>                 "153(6)",
>                 "183(3)",
>                 "345(0)",
>                 "401(7)",
>                 "516(8)",
>                 "622(1)",
>                 "685(2)"
>             ],
>             "blocked": "peering is blocked due to down osds",
>             "down_osds_we_would_probe": [
>                 792
>             ],
>             "peering_blocked_by": [
>                 {
>                     "osd": 792,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let us
> proceed"
>                 }
>             ]
>
>
> osd.792 is on one of the units I powered off, and I think the I/O
> associated with this pg is paused too.
>
> I have checked the troubleshooting page on the Ceph website (
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> ); it says that starting the OSD or marking it lost can make the procedure
> continue.
>
> I am sure that my cluster was healthy before the power outage occurred. I am
> wondering: if a power outage really happens in a production environment, will
> it also freeze my client I/O if I don't do anything? Since I only lost 2
> redundancies (I have erasure coding with 7+2), I think the cluster should
> still serve I/O normally.
>
> Or am I doing something wrong? Please give me some suggestions, thanks.
>
> Sincerely,
> Craig Chi
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
