PGs stuck in peering after power failure

Hi List,
 
I am testing how stable my Ceph cluster is under power failure.

I abruptly powered off 2 Ceph units, each with 90 OSDs, while client I/O was still running.

Since then, some of the PGs in my cluster have been stuck in peering:

      pgmap v3261136: 17408 pgs, 4 pools, 176 TB data, 5082 kobjects
            236 TB used, 5652 TB / 5889 TB avail
            8563455/38919024 objects degraded (22.003%)
               13526 active+undersized+degraded
                3769 active+clean
                 104 down+remapped+peering
                   9 down+peering
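
For a sense of scale, the per-state counts above can be tallied with a few lines of Python. This is only a rough sketch: the helper names `pg_state_counts` and `inactive_pgs` are made up for illustration, and the text is copied from the status output above rather than read from a live cluster.

```python
import re

# PG state lines copied from the status output above.
STATUS = """\
13526 active+undersized+degraded
3769 active+clean
104 down+remapped+peering
9 down+peering
"""

def pg_state_counts(text):
    """Map each PG state string to its count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+(\S+)\s*$", line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts

def inactive_pgs(counts):
    """Sum the PGs whose state does not include the 'active' flag."""
    return sum(n for state, n in counts.items()
               if "active" not in state.split("+"))

counts = pg_state_counts(STATUS)
total = sum(counts.values())
blocked = inactive_pgs(counts)
print(f"{blocked} of {total} PGs are not active ({100 * blocked / total:.2f}%)")
```

The total matches the 17408 PGs reported in the pgmap line, and the PGs that are not active are the 104 + 9 stuck in peering.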

I queried one of the peering PGs (all of them are in an EC pool with 7+2) and got the following blocking information (full query: http://pastebin.com/pRkaMG2h ):

            "probing_osds": [
                "153(6)",
                "183(3)",
                "345(0)",
                "401(7)",
                "516(8)",
                "622(1)",
                "685(2)"
            ],
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                792
            ],
            "peering_blocked_by": [
                {
                    "osd": 792,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed"
                }
            ]
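
To check many PGs at once, the blocking OSDs can be pulled out of the query JSON programmatically. The sketch below assumes the fragment above sits inside an entry of a top-level `recovery_state` list in the `ceph pg <pgid> query` output; that wrapper and the helper name `peering_blockers` are my assumptions, and the sample is trimmed to the fields shown above.

```python
import json

def peering_blockers(pg_query):
    """Return the sorted OSD ids listed under peering_blocked_by.

    Assumes pg_query is the parsed JSON of a PG query, with the
    blocking details nested in entries of a 'recovery_state' list.
    """
    blockers = set()
    for state in pg_query.get("recovery_state", []):
        for entry in state.get("peering_blocked_by", []):
            blockers.add(entry["osd"])
    return sorted(blockers)

# Sample built from the fragment above; the 'recovery_state' wrapper is assumed.
sample = json.loads("""
{
  "recovery_state": [
    {
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [792],
      "peering_blocked_by": [
        {
          "osd": 792,
          "current_lost_at": 0,
          "comment": "starting or marking this osd lost may let us proceed"
        }
      ]
    }
  ]
}
""")

print(peering_blockers(sample))
```

On a real cluster the input would come from saving the query output to a file and loading it with `json.load`.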


osd.792 is on one of the units I powered off, and I think the I/O associated with this PG is paused too.

I have checked the troubleshooting page on the Ceph website ( http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ ); it says that starting the OSD or marking it lost may let peering proceed.

I am sure that my cluster was healthy before the power outage. If a power outage like this really happens in a production environment, will it also freeze my client I/O if I don't do anything? Since I lost only 2 redundant units (I use erasure coding with 7+2), I think the cluster should still function normally.
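
The shard arithmetic behind that expectation can be sketched as follows, using the `probing_osds` list from the query, where each entry reads `osd(shard)`. The helper names are hypothetical, and this only counts shards: it does not model Ceph's peering logic, which may still block for other reasons (for example the PG's history or the pool's `min_size`).

```python
import re

K, M = 7, 2  # erasure-code profile from this post: 7 data + 2 coding shards

# probing_osds copied from the PG query, in "osd(shard)" form.
PROBING = ["153(6)", "183(3)", "345(0)", "401(7)", "516(8)", "622(1)", "685(2)"]

def parse_shard(entry):
    """Split an 'osd(shard)' string into (osd_id, shard_index)."""
    match = re.fullmatch(r"(\d+)\((\d+)\)", entry)
    if match is None:
        raise ValueError(f"unexpected entry: {entry!r}")
    return int(match.group(1)), int(match.group(2))

def shard_summary(entries, k, m):
    """Count distinct shards present and list the missing shard indices."""
    present = {parse_shard(e)[1] for e in entries}
    missing = sorted(set(range(k + m)) - present)
    return {
        "present": len(present),
        "missing": missing,
        # Any k distinct shards are enough to reconstruct the data.
        "reconstructable": len(present) >= k,
    }

print(shard_summary(PROBING, K, M))
```

For this PG, 7 of the 9 shards are reachable and shards 4 and 5 are missing, so the data itself should be reconstructable, but with no redundancy left.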

Or am I doing something wrong? Please give me some suggestions. Thanks.
 
Sincerely,
Craig Chi
