Hi List,
I am testing the stability of my Ceph cluster under power failure.
I brutally powered off two Ceph nodes, each hosting 90 OSDs, while client I/O was continuing.
Since then, some of the PGs in my cluster have been stuck in peering:
pgmap v3261136: 17408 pgs, 4 pools, 176 TB data, 5082 kobjects
236 TB used, 5652 TB / 5889 TB avail
8563455/38919024 objects degraded (22.003%)
13526 active+undersized+degraded
3769 active+clean
104 down+remapped+peering
9 down+peering
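As a quick sanity check (my own arithmetic, not from the pgmap itself), the degraded percentage reported above is just the ratio of degraded object instances to total object instances:

```python
# Check the degraded ratio shown in the pgmap line above.
degraded = 8563455
total = 38919024
pct = degraded / total * 100
print(f"{pct:.3f}%")  # matches the reported 22.003%
```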
I queried the stuck PGs (all on an EC pool with 7+2) and they report being blocked (full query: http://pastebin.com/pRkaMG2h ):
"probing_osds": [
"153(6)",
"183(3)",
"345(0)",
"401(7)",
"516(8)",
"622(1)",
"685(2)"
],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
792
],
"peering_blocked_by": [
{
"osd": 792,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}
]
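To pull the blocking OSDs out of such a query programmatically, here is a minimal sketch I use (the JSON below is just the relevant fields trimmed from the `ceph pg <pgid> query` output above):

```python
import json

# Trimmed example of the fields we care about from `ceph pg <pgid> query`.
query = json.loads("""
{
  "blocked": "peering is blocked due to down osds",
  "down_osds_we_would_probe": [792],
  "peering_blocked_by": [
    {"osd": 792, "current_lost_at": 0,
     "comment": "starting or marking this osd lost may let us proceed"}
  ]
}
""")

# Print each OSD that is blocking peering, with the advice Ceph attaches.
for entry in query.get("peering_blocked_by", []):
    print(f"blocked by osd.{entry['osd']}: {entry['comment']}")
```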
osd.792 sits on one of the nodes I powered off, and I believe the client I/O associated with this PG is paused as well.
I have checked the troubleshooting page on the Ceph website ( http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ ); it says that starting the OSD or marking it lost will let peering proceed.
I am sure my cluster was healthy before the power outage. What I am wondering is: if such a power outage happened in a production environment, would it also freeze my client I/O if I did nothing? Since I only lost 2 redundancies (my erasure code is 7+2), I would expect the cluster to keep serving normally.
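My understanding of the erasure-code availability math, as a sketch under my own assumptions (k=7, m=2; I believe min_size defaults to k+1 on EC pools, which I have not yet verified with `ceph osd pool get <pool> min_size`):

```python
# Sketch of EC availability reasoning for a k=7, m=2 profile.
# Assumption: a PG stays active only while at least min_size shards are
# up; I believe min_size defaults to k + 1 on EC pools (unverified).
K, M = 7, 2
TOTAL = K + M

def pg_state(up_shards: int, min_size: int = K + 1) -> str:
    """Rough PG availability given the number of surviving shards."""
    if up_shards >= min_size:
        return "active"
    if up_shards >= K:
        # Enough shards to reconstruct data, but below min_size.
        return "readable but below min_size (I/O blocked)"
    return "data unavailable until shards recover"

for lost in range(0, 4):
    print(lost, "shards lost ->", pg_state(TOTAL - lost))
```

If this assumption is right, losing 2 shards of 9 leaves exactly k=7 survivors, which is enough to reconstruct the data but one short of min_size, and that would explain why I/O freezes even though no data is lost.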
Or am I doing something wrong? Please give me some suggestions, thanks.
Sincerely,
Craig Chi