`ceph health detail` will give you more information on the blocked requests. Depending on what that shows, you can often find the OSD that is causing the problems. But your biggest problem is that you have disks with potentially inconsistent data in your cluster.
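Something along these lines is a rough sketch of where I would start (osd.12 below is just a placeholder, substitute whichever OSD IDs the health output actually names):

    # which OSDs the blocked requests are stuck on
    ceph health detail | grep blocked

    # list the stuck/incomplete PGs and their acting sets without querying each one
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # per-OSD commit/apply latency; a sick disk usually stands out here
    ceph osd perf

    # on the node hosting a suspect OSD, via the admin socket
    ceph daemon osd.12 dump_ops_in_flight

Once you know which OSDs the requests are blocked on, compare that with the list of incomplete PGs; incomplete PGs cannot serve IO, so they are likely what is blocking your client reads.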
On Sun, Sep 2, 2018, 4:42 AM Lee <lquince@xxxxxxxxx> wrote:
Running 0.94.5 as part of an OpenStack environment; our Ceph setup is 3x OSD nodes and 3x MON nodes. Yesterday we had an aircon outage in our hosting environment: one OSD node failed (offline, with the journal SSD dead), leaving us with 2 nodes running correctly. Two hours later a second OSD node failed, complaining of read/write errors on the physical drives; I assume this was a heat issue, as when rebooted it came back online OK and Ceph started to repair itself. We have since brought the first failed node back by replacing the SSD and recreating the journals, hoping it would all repair. Our pools are min 2 repl.
The problem we have is that client IO (reads) is totally blocked, and when I query the stuck PGs it just hangs.
For example, the check version command just errors with "Error EINTR: problem getting command descriptions from" on various OSDs, so I cannot even query the inactive PGs.

root@node31-a4:~# ceph -s
    cluster 7c24e1b9-24b3-4a1b-8889-9b2d7fd88cd2
     health HEALTH_WARN
            83 pgs backfill
            2 pgs backfill_toofull
            3 pgs backfilling
            48 pgs degraded
            1 pgs down
            31 pgs incomplete
            1 pgs recovering
            29 pgs recovery_wait
            1 pgs stale
            48 pgs stuck degraded
            31 pgs stuck inactive
            1 pgs stuck stale
            148 pgs stuck unclean
            17 pgs stuck undersized
            17 pgs undersized
            599 requests are blocked > 32 sec
            recovery 111489/4697618 objects degraded (2.373%)
            recovery 772268/4697618 objects misplaced (16.440%)
            recovery 1/2171314 unfound (0.000%)
     monmap e5: 3 mons at {bc07s12-a7=172.27.16.11:6789/0,bc07s13-a7=172.27.16.21:6789/0,bc07s14-a7=172.27.16.15:6789/0}
            election epoch 198, quorum 0,1,2 bc07s12-a7,bc07s14-a7,bc07s13-a7
     osdmap e18727: 25 osds: 25 up, 25 in; 90 remapped pgs
      pgmap v70996322: 1792 pgs, 13 pools, 8210 GB data, 2120 kobjects
            16783 GB used, 6487 GB / 23270 GB avail
            111489/4697618 objects degraded (2.373%)
            772268/4697618 objects misplaced (16.440%)
            1/2171314 unfound (0.000%)
                1639 active+clean
                  66 active+remapped+wait_backfill
                  30 incomplete
                  25 active+recovery_wait+degraded
                  15 active+undersized+degraded+remapped+wait_backfill
                   4 active+recovery_wait+degraded+remapped
                   4 active+clean+scrubbing
                   2 active+remapped+wait_backfill+backfill_toofull
                   1 down+incomplete
                   1 active+remapped+backfilling
                   1 active+clean+scrubbing+deep
                   1 stale+active+undersized+degraded
                   1 active+undersized+degraded+remapped+backfilling
                   1 active+degraded+remapped+backfilling
                   1 active+recovering+degraded
recovery io 29385 kB/s, 7 objects/s
  client io 5877 B/s wr, 1 op/s
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com