Cześć,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>>
>> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> Cześć,
>> >>
>> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> Cześć,
>> >> >>
>> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> >> I haven't slept for over 30 hours and still can't find a solution. I
>> >> >> >> did as You wrote, but turning off these OSDs
>> >> >> >> (https://pastebin.com/1npBXeMV) didn't resolve the issue...
>> >> >> >>
>> >> >> > The important bit is:
>> >> >> >
>> >> >> >     "blocked": "peering is blocked due to down osds",
>> >> >> >     "down_osds_we_would_probe": [
>> >> >> >         6,
>> >> >> >         10,
>> >> >> >         33,
>> >> >> >         37,
>> >> >> >         72
>> >> >> >     ],
>> >> >> >     "peering_blocked_by": [
>> >> >> >         {
>> >> >> >             "osd": 6,
>> >> >> >             "current_lost_at": 0,
>> >> >> >             "comment": "starting or marking this osd lost may let us proceed"
>> >> >> >         },
>> >> >> >         {
>> >> >> >             "osd": 10,
>> >> >> >             "current_lost_at": 0,
>> >> >> >             "comment": "starting or marking this osd lost may let us proceed"
>> >> >> >         },
>> >> >> >         {
>> >> >> >             "osd": 37,
>> >> >> >             "current_lost_at": 0,
>> >> >> >             "comment": "starting or marking this osd lost may let us proceed"
>> >> >> >         },
>> >> >> >         {
>> >> >> >             "osd": 72,
>> >> >> >             "current_lost_at": 113771,
>> >> >> >             "comment": "starting or marking this osd lost may let us proceed"
>> >> >> >         }
>> >> >> >     ]
>> >> >> > },

These are the OSDs (6, 10, 37, 72).

>> >> >> > Are any of those OSDs startable?

osd 6 isn't startable; osd 10, 37 and 72 are startable.

>> >> >> They were all up and running - but I decided to shut them down and out
>> >> >> them from ceph. Now it looks like ceph is working OK, but two PGs are
>> >> >> still in the down state. How can I get rid of that?
>> >>
>> >> > If you haven't deleted the data, you should start the OSDs back up.

By OSDs backup do You mean copying /var/lib/ceph/osd/ceph-72/* to some
other (non-ceph) disk?

>> >> > If they are partially damaged you can use ceph-objectstore-tool to
>> >> > extract just the PGs in question to make sure you haven't lost anything,
>> >> > inject them on some other OSD(s) and restart those, and *then* mark the
>> >> > bad OSDs as 'lost'.
>> >>
>> >> > If all else fails, you can just mark those OSDs 'lost', but in doing so
>> >> > you might be telling the cluster to lose data.
>> >>
>> >> > The best thing to do is definitely to get those OSDs started again.

There were actions on these PGs that destroyed them. I started those OSDs
(the three that are startable), but that didn't solve the situation. I
should add that there are other pools on this cluster; the problem is only
with the pool that contains the broken/down PGs.
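A rough sketch of the kind of ceph-objectstore-tool run Sage is describing,
assuming filestore OSDs under /var/lib/ceph/osd; the PG id (1.60) and the
target OSD (88) are only illustrative, and both the source and the target
OSD must be stopped while the tool runs:

  # see which PGs are still present on the bad OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-72 \
      --journal-path /var/lib/ceph/osd/ceph-72/journal --op list-pgs

  # export one PG to a file
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-72 \
      --journal-path /var/lib/ceph/osd/ceph-72/journal \
      --pgid 1.60 --op export --file /root/pg-1.60.export

  # import it into a healthy (stopped) OSD, then start that OSD again
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-88 \
      --journal-path /var/lib/ceph/osd/ceph-88/journal \
      --op import --file /root/pg-1.60.export

Only once the PG data is back on a running OSD would the dead OSDs be marked
'lost', in the order Sage gives above.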
>> >> Now the situation looks like this:
>> >>
>> >> [root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
>> >> rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
>> >>         size 500 GB in 128000 objects
>> >>         order 22 (4096 kB objects)
>> >>         block_name_prefix: rbd_data.ed9d394a851426
>> >>         format: 2
>> >>         features: layering
>> >>         flags:
>> >>
>> >> [root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
>> >> (output cut)
>> >> rbd_data.ed9d394a851426.000000000000447c
>> >> rbd_data.ed9d394a851426.0000000000010857
>> >> rbd_data.ed9d394a851426.000000000000ec8b
>> >> rbd_data.ed9d394a851426.000000000000fa43
>> >> rbd_data.ed9d394a851426.000000000001ef2d
>> >> ^C
>> >>
>> >> It hangs on this object and isn't going any further. rbd cp also hangs...
>> >> rbd map - also...
>> >>
>> >> Can You advise what the solution for this case could be?

>> > The hang is due to OSD throttling (see my first reply for how to work
>> > around that and get a pg query). But you already did that and the cluster
>> > told you which OSDs it needs to see up in order for it to peer and
>> > recover. If you haven't destroyed those disks, you should start those
>> > osds and it should be fine. If you've destroyed the data or the disks are
>> > truly broken and dead, then you can mark those OSDs lost and the cluster
>> > *may* recover (but that is hard to say given the information you've shared).

[root@cc1 ~]# ceph osd lost 10 --yes-i-really-mean-it
marked osd lost in epoch 115310
[root@cc1 ~]# ceph osd lost 37 --yes-i-really-mean-it
marked osd lost in epoch 115314
[root@cc1 ~]# ceph osd lost 72 --yes-i-really-mean-it
marked osd lost in epoch 115317
[root@cc1 ~]# ceph -s
    cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
     health HEALTH_WARN
            2 pgs down
            2 pgs peering
            2 pgs stuck inactive
     monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
            election epoch 872, quorum 0,1,2 cc1,cc2,cc3
     osdmap e115434: 100 osds: 89 up, 86 in; 1 remapped pgs
      pgmap v67642483: 4032 pgs, 18 pools, 26713 GB data, 4857 kobjects
            76718 GB used, 107 TB / 182 TB avail
                4030 active+clean
                   1 down+remapped+peering
                   1 down+peering
  client io 14624 kB/s rd, 31619 kB/s wr, 382 op/s rd, 228 op/s wr
[root@cc1 ~]# ceph -s
    cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
     health HEALTH_WARN
            2 pgs down
            2 pgs peering
            2 pgs stuck inactive
     monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
            election epoch 872, quorum 0,1,2 cc1,cc2,cc3
     osdmap e115434: 100 osds: 89 up, 86 in; 1 remapped pgs
      pgmap v67642485: 4032 pgs, 18 pools, 26713 GB data, 4857 kobjects
            76718 GB used, 107 TB / 182 TB avail
                4030 active+clean
                   1 down+remapped+peering
                   1 down+peering
  client io 17805 kB/s rd, 18787 kB/s wr, 215 op/s rd, 107 op/s wr

>> > sage
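One way to check whether marking the OSDs lost actually unblocked peering is
to re-run the pg query on the two stuck PGs (1.60 and 1.165, per the health
detail quoted further down) and look at the same fields as in the earlier
query output; a minimal sketch, assuming the query itself no longer hangs:

  ceph pg 1.60 query  > /tmp/pg-1.60.json
  ceph pg 1.165 query > /tmp/pg-1.165.json
  grep -A 10 '"down_osds_we_would_probe"' /tmp/pg-1.60.json
  grep -A 10 '"peering_blocked_by"'       /tmp/pg-1.60.json

If "peering_blocked_by" is still non-empty, the PGs will remain down until
the OSDs listed there come back up.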
>> What information can I bring to You to determine whether it is recoverable?
>>
>> Here are ceph -s and ceph health detail:
>>
>> [root@cc1 ~]# ceph -s
>>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
>>      health HEALTH_WARN
>>             2 pgs down
>>             2 pgs peering
>>             2 pgs stuck inactive
>>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
>>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
>>      osdmap e115431: 100 osds: 89 up, 86 in; 1 remapped pgs
>>       pgmap v67641261: 4032 pgs, 18 pools, 26706 GB data, 4855 kobjects
>>             76705 GB used, 107 TB / 182 TB avail
>>                 4030 active+clean
>>                    1 down+remapped+peering
>>                    1 down+peering
>>   client io 5704 kB/s rd, 24685 kB/s wr, 49 op/s rd, 165 op/s wr
>> [root@cc1 ~]# ceph health detail
>> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
>> pg 1.165 is stuck inactive since forever, current state down+peering, last acting [67,88,48]
>> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
>> pg 1.60 is down+remapped+peering, acting [66,40]
>> pg 1.165 is down+peering, acting [67,88,48]
>> [root@cc1 ~]#
>>
>> --
>> Regards,
>> Łukasz Chrustek

--
Pozdrowienia,
 Łukasz Chrustek

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
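A follow-up check, only as an illustrative sketch, that ties the hanging
rbd/rados commands earlier in the thread to these two PGs: ask the cluster
where the object the listing stopped on actually maps, using the object name
from the earlier output:

  # which PG and which OSDs the stuck object maps to
  ceph osd map volumes rbd_data.ed9d394a851426.000000000001ef2d

  # current up/acting sets of the two inactive PGs
  ceph pg map 1.60
  ceph pg map 1.165

If the object falls into 1.60 or 1.165, the client hangs are simply I/O
waiting on those inactive PGs rather than a separate problem.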