On Wed, 24 May 2017, Łukasz Chrustek wrote:
> Hi,
> 
> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> Hi,
> >> 
> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> Hi,
> >> >> 
> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> >> I haven't slept for over 30 hours and still can't find a solution.
> >> >> >> I did as you wrote, but turning off these OSDs
> >> >> >> (https://pastebin.com/1npBXeMV) didn't resolve the issue...
> >> >> 
> >> >> > The important bit is:
> >> >> 
> >> >> >             "blocked": "peering is blocked due to down osds",
> >> >> >             "down_osds_we_would_probe": [
> >> >> >                 6,
> >> >> >                 10,
> >> >> >                 33,
> >> >> >                 37,
> >> >> >                 72
> >> >> >             ],
> >> >> >             "peering_blocked_by": [
> >> >> >                 {
> >> >> >                     "osd": 6,
> >> >> >                     "current_lost_at": 0,
> >> >> >                     "comment": "starting or marking this osd lost may let us proceed"
> >> >> >                 },
> >> >> >                 {
> >> >> >                     "osd": 10,
> >> >> >                     "current_lost_at": 0,
> >> >> >                     "comment": "starting or marking this osd lost may let us proceed"
> >> >> >                 },
> >> >> >                 {
> >> >> >                     "osd": 37,
> >> >> >                     "current_lost_at": 0,
> >> >> >                     "comment": "starting or marking this osd lost may let us proceed"
> >> >> >                 },
> >> >> >                 {
> >> >> >                     "osd": 72,
> >> >> >                     "current_lost_at": 113771,
> >> >> >                     "comment": "starting or marking this osd lost may let us proceed"
> >> >> >                 }
> >> >> >             ]
> >> >> >         },

These are the osds (6, 10, 37, 72).

> >> >> > Are any of those OSDs startable?
> >> >> 
> >> >> They were all up and running, but I decided to shut them down and out
> >> >> them from Ceph. Now it looks like Ceph is working OK, but two PGs are
> >> >> still in the down state. How do I get rid of them?
> >> 
> >> > If you haven't deleted the data, you should start the OSDs back up.
> >> 
> >> > If they are partially damaged you can use ceph-objectstore-tool to
> >> > extract just the PGs in question to make sure you haven't lost anything,
> >> > inject them on some other OSD(s) and restart those, and *then* mark the
> >> > bad OSDs as 'lost'.
> >> 
> >> > If all else fails, you can just mark those OSDs 'lost', but in doing so
> >> > you might be telling the cluster to lose data.
> >> 
> >> > The best thing to do is definitely to get those OSDs started again.
> >> 
> >> Now the situation looks like this:
> >> 
> >> [root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
> >> rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
> >>         size 500 GB in 128000 objects
> >>         order 22 (4096 kB objects)
> >>         block_name_prefix: rbd_data.ed9d394a851426
> >>         format: 2
> >>         features: layering
> >>         flags:
> >> 
> >> [root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
> >> (output cut)
> >> rbd_data.ed9d394a851426.000000000000447c
> >> rbd_data.ed9d394a851426.0000000000010857
> >> rbd_data.ed9d394a851426.000000000000ec8b
> >> rbd_data.ed9d394a851426.000000000000fa43
> >> rbd_data.ed9d394a851426.000000000001ef2d
> >> ^C
> >> 
> >> It hangs on this object and doesn't go any further. rbd cp also hangs...
> >> rbd map does too...
> >> 
> >> Can you advise what the solution for this case could be?
> 
> > The hang is due to OSD throttling (see my first reply for how to work
> > around that and get a pg query). But you already did that, and the cluster
> > told you which OSDs it needs to see up in order for it to peer and
> > recover. If you haven't destroyed those disks, you should start those
> > OSDs and it should be fine. If you've destroyed the data or the disks are
> > truly broken and dead, then you can mark those OSDs lost and the cluster
> > will *maybe* recover (but hard to say given the information you've shared).
> 
> > sage
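For reference, the export/inject workflow described above would look roughly
like the following on a FileStore cluster of that era. This is only a sketch:
the OSD ids, data/journal paths, target OSD, and systemd unit names are
illustrative, and ceph-objectstore-tool must be run against stopped OSDs.

    # Stop the damaged OSD and export one of the stuck PGs from it
    # (assumes default FileStore paths; adjust for your layout).
    systemctl stop ceph-osd@6
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
        --journal-path /var/lib/ceph/osd/ceph-6/journal \
        --op export --pgid 1.165 --file /tmp/pg-1.165.export

    # Import the PG into a healthy (also stopped) OSD, then start it again.
    systemctl stop ceph-osd@48
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-48 \
        --journal-path /var/lib/ceph/osd/ceph-48/journal \
        --op import --file /tmp/pg-1.165.export
    systemctl start ceph-osd@48

    # Only once the data is safe (or truly gone), mark the dead OSD lost.
    ceph osd lost 6 --yes-i-really-mean-it

Marking an OSD lost without exporting its PGs first is exactly the "telling
the cluster to lose data" case mentioned above, so the export step is worth
the effort whenever the disks are still readable.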
> 
> What information can I bring to you to determine whether it is recoverable?
> 
> Here are ceph -s and ceph health detail:
> 
> [root@cc1 ~]# ceph -s
>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
>      health HEALTH_WARN
>             2 pgs down
>             2 pgs peering
>             2 pgs stuck inactive
>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
>      osdmap e115431: 100 osds: 89 up, 86 in; 1 remapped pgs
>       pgmap v67641261: 4032 pgs, 18 pools, 26706 GB data, 4855 kobjects
>             76705 GB used, 107 TB / 182 TB avail
>                 4030 active+clean
>                    1 down+remapped+peering
>                    1 down+peering
>   client io 5704 kB/s rd, 24685 kB/s wr, 49 op/s rd, 165 op/s wr
> [root@cc1 ~]# ceph health detail
> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
> pg 1.165 is stuck inactive since forever, current state down+peering, last acting [67,88,48]
> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
> pg 1.60 is down+remapped+peering, acting [66,40]
> pg 1.165 is down+peering, acting [67,88,48]
> [root@cc1 ~]#
> 
> -- 
> Regards,
> Łukasz Chrustek
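For completeness, a quick way to tie the hanging rados/rbd operations back to
those two PGs, and to see again why they will not peer, is roughly the
following. These commands are illustrative (the object name is taken from the
listing above), and the pg queries may need the throttling workaround
mentioned earlier in the thread if they hang.

    # Which PG does the object the listing hung near map to, and which OSDs
    # are acting for it?
    ceph osd map volumes rbd_data.ed9d394a851426.000000000001ef2d

    # Ask each stuck PG why peering is blocked (look for "peering_blocked_by"
    # and "down_osds_we_would_probe" in the output).
    ceph pg 1.60 query
    ceph pg 1.165 query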