Re: strange unfounding of PGs

Yes!
After I read the mail I unset it immediately, and now the recovery process has resumed.
After I brought back the OSD I had kept offline, Ceph found the unfound objects, and recovery is now running.
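
(For the archive, the recovery boiled down to roughly this: unset the flag as Christian suggests below, then start the stopped OSD again and watch the unfound count drop; the OSD id here is just an example.)

ceph osd unset sortbitwise
systemctl start ceph-osd@0
ceph -w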

Thanks Nick and Christian, you saved me! :)


Christian Balzer <chibi@xxxxxxx> wrote (on 14 June 2016, Tue, 9:24):
On Tue, 14 Jun 2016 07:09:45 +0000 Csaba Tóth wrote:

> Hi Nick!
> Yes i did. :(
> Do you know how can i fix it?
>
>
Supposedly just by un-setting it:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29651.html
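
That is, if I remember the syntax correctly:

ceph osd unset sortbitwise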

Christian

> Nick Fisk <nick@xxxxxxxxxx> wrote (on 14 June 2016, Tue, 7:52):
>
> > Did you enable the sortbitwise flag as per the upgrade instructions, as
> > there is a known bug with it? I don't know why these instructions
> > haven't been amended in light of this bug.
> >
> > http://tracker.ceph.com/issues/16113
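> >
> > (A quick way to check whether the flag is currently set should be
> > something like this, I believe:)
> >
> > ceph osd dump | grep flags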
> >
> >
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > Behalf Of Csaba Tóth
> > > Sent: 13 June 2016 16:17
> > > To: ceph-users@xxxxxxxx
> > > Subject: strange unfounding of PGs
> > >
> > > Hi!
> > >
> > > I have a very strange problem. On Friday night I upgraded my small
> > > ceph cluster from hammer to jewel. Everything went well, but chowning
> > > the OSD data dirs took a long time, so I skipped two OSDs and used the
> > > run-as-root trick instead.
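> > >
> > > (For reference, the per-OSD chown was along these lines; <id> is a
> > > placeholder, and the run-as-root trick is, if I remember correctly,
> > > the "setuser match path" option from the Jewel release notes:)
> > >
> > > systemctl stop ceph-osd@<id>
> > > chown -R ceph:ceph /var/lib/ceph/osd/ceph-<id>
> > > systemctl start ceph-osd@<id>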
> > > Yesterday evening I wanted to fix this, so I shut down the first of
> > > those OSDs and chowned its lib/ceph dir. But when I started it again,
> > > a lot of strange unfound-object errors appeared (this is just a short
> > > excerpt):
> > >
> > > 2016-06-12 23:43:05.096078 osd.2 [ERR] 5.3d has 2 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.096915 osd.2 [ERR] 5.30 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.097702 osd.2 [ERR] 5.39 has 4 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.100449 osd.2 [ERR] 5.2f has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.104519 osd.2 [ERR] 1.8 has 2 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.106041 osd.2 [ERR] 5.3f has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.107379 osd.2 [ERR] 1.76 has 2 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.107630 osd.2 [ERR] 1.0 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.107661 osd.2 [ERR] 2.14 has 2 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.107722 osd.2 [ERR] 2.3 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.108082 osd.2 [ERR] 5.16 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.108417 osd.2 [ERR] 5.38 has 2 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.108910 osd.2 [ERR] 1.43 has 3 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.109561 osd.2 [ERR] 1.a has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.110299 osd.2 [ERR] 1.10 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.111781 osd.2 [ERR] 1.22 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.111869 osd.2 [ERR] 1.1a has 3 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.205688 osd.4 [ERR] 1.29 has 2 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.206016 osd.4 [ERR] 1.1c has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.206219 osd.4 [ERR] 5.24 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.209013 osd.4 [ERR] 1.6a has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.209421 osd.4 [ERR] 1.68 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.209597 osd.4 [ERR] 5.d has 3 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.209620 osd.4 [ERR] 1.9 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.210191 osd.4 [ERR] 5.62 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.210649 osd.4 [ERR] 2.57 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.212011 osd.4 [ERR] 1.6 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.212106 osd.4 [ERR] 2.b has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.212212 osd.4 [ERR] 5.8 has 1 objects unfound and apparently lost
> > > 2016-06-12 23:43:05.215850 osd.4 [ERR] 2.56 has 2 objects unfound and apparently lost
> > >
> > >
> > > After these error messages I see this ceph health:
> > > 2016-06-12 23:44:10.498613 7f5941e0f700  0 log_channel(cluster) log [INF] :
> > > pgmap v23122505: 820 pgs: 1 peering, 37 active+degraded, 5 active+remapped+wait_backfill,
> > > 167 active+recovery_wait+degraded, 1 active+remapped, 1 active+recovering+degraded,
> > > 13 active+undersized+degraded+remapped+wait_backfill, 595 active+clean;
> > > 795 GB data, 1926 GB used, 5512 GB / 7438 GB avail; 7695 B/s wr, 2 op/s;
> > > 24459/3225218 objects degraded (0.758%); 44435/3225218 objects misplaced (1.378%);
> > > 346/1231022 unfound (0.028%)
> > >
> > > A few minutes later it stalled in this state:
> > > 2016-06-13 00:07:32.761265 7f5941e0f700  0 log_channel(cluster) log [INF] :
> > > pgmap v23123311: 820 pgs: 1 active+recovery_wait+undersized+degraded+remapped,
> > > 1 active+recovering+degraded, 11 active+undersized+degraded+remapped+wait_backfill,
> > > 5 active+remapped+wait_backfill, 207 active+recovery_wait+degraded, 595 active+clean;
> > > 795 GB data, 1878 GB used, 5559 GB / 7438 GB avail; 14164 B/s wr, 3 op/s;
> > > 22562/3223912 objects degraded (0.700%); 38738/3223912 objects misplaced (1.202%);
> > > 566/1231222 unfound (0.046%)
> > >
> > > But if I shut that OSD down I see this health (ceph actually stalls
> > > in this state and does nothing):
> > > 2016-06-13 16:47:59.033552 mon.0 [INF] pgmap v23153361: 820 pgs: 32 active+recovery_wait+degraded,
> > > 1 active+recovering+degraded, 402 active+undersized+degraded+remapped+wait_backfill,
> > > 385 active+clean; 796 GB data, 1420 GB used, 4160 GB / 5581 GB avail;
> > > 10110 B/s rd, 1098 kB/s wr, 253 op/s; 692323/3215439 objects degraded (21.531%);
> > > 684099/3215439 objects misplaced (21.275%); 2/1231399 unfound (0.000%)
> > >
> > > So I have kept that OSD shut down... that way my cluster has only 2
> > > unfound objects...
> > >
> > > There are far more unfound objects when the OSD is up than when I
> > > shut it down. I don't understand this; please tell me what to do to
> > > fix it. Every RBD is still reachable (although one virtual host
> > > crashed during the night), but some objects in my CephFS are starting
> > > to become unavailable.
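> > >
> > > (I assume the right way to inspect these is something like the
> > > following; 1.29 is just an example pgid:)
> > >
> > > ceph health detail | grep unfound
> > > ceph pg 1.29 list_missing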
> > >
> > > I read about ceph-objectstore-tool and checked whether I could fix
> > > anything with it; here is the output of a fix-lost operation, in case
> > > it helps:
> > > root@c22:/var/lib/ceph# sudo -u ceph ceph-objectstore-tool --op fix-lost --dry-run --data-path /var/lib/ceph/osd/ceph-0
> > > Error getting attr on : 1.48_head,#-3:12000000:::scrub_1.48:head#, (61) No data available
> > > Error getting attr on : 1.79_head,#-3:9e000000:::scrub_1.79:head#, (61) No data available
> > > Error getting attr on : 2.53_head,#-4:ca000000:::scrub_2.53:head#, (61) No data available
> > > Error getting attr on : 2.6b_head,#-4:d6000000:::scrub_2.6b:head#, (61) No data available
> > > Error getting attr on : 2.73_head,#-4:ce000000:::scrub_2.73:head#, (61) No data available
> > > Error getting attr on : 4.16_head,#-6:68000000:::scrub_4.16:head#, (61) No data available
> > > Error getting attr on : 4.2d_head,#-6:b4000000:::scrub_4.2d:head#, (61) No data available
> > > Error getting attr on : 4.55_head,#-6:aa000000:::scrub_4.55:head#, (61) No data available
> > > Error getting attr on : 4.57_head,#-6:ea000000:::scrub_4.57:head#, (61) No data available
> > > Error getting attr on : 6.17_head,#-8:e8000000:::scrub_6.17:head#, (61) No data available
> > > Error getting attr on : 6.46_head,#-8:62000000:::scrub_6.46:head#, (61) No data available
> > > Error getting attr on : 6.53_head,#-8:ca000000:::scrub_6.53:head#, (61) No data available
> > > Error getting attr on : 6.62_head,#-8:46000000:::scrub_6.62:head#, (61) No data available
> > > dry-run: Nothing changed
> > >
> > > Originally I wanted to run the "ceph-objectstore-tool --op
> > > filestore-repair-orphan-links" that Sam suggested, but the latest
> > > 9.2.1 ceph doesn't contain it.
> > >
> > > Thanks in advance!
> > > Csaba
> >
> >
> >


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
