Did you enable the sortbitwise flag as per the upgrade instructions? There is a known bug with it, and I don't know why those instructions haven't been amended in light of it: http://tracker.ceph.com/issues/16113

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Csaba Tóth
> Sent: 13 June 2016 16:17
> To: ceph-users@xxxxxxxx
> Subject: strange unfounding of PGs
>
> Hi!
>
> I have a very strange problem. On Friday night I upgraded my small Ceph cluster from Hammer to Jewel.
> Everything went well, but chowning the OSD data dirs took a long time, so I skipped two OSDs and used
> the run-as-root trick for them. Yesterday evening I wanted to fix this, so I shut down the first OSD
> and chowned its lib/ceph dir. But when I started it back up, a lot of strange "unfound" errors
> appeared (this is just a small excerpt):
>
> 2016-06-12 23:43:05.096078 osd.2 [ERR] 5.3d has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.096915 osd.2 [ERR] 5.30 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.097702 osd.2 [ERR] 5.39 has 4 objects unfound and apparently lost
> 2016-06-12 23:43:05.100449 osd.2 [ERR] 5.2f has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.104519 osd.2 [ERR] 1.8 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.106041 osd.2 [ERR] 5.3f has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.107379 osd.2 [ERR] 1.76 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.107630 osd.2 [ERR] 1.0 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.107661 osd.2 [ERR] 2.14 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.107722 osd.2 [ERR] 2.3 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.108082 osd.2 [ERR] 5.16 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.108417 osd.2 [ERR] 5.38 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.108910 osd.2 [ERR] 1.43 has 3 objects unfound and apparently lost
> 2016-06-12 23:43:05.109561 osd.2 [ERR] 1.a has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.110299 osd.2 [ERR] 1.10 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.111781 osd.2 [ERR] 1.22 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.111869 osd.2 [ERR] 1.1a has 3 objects unfound and apparently lost
> 2016-06-12 23:43:05.205688 osd.4 [ERR] 1.29 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.206016 osd.4 [ERR] 1.1c has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.206219 osd.4 [ERR] 5.24 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.209013 osd.4 [ERR] 1.6a has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.209421 osd.4 [ERR] 1.68 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.209597 osd.4 [ERR] 5.d has 3 objects unfound and apparently lost
> 2016-06-12 23:43:05.209620 osd.4 [ERR] 1.9 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.210191 osd.4 [ERR] 5.62 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.210649 osd.4 [ERR] 2.57 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.212011 osd.4 [ERR] 1.6 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.212106 osd.4 [ERR] 2.b has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.212212 osd.4 [ERR] 5.8 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.215850 osd.4 [ERR] 2.56 has 2 objects unfound and apparently lost
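(Interjecting here: for any of the PGs above you can get more detail on what exactly is unfound and where the OSD has looked for it. These are standard commands; the PG id is just taken from your log, substitute your own:

    ceph health detail         # lists every PG that currently has unfound objects
    ceph pg 5.3d query         # peering details; the "might_have_unfound" section shows which OSDs were probed
    ceph pg 5.3d list_missing  # should list the individual unfound objects in that PG

)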
>
> After these error messages I see this ceph health:
>
> 2016-06-12 23:44:10.498613 7f5941e0f700 0 log_channel(cluster) log [INF] : pgmap v23122505: 820 pgs: 1 peering, 37 active+degraded, 5 active+remapped+wait_backfill, 167 active+recovery_wait+degraded, 1 active+remapped, 1 active+recovering+degraded, 13 active+undersized+degraded+remapped+wait_backfill, 595 active+clean; 795 GB data, 1926 GB used, 5512 GB / 7438 GB avail; 7695 B/s wr, 2 op/s; 24459/3225218 objects degraded (0.758%); 44435/3225218 objects misplaced (1.378%); 346/1231022 unfound (0.028%)
>
> A few minutes later it stalled in this state:
>
> 2016-06-13 00:07:32.761265 7f5941e0f700 0 log_channel(cluster) log [INF] : pgmap v23123311: 820 pgs: 1 active+recovery_wait+undersized+degraded+remapped, 1 active+recovering+degraded, 11 active+undersized+degraded+remapped+wait_backfill, 5 active+remapped+wait_backfill, 207 active+recovery_wait+degraded, 595 active+clean; 795 GB data, 1878 GB used, 5559 GB / 7438 GB avail; 14164 B/s wr, 3 op/s; 22562/3223912 objects degraded (0.700%); 38738/3223912 objects misplaced (1.202%); 566/1231222 unfound (0.046%)
>
> But if I shut that OSD down I see this health (Ceph actually stalls in this state and does nothing further):
>
> 2016-06-13 16:47:59.033552 mon.0 [INF] pgmap v23153361: 820 pgs: 32 active+recovery_wait+degraded, 1 active+recovering+degraded, 402 active+undersized+degraded+remapped+wait_backfill, 385 active+clean; 796 GB data, 1420 GB used, 4160 GB / 5581 GB avail; 10110 B/s rd, 1098 kB/s wr, 253 op/s; 692323/3215439 objects degraded (21.531%); 684099/3215439 objects misplaced (21.275%); 2/1231399 unfound (0.000%)
>
> So I have kept that OSD shut down... that way my cluster has only 2 unfound objects.
>
> There are many more unfound objects when the OSD is up than when it is down. I don't understand this; please help me figure out what to do to fix it.
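Regarding the sortbitwise question at the top of my reply: you can check whether the flag is currently set with

    ceph osd dump | grep flags    # look for "sortbitwise" in the flags line

and if it is set and your symptoms match the tracker issue, unsetting it may be worth trying; please read the ticket first before touching a production cluster:

    ceph osd unset sortbitwise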
>
> Currently every RBD is reachable (though one virtual host crashed during the night), but some objects in my CephFS are starting to become unavailable.
>
> I read about ceph-objectstore-tool and checked whether I can fix anything with it; here is the output of the fix-lost operation, in case it helps:
>
> root@c22:/var/lib/ceph# sudo -u ceph ceph-objectstore-tool --op fix-lost --dry-run --data-path /var/lib/ceph/osd/ceph-0
> Error getting attr on : 1.48_head,#-3:12000000:::scrub_1.48:head#, (61) No data available
> Error getting attr on : 1.79_head,#-3:9e000000:::scrub_1.79:head#, (61) No data available
> Error getting attr on : 2.53_head,#-4:ca000000:::scrub_2.53:head#, (61) No data available
> Error getting attr on : 2.6b_head,#-4:d6000000:::scrub_2.6b:head#, (61) No data available
> Error getting attr on : 2.73_head,#-4:ce000000:::scrub_2.73:head#, (61) No data available
> Error getting attr on : 4.16_head,#-6:68000000:::scrub_4.16:head#, (61) No data available
> Error getting attr on : 4.2d_head,#-6:b4000000:::scrub_4.2d:head#, (61) No data available
> Error getting attr on : 4.55_head,#-6:aa000000:::scrub_4.55:head#, (61) No data available
> Error getting attr on : 4.57_head,#-6:ea000000:::scrub_4.57:head#, (61) No data available
> Error getting attr on : 6.17_head,#-8:e8000000:::scrub_6.17:head#, (61) No data available
> Error getting attr on : 6.46_head,#-8:62000000:::scrub_6.46:head#, (61) No data available
> Error getting attr on : 6.53_head,#-8:ca000000:::scrub_6.53:head#, (61) No data available
> Error getting attr on : 6.62_head,#-8:46000000:::scrub_6.62:head#, (61) No data available
> dry-run: Nothing changed
>
> Originally I wanted to run "ceph-objectstore-tool --op filestore-repair-orphan-links", as Sam suggested, but the latest 9.2.1 ceph didn't contain that op.
>
> Thanks in advance!
> Csaba
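A couple of practical notes on the above; the commands are only a rough sketch for a Jewel filestore OSD, so adjust the OSD id placeholder and the service commands for your setup. ceph-objectstore-tool needs the OSD daemon to be stopped while it runs, and the same applies when you finish the ownership change on the two OSDs you skipped:

    systemctl stop ceph-osd@<id>                     # or your init system's equivalent
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-<id>   # this is the slow step
    systemctl start ceph-osd@<id>

If you would rather postpone the chown, the supported form of the "run-as-root trick" is (if I remember the release notes correctly) to add the following to ceph.conf, which makes each daemon run as whoever owns its data directory, so un-chowned OSDs keep running as root:

    setuser match path = /var/lib/ceph/$type/$cluster-$id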