Did you enable the sortbitwise flag as per the upgrade instructions? There is a known bug with it, and I don't know why those instructions haven't been amended in light of it: http://tracker.ceph.com/issues/16113

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Csaba Tóth
> Sent: 13 June 2016 16:17
> To: ceph-users@xxxxxxxx
> Subject: strange unfounding of PGs
>
> Hi!
>
> I have a very strange problem. On Friday night I upgraded my small Ceph cluster from Hammer to Jewel.
> Everything went well, but chowning the OSD data dirs took a long time, so I skipped two OSDs and used
> the run-as-root trick for them. Yesterday evening I wanted to fix this, so I shut down the first OSD
> and chowned its lib/ceph dir. But when I started it back up, a lot of strange "unfound" errors
> appeared (this is just a small excerpt):
>
> 2016-06-12 23:43:05.096078 osd.2 [ERR] 5.3d has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.096915 osd.2 [ERR] 5.30 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.097702 osd.2 [ERR] 5.39 has 4 objects unfound and apparently lost
> 2016-06-12 23:43:05.100449 osd.2 [ERR] 5.2f has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.104519 osd.2 [ERR] 1.8 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.106041 osd.2 [ERR] 5.3f has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.107379 osd.2 [ERR] 1.76 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.107630 osd.2 [ERR] 1.0 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.107661 osd.2 [ERR] 2.14 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.107722 osd.2 [ERR] 2.3 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.108082 osd.2 [ERR] 5.16 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.108417 osd.2 [ERR] 5.38 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.108910 osd.2 [ERR] 1.43 has 3 objects unfound and apparently lost
> 2016-06-12 23:43:05.109561 osd.2 [ERR] 1.a has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.110299 osd.2 [ERR] 1.10 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.111781 osd.2 [ERR] 1.22 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.111869 osd.2 [ERR] 1.1a has 3 objects unfound and apparently lost
> 2016-06-12 23:43:05.205688 osd.4 [ERR] 1.29 has 2 objects unfound and apparently lost
> 2016-06-12 23:43:05.206016 osd.4 [ERR] 1.1c has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.206219 osd.4 [ERR] 5.24 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.209013 osd.4 [ERR] 1.6a has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.209421 osd.4 [ERR] 1.68 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.209597 osd.4 [ERR] 5.d has 3 objects unfound and apparently lost
> 2016-06-12 23:43:05.209620 osd.4 [ERR] 1.9 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.210191 osd.4 [ERR] 5.62 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.210649 osd.4 [ERR] 2.57 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.212011 osd.4 [ERR] 1.6 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.212106 osd.4 [ERR] 2.b has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.212212 osd.4 [ERR] 5.8 has 1 objects unfound and apparently lost
> 2016-06-12 23:43:05.215850 osd.4 [ERR] 2.56 has 2 objects unfound and apparently lost
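(Interjecting here: for any of the PGs above you can get more detail on what exactly is unfound and where the OSD has looked for it. These are standard commands; the PG id is just taken from your log, substitute your own:

    ceph health detail         # lists every PG that currently has unfound objects
    ceph pg 5.3d query         # peering details; the "might_have_unfound" section shows which OSDs were probed
    ceph pg 5.3d list_missing  # should list the individual unfound objects in that PG

)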
>
> After these error messages I see this ceph health:
>
> 2016-06-12 23:44:10.498613 7f5941e0f700 0 log_channel(cluster) log [INF] : pgmap v23122505: 820 pgs: 1 peering, 37 active+degraded, 5 active+remapped+wait_backfill, 167 active+recovery_wait+degraded, 1 active+remapped, 1 active+recovering+degraded, 13 active+undersized+degraded+remapped+wait_backfill, 595 active+clean; 795 GB data, 1926 GB used, 5512 GB / 7438 GB avail; 7695 B/s wr, 2 op/s; 24459/3225218 objects degraded (0.758%); 44435/3225218 objects misplaced (1.378%); 346/1231022 unfound (0.028%)
>
> A few minutes later it stalled in this state:
>
> 2016-06-13 00:07:32.761265 7f5941e0f700 0 log_channel(cluster) log [INF] : pgmap v23123311: 820 pgs: 1 active+recovery_wait+undersized+degraded+remapped, 1 active+recovering+degraded, 11 active+undersized+degraded+remapped+wait_backfill, 5 active+remapped+wait_backfill, 207 active+recovery_wait+degraded, 595 active+clean; 795 GB data, 1878 GB used, 5559 GB / 7438 GB avail; 14164 B/s wr, 3 op/s; 22562/3223912 objects degraded (0.700%); 38738/3223912 objects misplaced (1.202%); 566/1231222 unfound (0.046%)
>
> But if I shut that OSD down I see this health (Ceph actually stalls in this state and does nothing further):
>
> 2016-06-13 16:47:59.033552 mon.0 [INF] pgmap v23153361: 820 pgs: 32 active+recovery_wait+degraded, 1 active+recovering+degraded, 402 active+undersized+degraded+remapped+wait_backfill, 385 active+clean; 796 GB data, 1420 GB used, 4160 GB / 5581 GB avail; 10110 B/s rd, 1098 kB/s wr, 253 op/s; 692323/3215439 objects degraded (21.531%); 684099/3215439 objects misplaced (21.275%); 2/1231399 unfound (0.000%)
>
> So I have kept that OSD shut down... that way my cluster has only 2 unfound objects.
>
> There are many more unfound objects when the OSD is up than when it is down. I don't understand this; please help me figure out what to do to fix it.
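Regarding the sortbitwise question at the top of my reply: you can check whether the flag is currently set with

    ceph osd dump | grep flags    # look for "sortbitwise" in the flags line

and if it is set and your symptoms match the tracker issue, unsetting it may be worth trying; please read the ticket first before touching a production cluster:

    ceph osd unset sortbitwise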
>
> Currently every RBD is reachable (though one virtual host crashed during the night), but some objects in my CephFS are starting to become unavailable.
>
> I read about ceph-objectstore-tool and checked whether I can fix anything with it; here is the output of the fix-lost operation, in case it helps:
>
> root@c22:/var/lib/ceph# sudo -u ceph ceph-objectstore-tool --op fix-lost --dry-run --data-path /var/lib/ceph/osd/ceph-0
> Error getting attr on : 1.48_head,#-3:12000000:::scrub_1.48:head#, (61) No data available
> Error getting attr on : 1.79_head,#-3:9e000000:::scrub_1.79:head#, (61) No data available
> Error getting attr on : 2.53_head,#-4:ca000000:::scrub_2.53:head#, (61) No data available
> Error getting attr on : 2.6b_head,#-4:d6000000:::scrub_2.6b:head#, (61) No data available
> Error getting attr on : 2.73_head,#-4:ce000000:::scrub_2.73:head#, (61) No data available
> Error getting attr on : 4.16_head,#-6:68000000:::scrub_4.16:head#, (61) No data available
> Error getting attr on : 4.2d_head,#-6:b4000000:::scrub_4.2d:head#, (61) No data available
> Error getting attr on : 4.55_head,#-6:aa000000:::scrub_4.55:head#, (61) No data available
> Error getting attr on : 4.57_head,#-6:ea000000:::scrub_4.57:head#, (61) No data available
> Error getting attr on : 6.17_head,#-8:e8000000:::scrub_6.17:head#, (61) No data available
> Error getting attr on : 6.46_head,#-8:62000000:::scrub_6.46:head#, (61) No data available
> Error getting attr on : 6.53_head,#-8:ca000000:::scrub_6.53:head#, (61) No data available
> Error getting attr on : 6.62_head,#-8:46000000:::scrub_6.62:head#, (61) No data available
> dry-run: Nothing changed
>
> Originally I wanted to run "ceph-objectstore-tool --op filestore-repair-orphan-links", as Sam suggested, but the latest 9.2.1 ceph didn't contain that op.
>
> Thanks in advance!
> Csaba
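A couple of practical notes on the above; the commands are only a rough sketch for a Jewel filestore OSD, so adjust the OSD id placeholder and the service commands for your setup. ceph-objectstore-tool needs the OSD daemon to be stopped while it runs, and the same applies when you finish the ownership change on the two OSDs you skipped:

    systemctl stop ceph-osd@<id>                     # or your init system's equivalent
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-<id>   # this is the slow step
    systemctl start ceph-osd@<id>

If you would rather postpone the chown, the supported form of the "run-as-root trick" is (if I remember the release notes correctly) to add the following to ceph.conf, which makes each daemon run as whoever owns its data directory, so un-chowned OSDs keep running as root:

    setuser match path = /var/lib/ceph/$type/$cluster-$id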