Strange unfound objects in PGs

Hi!

I have a very strange problem. On Friday night I upgraded my small Ceph cluster from Hammer to Jewel. Everything went well, but chowning the OSD data directories took a long time, so I skipped two OSDs and used the run-as-root trick instead (the commands I used are sketched after the log excerpt below). Yesterday evening I wanted to fix this, so I shut down the first of those OSDs and chowned its /var/lib/ceph directory. But when I started it back up, a lot of strange unfound-object errors appeared (this is just a small excerpt):

2016-06-12 23:43:05.096078 osd.2 [ERR] 5.3d has 2 objects unfound and apparently lost
2016-06-12 23:43:05.096915 osd.2 [ERR] 5.30 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.097702 osd.2 [ERR] 5.39 has 4 objects unfound and apparently lost
2016-06-12 23:43:05.100449 osd.2 [ERR] 5.2f has 1 objects unfound and apparently lost
2016-06-12 23:43:05.104519 osd.2 [ERR] 1.8 has 2 objects unfound and apparently lost
2016-06-12 23:43:05.106041 osd.2 [ERR] 5.3f has 1 objects unfound and apparently lost
2016-06-12 23:43:05.107379 osd.2 [ERR] 1.76 has 2 objects unfound and apparently lost
2016-06-12 23:43:05.107630 osd.2 [ERR] 1.0 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.107661 osd.2 [ERR] 2.14 has 2 objects unfound and apparently lost
2016-06-12 23:43:05.107722 osd.2 [ERR] 2.3 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.108082 osd.2 [ERR] 5.16 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.108417 osd.2 [ERR] 5.38 has 2 objects unfound and apparently lost
2016-06-12 23:43:05.108910 osd.2 [ERR] 1.43 has 3 objects unfound and apparently lost
2016-06-12 23:43:05.109561 osd.2 [ERR] 1.a has 1 objects unfound and apparently lost
2016-06-12 23:43:05.110299 osd.2 [ERR] 1.10 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.111781 osd.2 [ERR] 1.22 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.111869 osd.2 [ERR] 1.1a has 3 objects unfound and apparently lost
2016-06-12 23:43:05.205688 osd.4 [ERR] 1.29 has 2 objects unfound and apparently lost
2016-06-12 23:43:05.206016 osd.4 [ERR] 1.1c has 1 objects unfound and apparently lost
2016-06-12 23:43:05.206219 osd.4 [ERR] 5.24 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.209013 osd.4 [ERR] 1.6a has 1 objects unfound and apparently lost
2016-06-12 23:43:05.209421 osd.4 [ERR] 1.68 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.209597 osd.4 [ERR] 5.d has 3 objects unfound and apparently lost
2016-06-12 23:43:05.209620 osd.4 [ERR] 1.9 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.210191 osd.4 [ERR] 5.62 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.210649 osd.4 [ERR] 2.57 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.212011 osd.4 [ERR] 1.6 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.212106 osd.4 [ERR] 2.b has 1 objects unfound and apparently lost
2016-06-12 23:43:05.212212 osd.4 [ERR] 5.8 has 1 objects unfound and apparently lost
2016-06-12 23:43:05.215850 osd.4 [ERR] 2.56 has 2 objects unfound and apparently lost
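
(For reference, this is roughly what the chown and the run-as-root trick looked like on my side; osd.0 and the systemd service name are just examples from my setup, and the setuser match path line is the one from the Jewel release notes:)

# per-OSD ownership change after the Hammer -> Jewel upgrade
systemctl stop ceph-osd@0
chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
systemctl start ceph-osd@0

# the run-as-root trick for the two OSDs I skipped: added to the [osd]
# section of ceph.conf so those daemons keep running as root for now
setuser match path = /var/lib/ceph/$type/$cluster-$id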


After these error messages I see this cluster health:
2016-06-12 23:44:10.498613 7f5941e0f700  0 log_channel(cluster) log [INF] : pgmap v23122505: 820 pgs: 1 peering, 37 active+degraded, 5 active+remapped+wait_backfill, 167 active+recovery_wait+degraded, 1 active+remapped, 1 active+recovering+degraded, 13 active+undersized+degraded+remapped+wait_backfill, 595 active+clean; 795 GB data, 1926 GB used, 5512 GB / 7438 GB avail; 7695 B/s wr, 2 op/s; 24459/3225218 objects degraded (0.758%); 44435/3225218 objects misplaced (1.378%); 346/1231022 unfound (0.028%)
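
(To see which PGs report unfound objects, I'm just grepping the detailed health output:)

ceph health detail | grep unfound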

A few minutes later it stalled in this state:
2016-06-13 00:07:32.761265 7f5941e0f700  0 log_channel(cluster) log [INF] : pgmap v23123311: 820 pgs: 1 active+recovery_wait+undersized+degraded+remapped, 1 active+recovering+degraded, 11 active+undersized+degraded+remapped+wait_backfill, 5 active+remapped+wait_backfill, 207 active+recovery_wait+degraded, 595 active+clean; 795 GB data, 1878 GB used, 5559 GB / 7438 GB avail; 14164 B/s wr, 3 op/s; 22562/3223912 objects degraded (0.700%); 38738/3223912 objects misplaced (1.202%); 566/1231222 unfound (0.046%)

But if I shut that OSD down, I see this health instead (Ceph actually stalls in this state and does nothing further):
2016-06-13 16:47:59.033552 mon.0 [INF] pgmap v23153361: 820 pgs: 32 active+recovery_wait+degraded, 1 active+recovering+degraded, 402 active+undersized+degraded+remapped+wait_backfill, 385 active+clean; 796 GB data, 1420 GB used, 4160 GB / 5581 GB avail; 10110 B/s rd, 1098 kB/s wr, 253 op/s; 692323/3215439 objects degraded (21.531%); 684099/3215439 objects misplaced (21.275%); 2/1231399 unfound (0.000%)
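
(To dig into one affected PG I'm using the commands below; pg 1.8 is just one example from the list above:)

ceph pg 1.8 list_missing   # lists the objects this PG considers unfound
ceph pg 1.8 query          # "might_have_unfound" in recovery_state shows which OSDs are still being probed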

So I have kept that OSD shut down... that way my cluster has only 2 unfound objects...

There are many more unfound objects when the OSD is up than when I shut it down. I don't understand this; please help me figure out how to fix it.
Currently every RBD is still reachable (though one virtual host crashed during the night), but some objects in my CephFS are starting to become unavailable.
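
(I have seen that unfound objects can be given up on with the command below, but I'd much rather recover the data, so I have not run it; I'm only listing what I understand to be the last resort:)

# last resort only: revert unfound objects to a previous version,
# or forget them if no previous version exists
ceph pg 1.8 mark_unfound_lost revert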

I read about ceph-objectstore-tool and looked at whether I can fix anything with it; here is the output of a fix-lost dry run, in case it helps:
root@c22:/var/lib/ceph# sudo -u ceph ceph-objectstore-tool --op fix-lost --dry-run --data-path /var/lib/ceph/osd/ceph-0
Error getting attr on : 1.48_head,#-3:12000000:::scrub_1.48:head#, (61) No data available
Error getting attr on : 1.79_head,#-3:9e000000:::scrub_1.79:head#, (61) No data available
Error getting attr on : 2.53_head,#-4:ca000000:::scrub_2.53:head#, (61) No data available
Error getting attr on : 2.6b_head,#-4:d6000000:::scrub_2.6b:head#, (61) No data available
Error getting attr on : 2.73_head,#-4:ce000000:::scrub_2.73:head#, (61) No data available
Error getting attr on : 4.16_head,#-6:68000000:::scrub_4.16:head#, (61) No data available
Error getting attr on : 4.2d_head,#-6:b4000000:::scrub_4.2d:head#, (61) No data available
Error getting attr on : 4.55_head,#-6:aa000000:::scrub_4.55:head#, (61) No data available
Error getting attr on : 4.57_head,#-6:ea000000:::scrub_4.57:head#, (61) No data available
Error getting attr on : 6.17_head,#-8:e8000000:::scrub_6.17:head#, (61) No data available
Error getting attr on : 6.46_head,#-8:62000000:::scrub_6.46:head#, (61) No data available
Error getting attr on : 6.53_head,#-8:ca000000:::scrub_6.53:head#, (61) No data available
Error getting attr on : 6.62_head,#-8:46000000:::scrub_6.62:head#, (61) No data available
dry-run: Nothing changed
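
(While the problematic OSD is stopped, I was also thinking about exporting the affected PGs from its store with ceph-objectstore-tool, in case importing them elsewhere helps later. This is only a sketch; the OSD id, pgid, file name and journal path are examples for my setup:)

# list which PGs this OSD actually holds
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --op list-pgs

# export one PG to a file
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --op export --pgid 1.8 --file /tmp/pg-1.8.export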

Originally I wanted to run "ceph-objectstore-tool --op filestore-repair-orphan-links", which Sam suggested, but the latest 9.2.1 ceph release doesn't contain that operation.
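
(To double-check which operations my installed ceph-objectstore-tool actually supports, I just looked at its help output:)

ceph-objectstore-tool --help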

Thanks in advance!
Csaba

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
