unfound objects - why and how to recover? (bonus: jewel logs)

Hi,

 

--

First, let me start with the bonus…

I migrated from hammer => jewel and followed the migration instructions… but the migration instructions are missing this:

# chown -R ceph:ceph /var/log/ceph

I just discovered this was the reason I could find no logs anywhere about my current issue :/
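
For anyone else doing the same migration, a minimal sketch of the fix, assuming systemd-managed jewel daemons (the restart is just so the daemons reopen their log files):

# chown -R ceph:ceph /var/log/ceph
# systemctl restart ceph.target     # or restart the individual ceph-mon@* / ceph-osd@* units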

--

 

This is maybe the 3rd time this has happened to me… This time I'd like to try to understand what is going on.

 

So: ceph-10.2.0-0.el7.x86_64 + CentOS 7.2 here.

Ceph health was happy, but any rbd operation was hanging; in effect ceph was hung, and so were the test VMs running on it.

 

I placed my VMs in an EC pool, on top of which I overlaid an RBD pool on SSDs.

The EC pool is defined as a 3+1 (k=3, m=1) pool, with 5 hosts hosting the OSDs, and the failure domain is set to host.
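
For reference, a rough sketch of how such a setup is typically put together on jewel (pool names, pg counts and crush rules here are illustrative, not my exact commands):

ceph osd erasure-code-profile set ec31 k=3 m=1 ruleset-failure-domain=host
ceph osd pool create ecpool 1024 1024 erasure ec31
ceph osd pool create ssdcache 128 128 replicated
ceph osd tier add ecpool ssdcache
ceph osd tier cache-mode ssdcache writeback
ceph osd tier set-overlay ecpool ssdcache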

 

"ceph -w" wasn't displaying new status lines as usual, but ceph health (detail) wasn't reporting anything wrong.

While looking around, I found that the ceph logs were empty on one node, so I decided to restart the OSDs on that node using: systemctl restart ceph-osd@*
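
In hindsight, before restarting it might have been worth checking whether those OSDs were still answering on their admin sockets, something like this (run on that node, osd.NN being whatever OSD ids live there):

# ceph daemon osd.NN dump_ops_in_flight
# ceph daemon osd.NN dump_historic_ops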

 

After restarting those OSDs, ceph -w came back to life, but told me there was a dead MON, which I restarted too.

I watched some kind of recovery happening, and after a few seconds/minutes, I now see:

 

[root@ceph0 ~]# ceph health detail

HEALTH_WARN 4 pgs degraded; 3 pgs recovering; 1 pgs recovery_wait; 4 pgs stuck unclean; recovery 57/373846 objects degraded (0.015%); recovery 57/110920 unfound (0.051%)

pg 691.65 is stuck unclean for 310704.556119, current state active+recovery_wait+degraded, last acting [44,99,69,9]

pg 691.1e5 is stuck unclean for 493631.370697, current state active+recovering+degraded, last acting [77,43,20,99]

pg 691.12a is stuck unclean for 14521.475478, current state active+recovering+degraded, last acting [42,56,7,106]

pg 691.165 is stuck unclean for 14521.474525, current state active+recovering+degraded, last acting [21,71,24,117]

pg 691.165 is active+recovering+degraded, acting [21,71,24,117], 15 unfound

pg 691.12a is active+recovering+degraded, acting [42,56,7,106], 1 unfound

pg 691.1e5 is active+recovering+degraded, acting [77,43,20,99], 2 unfound

pg 691.65 is active+recovery_wait+degraded, acting [44,99,69,9], 39 unfound

recovery 57/373846 objects degraded (0.015%)

recovery 57/110920 unfound (0.051%)

 

Damn.

Last time this happened, I was forced to declare the PGs lost in order to get back to a "healthy" ceph, because ceph does not want to revert PGs in EC pools. But one of the VMs then started hanging randomly on disk IOs…
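
(For reference, the command I mean is, if I remember right, "ceph pg <pgid> mark_unfound_lost revert|delete"; revert is the variant that gets refused on EC pools, which only leaves delete:)

ceph pg 691.65 mark_unfound_lost revert     # refused on an EC pool, as far as I can tell
ceph pg 691.65 mark_unfound_lost delete     # drops the unfound objects for good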

That same VM is now down, and I can't remove its disk from rbd; the removal hangs at 99%. I could work around that by renaming the image and re-installing the VM on a new disk, but anyway, I'd like to understand this, fix it, and make sure it does not happen again.

We sometimes suffer power cuts here: if merely restarting daemons kills ceph data, I don't dare imagine what a power cut would do…

 

Back to the unfound objects. I have no OSD down that is still in the cluster (only one is down, osd.46, and I brought it down myself after setting its weight to 0 last week).
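
For the record, that can be double-checked with something like:

# ceph osd tree | grep down
# ceph osd dump | grep '^osd.46'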

I can query the PGs, but I don’t understand what I see in there.

For instance:

 

# ceph pg 691.65 query

(…)

                "num_objects_missing": 0,

                "num_objects_degraded": 39,

                "num_objects_misplaced": 0,

                "num_objects_unfound": 39,

                "num_objects_dirty": 138,

 

And then for 2 peers I see :

                "state": "active+undersized+degraded", ## undersized ???

(…)

                    "num_objects_missing": 0,

                    "num_objects_degraded": 138,

                    "num_objects_misplaced": 138,

                    "num_objects_unfound": 0,

                    "num_objects_dirty": 138,

                "blocked_by": [],

                "up_primary": 44,

                "acting_primary": 44

 

 

If I look at the "missing" objects, I can see something on some OSDs:

# ceph pg 691.165 list_missing

(…)

{

            "oid": {

                "oid": "rbd_data.8de32431bd7b7.0000000000000ea7",

                "key": "",

                "snapid": -2,

                "hash": 971513189,

                "max": 0,

                "pool": 691,

                "namespace": ""

            },

            "need": "26521'22595",

            "have": "25922'22575",

            "locations": []

        }

 

All of the missing objects show this "need/have" discrepancy (the versions are epoch'version, so the PG needs a more recent version of the object than the newest copy it can currently locate).

 

I can see such objects in a "691.165" directory on the secondary OSDs, but I do not see any 691.165 directory on the primary OSD (44)… ?

For instance:

[root@ceph0 ~]# ll /var/lib/ceph/osd/ceph-21/current/691.165s0_head/*8de32431bd7b7.0000000000000ea7*

-rw-r--r-- 1 ceph ceph 1399392 May 15 13:18 /var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_5843_0

-rw-r--r-- 1 ceph ceph 1399392 May 27 11:07 /var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_ffffffffffffffff_0
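
If it comes to digging further, I suppose the shard could be inspected (or exported) offline with ceph-objectstore-tool, along these lines, assuming filestore with the default paths and the OSD stopped first:

# systemctl stop ceph-osd@21
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 --journal-path /var/lib/ceph/osd/ceph-21/journal --pgid 691.165s0 --op list
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 --journal-path /var/lib/ceph/osd/ceph-21/journal --pgid 691.165s0 --op export --file /root/691.165s0.export
# systemctl start ceph-osd@21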

 

Even so: assuming I really did lose data on that OSD 44 (how??), I would expect ceph to be able to reconstruct the missing data/PG thanks to the erasure coding / the RBD pool replicas, yet it looks like it is not willing to?

I already know that telling ceph to forget about the lost PGs is not a good idea, as it will cause the VMs using them to hang afterwards… and I'd rather see ceph as a rock-solid solution that lets one recover from such "usual" operations… ?

 

If anyone has ideas, I'd be happy to hear them… should I kill osd.44 for good and recreate it?
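
In case recreating it is the way to go (and only once the unfound objects are dealt with, I suppose), my understanding of the usual jewel sequence would be roughly this, with /dev/sdX as a placeholder for its data disk:

ceph osd out 44
systemctl stop ceph-osd@44
ceph osd crush remove osd.44
ceph auth del osd.44
ceph osd rm 44
ceph-disk prepare --cluster ceph /dev/sdX     # then ceph-disk activate, or let udev pick it up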

 

Thanks

 

P.S.: I already tried:

 

ceph tell osd.44 injectargs '--debug-osd 0/5 --debug-filestore 0/5'

or

ceph tell osd.44 injectargs '--debug-osd 20/20 --debug-filestore 20/20'
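
For what it's worth, I can verify the values were applied, on the node hosting osd.44, with something like:

ceph daemon osd.44 config show | grep -E 'debug_(osd|filestore)'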

 

P.P.S.: I tried this before I found the bonus at the start of this email…

