How can I fix an "object unfound" error?

Simone Lazzaris <simone.lazzaris@xxxxxxx> · Mon, 02 Mar 2020 09:52:59 +0100

Hi there,
I've got a Ceph cluster with 4 nodes, each with nine 4 TB drives.
Last night a disk failed, and unfortunately this led to a kernel panic on the hosting server
(Supermicro: never again).
After a reboot, the cluster started rebalancing.

This morning, I'm in this situation:

root@s3:~# ceph status
  cluster:
    id:     9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
    health: HEALTH_ERR
            1/13122293 objects unfound (0.000%)
            Possible data damage: 1 pg backfill_unfound
            Degraded data redundancy: 1 pg undersized
            27 slow ops, oldest one blocked for 68 sec, osd.5 has slow ops

  services:
    mon: 3 daemons, quorum s1,s2,s3 (age 11h)
    mgr: s1(active, since 6w), standbys: s2, s3
    osd: 36 osds: 35 up (since 11h), 35 in (since 11h); 21 remapped pgs
    rgw: 3 daemons active (s1, s2, s3)

  data:
    pools:   10 pools, 1200 pgs
    objects: 13.12M objects, 41 TiB
    usage:   63 TiB used, 65 TiB / 127 TiB avail
    pgs:     186357/39366879 objects misplaced (0.473%)
             1/13122293 objects unfound (0.000%)
             1179 active+clean
             11   active+remapped+backfilling
             9    active+remapped+backfill_wait
             1    active+backfill_unfound+undersized+remapped

  io:
    client:   42 KiB/s rd, 5.2 MiB/s wr, 43 op/s rd, 11 op/s wr
    recovery: 163 MiB/s, 48 objects/s
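
A quick sanity check on the counters above, as a minimal Python sketch. The numbers are copied from the `ceph status` output; reading the denominator of the "misplaced" line as a count of object copies (about three per object here) is an inference from the ratio, not something the output states.

```python
# Counters copied from the `ceph status` output above.
objects_total = 13_122_293
copies_total = 39_366_879   # denominator of the "objects misplaced" line
misplaced = 186_357
unfound = 1


def percent(part, whole):
    """Return part/whole as a percentage rounded to 3 decimals."""
    return round(100.0 * part / whole, 3)


# Average number of copies per object (inference: replication factor).
copies_per_object = copies_total / objects_total

misplaced_pct = percent(misplaced, copies_total)
unfound_pct = percent(unfound, objects_total)

print(f"copies per object: {copies_per_object:.2f}")
print(f"misplaced: {misplaced_pct}%")
print(f"unfound: {unfound_pct}%")
```

If that reading is right, the single unfound object is one logical object out of about 13 million, which is why it shows as 0.000%.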

One PG is in the "backfill_unfound" state. It is PG 6.36a, which is on server 1; the failed disk
is osd.5, on server 3 (the one that was rebooted after the panic), so I don't understand how the
two are related.
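
One place to look for that relation is the `recovery_state` section of `ceph pg 6.36a query`, which, as far as the Ceph troubleshooting docs describe it, lists the OSDs that might still hold the unfound object under `might_have_unfound`. The following is a minimal Python sketch that assumes that JSON layout. The sample data is hypothetical (it is not output from this cluster) and `osds_to_check` is an illustrative helper name.

```python
import json

# HYPOTHETICAL sample shaped like the `recovery_state` part of
# `ceph pg <pgid> query`; it is NOT real output from this cluster.
SAMPLE_QUERY = """
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "might_have_unfound": [
        {"osd": "5", "status": "osd is down"},
        {"osd": "12", "status": "already probed"}
      ]
    },
    {"name": "Started"}
  ]
}
"""


def osds_to_check(pg_query):
    """Return (osd, status) pairs from every might_have_unfound list."""
    pairs = []
    for state in pg_query.get("recovery_state", []):
        for entry in state.get("might_have_unfound", []):
            pairs.append((str(entry.get("osd")), entry.get("status", "")))
    return pairs


result = osds_to_check(json.loads(SAMPLE_QUERY))
for osd, status in result:
    print(f"osd.{osd}: {status}")
```

An OSD that appears there as down would be a candidate for holding the only remaining copy, which could explain why a disk on server 3 affects a PG whose primary is on server 1.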

This is the unfound object:
root@s3:~# ceph pg 6.36a list_unfound
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "8a257939-05c9-4ba8-9fd3-fb8504226607.4332.4__shadow_.H5AtB0LjzRSbUWy-hnVSLf4fs884okG_1",
                "key": "",
                "snapid": -2,
                "hash": 961006442,
                "max": 0,
                "pool": 6,
                "namespace": ""
            },
            "need": "263'18213",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
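
To make the report easier to read at a glance, here is a minimal Python sketch that parses the output above and prints one line per unfound object. The JSON is copied from this post, with the wrapped oid joined back into a single string (an assumption: the break looks like mail line wrapping). `summarize_unfound` is an illustrative helper name, not part of any Ceph tooling.

```python
import json

# Output of `ceph pg 6.36a list_unfound` as posted above. The wrapped oid
# line is joined into one string (assumption: the break was only a wrap).
LIST_UNFOUND = """
{
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "8a257939-05c9-4ba8-9fd3-fb8504226607.4332.4__shadow_.H5AtB0LjzRSbUWy-hnVSLf4fs884okG_1",
                "key": "",
                "snapid": -2,
                "hash": 961006442,
                "max": 0,
                "pool": 6,
                "namespace": ""
            },
            "need": "263'18213",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
"""


def summarize_unfound(report):
    """Return one small dict per unfound object in the report."""
    rows = []
    for obj in report.get("objects", []):
        rows.append({
            "oid": obj["oid"]["oid"],
            "pool": obj["oid"]["pool"],
            "need": obj["need"],
            "have": obj["have"],
            "known_locations": len(obj.get("locations", [])),
        })
    return rows


summary = summarize_unfound(json.loads(LIST_UNFOUND))
for row in summary:
    print(f"pool {row['pool']}: {row['oid']} "
          f"need={row['need']} have={row['have']} "
          f"locations={row['known_locations']}")
```

Here `have` is `0'0` and `locations` is empty, which I take to mean that no reachable OSD has reported a copy of the needed version.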

How can I handle this error? The docs are not very comforting: as far as I can see, the only
thing to do is to mark the missing object as lost and try to cope with that. I'd prefer not to.

Any ideas?

*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazzaris@xxxxxxx | www.qcom.it