The reason of recovery_unfound pg

Satoru Takeuchi <satoru.takeuchi@xxxxxxxxx> · Sat, 21 Aug 2021 00:25:32 +0900

Hi,

I found an `active+recovery_unfound+undersized+degraded+remapped` pg
after restarting
all nodes one by one. Could anyone give some hints why this problem
happened and how
to restore my data?

I read some documents and searched Ceph issues, but I couldn't find
enough information.

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#unfound-objects
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html-single/troubleshooting_guide/index#unfound-objects_diag

All OSDs are `IN` and `UP`. In addition, no OSD are added/removed
during node reboot.

I know I can use `ceph pg mark_unfound_lost`  as a last resort, but I
hesitate to do
that is because the lost PG is a part of RGW's bucket index.

# additional information

## softwares

- Ceph: v16.2.4
- Rook: v1.6.3

## the result of some commands

### ceph -s

```command
ceph -s
  cluster:
    id:     b160a475-c579-46a2-9346-416d3a229c5f
    health: HEALTH_ERR
            8/47565926 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 142103/142697778 objects
degraded (0.100%), 1 pg degraded, 1 pg undersized
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum eb,ef,eg (age 9h)
    mgr: a(active, since 9h), standbys: b
    osd: 18 osds: 18 up (since 13h), 18 in (since 13h); 1 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   12 pools, 185 pgs
    objects: 47.57M objects, 3.1 TiB
    usage:   20 TiB used, 111 TiB / 131 TiB avail
    pgs:     142103/142697778 objects degraded (0.100%)
             8/47565926 objects unfound (0.000%)
             181 active+clean
             2   active+clean+scrubbing+deep+repair
             1   active+clean+scrubbing
             1   active+recovery_unfound+undersized+degraded+remapped

  io:
    client:   114 KiB/s rd, 5.7 MiB/s wr, 22 op/s rd, 279 op/s wr
```

### ceph health detail

```
ceph health detail
HEALTH_ERR 8/47565953 objects unfound (0.000%); Possible data damage:
1 pg recovery_unfound; Degraded data redundancy: 142103/142697859
objects degraded (0.100%), 1 pg degraded, 1 pg undersized; 1 daemons
have recently crashed
[WRN] OBJECT_UNFOUND: 8/47565953 objects unfound (0.000%)
    pg 10.1 has 8 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 10.1 is active+recovery_unfound+undersized+degraded+remapped,
acting [3,13], 8 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 142103/142697859 objects
degraded (0.100%), 1 pg degraded, 1 pg undersized
    pg 10.1 is stuck undersized for 14h, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[3,13]
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    client.rgw.ceph.hdd.object.store.a crashed on host
rook-ceph-rgw-ceph-hdd-object-store-a-6d5d75c87c-z9rxz at
2021-08-16T17:24:05.471655Z
```

### ceph pg query

```
ceph pg 10.1 query
...
    "up": [
        11,
        13,
        3
    ],
    "acting": [
        3,
        13
    ],
    "backfill_targets": [
        "11"
    ],
    "acting_recovery_backfill": [
        "3",
        "11",
        "13"
    ],
...
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2021-08-16T11:24:12.402345+0000",
            "might_have_unfound": [
                {
                    "osd": "2",
                    "status": "already probed"
                },
                {
                    "osd": "7",
                    "status": "already probed"
                },
                {
                    "osd": "11",
                    "status": "already probed"
                },
                {
                    "osd": "12",
                    "status": "already probed"
                },
                {
                    "osd": "13",
                    "status": "already probed"
                },
                {
                    "osd": "15",
                    "status": "already probed"
                }
            ],
```

### ceph pg ls

```
ceph pg ls | grep -v 'active+clean'
PG     OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*
  OMAP_KEYS*  LOG    STATE
    SINCE  VERSION      REPORTED         UP             ACTING
SCRUB_STAMP                      DEEP_SCRUB_STAMP
10.1    142087    142103          0        8            0
87077343246   300322300   9716
active+recovery_unfound+undersized+degraded+remapped    14h
46437'28097849  46437:121092131   [11,13,3]p11       [3,13]p3
2021-08-15T00:10:21.615494+0000  2021-08-10T16:42:39.172001+0000
```

### ceph osd tree

```
ceph osd tree
ID   CLASS  WEIGHT     TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
 -1         130.99301  root default
 -4          43.66434      zone rack0
 -3           7.27739          host 10-69-0-22
  0    hdd    7.27739              osd.0             up   1.00000  1.00000
-19           7.27739          host 10-69-0-23
  6    hdd    7.27739              osd.6             up   1.00000  1.00000
 -7           7.27739          host 10-69-0-24
  1    hdd    7.27739              osd.1             up   1.00000  1.00000
-27           7.27739          host 10-69-0-28
 13    hdd    7.27739              osd.13            up   1.00000  1.00000
 -9           7.27739          host 10-69-0-29
  2    hdd    7.27739              osd.2             up   1.00000  1.00000
-15           7.27739          host 10-69-0-30
  4    hdd    7.27739              osd.4             up   1.00000  1.00000
-24          43.66434      zone rack1
-41                 0          host 10-69-0-214
-33           7.27739          host 10-69-0-215
 12    hdd    7.27739              osd.12            up   1.00000  1.00000
-35                 0          host 10-69-0-217
-45           7.27739          host 10-69-0-218
  9    hdd    7.27739              osd.9             up   1.00000  1.00000
-29          14.55478          host 10-69-0-220
 10    hdd    7.27739              osd.10            up   1.00000  1.00000
 17    hdd    7.27739              osd.17            up   1.00000  1.00000
-23           7.27739          host 10-69-0-221
  8    hdd    7.27739              osd.8             up   1.00000  1.00000
-31           7.27739          host 10-69-0-222
 11    hdd    7.27739              osd.11            up   1.00000  1.00000
-12          43.66434      zone rack2
-21           7.27739          host 10-69-1-151
```

Thanks,
Satoru
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx