Hi,
this looks like yet another example of why pool size = 2 is a bad idea.
This has been discussed so many times...
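Just as a general note: once the PG is healthy again, the usual remedy
is size 3 / min_size 2. A minimal sketch, with <pool> as a placeholder
for your pool name:

   ceph osd pool get <pool> size
   ceph osd pool set <pool> size 3
   ceph osd pool set <pool> min_size 2

With size 3 / min_size 2 a single failed OSD neither blocks writes nor
leaves you with only one surviving copy.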
> - If we stop osd.131 the PG becomes inactive and down (like it is
> the only osd containing the objects): Reduced data availability: 1
> pg inactive, 1 pg down
Because it is: Ceph can't find the other replica.
> - Used ceph-objectstore-tool to search for the unfound object
> (rbd_data.ad5ab66b8b4567.0000000000011055) on all osd's involved,
> the object is present only on osd.41 and osd.131 even the PG is
> mapped to other OSD's.
Can you export it and import it on a different OSD [1] [2]? I haven't
tried that myself yet, but maybe it will work; a rough sketch follows
below the links.
[1] https://docs.ceph.com/en/pacific/man/8/ceph-objectstore-tool/
[2]
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7AWMDL5CWKW2WBHM7TVIRLXYJSNS5EIX/
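A minimal sketch of the export/import, assuming default data paths and
osd.141 as the target (adjust IDs and paths to your setup, set noout,
and stop the involved OSDs before running the tool):

   # on the node hosting osd.131, with osd.131 stopped
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-131 \
       --pgid 16.1e --op export --file /tmp/pg16.1e.export

   # on the node hosting osd.141, with osd.141 stopped
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-141 \
       --op import --file /tmp/pg16.1e.export

Then start the OSDs again and let peering/recovery settle; see [1] and
the thread in [2] for details and caveats.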
Quoting Martin Culcea <martin_culcea@xxxxxxxxx>:
Hello,
After a host reboot the cluster could not find an object. The cluster
was in a stable state with all PGs active+clean; no OSD was out, and
no other OSD was restarted during the host reboot. This was one month
ago; we hoped the cluster would eventually find the object, but it did
not.
Cluster version: Ceph 16.2.9, deployed with ceph-deploy, pool size 2.
Attached are ceph.log, the OSD logs, the pg query output and other logs.
Cluster status:
  cluster:
    id:     2517da9e-af62-405e-b71f-1f2e145822f7
    health: HEALTH_ERR
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            1/606943089 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 7252/1219946300 objects degraded (0.001%), 1 pg degraded, 1 pg undersized
            1 pgs not deep-scrubbed in time
            1 pgs not scrubbed in time

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 6560 pgs
    objects: 606.94M objects, 85 TiB
    usage:   169 TiB used, 268 TiB / 438 TiB avail
    pgs:     7252/1219946300 objects degraded (0.001%)
             7250/1219946300 objects misplaced (0.001%)
             1/606943089 objects unfound (0.000%)
             6554 active+clean
             4    active+clean+scrubbing+deep
             1    active+recovery_unfound+undersized+degraded+remapped
             1    active+clean+scrubbing

  io:
    client: 1.2 GiB/s rd, 1.4 GiB/s wr, 40.87k op/s rd, 72.80k op/s wr

  progress:
    Global Recovery Event (2w)
      [===========================.] (remaining: 4m)
ceph health detail:
HEALTH_ERR clients are using insecure global_id reclaim; mons are allowing insecure global_id reclaim; 1/606997573 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 7294/1220048932 objects degraded (0.001%), 1 pg degraded, 1 pg undersized; 1 pgs not deep-scrubbed in time; 1 pgs not scrubbed in time
...
[WRN] OBJECT_UNFOUND: 1/606997573 objects unfound (0.000%)
    pg 16.1e has 1 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 16.1e is active+recovery_unfound+undersized+degraded+remapped, acting [131], 1 unfound
[WRN] PG_DEGRADED: Degraded data redundancy: 7294/1220048932 objects degraded (0.001%), 1 pg degraded, 1 pg undersized
    pg 16.1e is stuck undersized for 3h, current state active+recovery_unfound+undersized+degraded+remapped, last acting [131]
[WRN] PG_NOT_DEEP_SCRUBBED: 1 pgs not deep-scrubbed in time
    pg 16.1e not deep-scrubbed since 2022-06-03T01:20:13.786232+0300
[WRN] PG_NOT_SCRUBBED: 1 pgs not scrubbed in time
    pg 16.1e not scrubbed since 2022-06-09T03:27:36.771392+0300
The PG is acting only on osd.131, even though we remapped the PG to other OSDs:
ceph pg map 16.1e
osdmap e723093 pg 16.1e (16.1e) -> up [41,141] acting [131]
In the ceph osd dump output the PG is mapped as a pg_temp:
ceph osd dump | grep -w 16.1e
pg_temp 16.1e [131]
What we did:
- restarted all OSDs and hosts involved
- forced a deep-scrub on the PG (the PG cannot be scrubbed anymore)
- If we stop osd.131, the PG becomes inactive and down (as if it were
the only OSD containing the objects): Reduced data availability: 1
pg inactive, 1 pg down
- If we take osd.131 out, the PG does not move to a new OSD; the
object remains only on osd.131
- ceph force recovery
- ceph force repeer
- ceph pg repair 16.1e
- Used ceph-objectstore-tool to search for the unfound object
(rbd_data.ad5ab66b8b4567.0000000000011055) on all OSDs involved; the
object is present only on osd.41 and osd.131, even though the PG is
mapped to other OSDs (see the sketch after this list)
- ceph-objectstore-tool
- pg remap: we tried to remap the PG to other OSDs (ceph osd
pg-upmap-items 16.1e 131 141), but the PG does not move to the new
OSDs; it remains on osd.41 and osd.131 (ceph pg map 16.1e: osdmap
e723093 pg 16.1e (16.1e) -> up [41,141] acting [131])
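For reference, such a search can be done roughly like this (one run
per involved OSD, with that OSD stopped while the tool runs; default
data path assumed):

   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-131 \
       --pgid 16.1e --op list | grep ad5ab66b8b4567.0000000000011055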
Why is this happening?
How can we help the cluster find the lost object?
Can we remove pg_temp 16.1e [131] from the osdmap (ceph osd dump)?
Thank you,
Martin Culcea