Dear List,

since my update yesterday from 14.2.18 to 14.2.20 I have an unhealthy cluster. As far as I remember, it appeared after rebooting the second server. There are 7 missing (unfound) objects in PGs of a cache pool (pool 3). This pool's cache mode has been changed from writeback to proxy, and I am not able to flush all objects.
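For reference, the cache-mode change and the flush attempt were done with commands along these lines (the pool name "cachepool" below is only a placeholder for the actual cache pool, pool 3):

    # switch the cache tier from writeback to proxy (no new objects get cached)
    ceph osd tier cache-mode cachepool proxy
    # try to flush and evict all objects from the cache tier
    rados -p cachepool cache-flush-evict-all

The flush never gets through all objects.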
root@scvirt06:/home/urzadmin/ceph_issue# ceph -s
  cluster:
    id:     5349724e-fa96-4fd6-8e44-8da2a39253f7
    health: HEALTH_ERR
            7/15893342 objects unfound (0.000%)
            Possible data damage: 7 pgs recovery_unfound
            Degraded data redundancy: 21/47680026 objects degraded (0.000%), 7 pgs degraded, 7 pgs undersized
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum scvirt03,scvirt06,scvirt01 (age 19h)
    mgr: scvirt04(active, since 21m), standbys: scvirt03, scvirt02
    mds: scfs:1 {0=scvirt04=up:active} 1 up:standby-replay 1 up:standby
    osd: 54 osds: 54 up (since 17m), 54 in (since 10w); 7 remapped pgs

  task status:
    scrub status:
        mds.scvirt03: idle

  data:
    pools:   5 pools, 704 pgs
    objects: 15.89M objects, 49 TiB
    usage:   139 TiB used, 145 TiB / 285 TiB avail
    pgs:     21/47680026 objects degraded (0.000%)
             7/15893342 objects unfound (0.000%)
             694 active+clean
             7   active+recovery_unfound+undersized+degraded+remapped
             3   active+clean+scrubbing+deep

  io:
    client: 3.7 MiB/s rd, 6.6 MiB/s wr, 40 op/s rd, 31 op/s wr

My cluster:

  scvirt01 - mon, osds
  scvirt02 - mgr, osds
  scvirt03 - mon, mgr, mds, osds
  scvirt04 - mgr, mds, osds
  scvirt05 - osds
  scvirt06 - mon, mds, osds

Log of osd.49:

root@scvirt03:/home/urzadmin# tail -f /var/log/ceph/ceph-osd.49.log
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.64 GB write, 0.01 MB/s write, 0.54 GB read, 0.01 MB/s read, 6.5 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** File Read Latency Histogram By Level [default] **
2021-06-24 08:53:08.865 7f88ab86c700 -1 log_channel(cluster) log [ERR] : 3.9 has 1 objects unfound and apparently lost
2021-06-24 08:53:08.865 7f88a505f700 -1 log_channel(cluster) log [ERR] : 3.1e has 1 objects unfound and apparently lost
2021-06-24 08:53:40.570 7f88ab86c700 -1 log_channel(cluster) log [ERR] : 3.9 has 1 objects unfound and apparently lost
2021-06-24 08:53:40.570 7f88a9067700 -1 log_channel(cluster) log [ERR] : 3.1e has 1 objects unfound and apparently lost
2021-06-24 08:54:45.042 7f88b487e700  4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2021-06-24 08:54:45.042 7f88b487e700  4 rocksdb: [db/db_impl.cc:778] ** DB Stats **
Uptime(secs): 85202.3 total, 600.0 interval
Cumulative writes: 1148K writes, 8640K keys, 1148K commit groups, 1.0 writes per commit group, ingest: 1.24 GB, 0.01 MB/s
Cumulative WAL: 1148K writes, 546K syncs, 2.10 writes per sync, written: 1.24 GB, 0.01 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 369 writes, 1758 keys, 369 commit groups, 1.0 writes per commit group, ingest: 0.41 MB, 0.00 MB/s
Interval WAL: 369 writes, 155 syncs, 2.37 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level  Files    Size     Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn  KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     3/0   104.40 MB   0.8      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0     67.8      2.89              2.70         6    0.482      0       0
  L1     2/0   131.98 MB   0.5      0.2    0.1      0.1       0.2      0.1       0.0   1.8    149.9    120.9      1.53              1.41         1    1.527  2293K    140K
  L2    16/0   871.57 MB   0.3      0.3    0.1      0.3       0.3     -0.0       0.0   5.2    158.1    132.3      2.05              1.93         1    2.052  3997K   1089K
 Sum    21/0     1.08 GB   0.0      0.5    0.2      0.4       0.6      0.2       0.0   3.3     85.5    100.8      6.47              6.03         8    0.809  6290K   1229K
 Int     0/0     0.00 KB   0.0      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000      0       0
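Since the log reports the objects as "unfound and apparently lost", the commands I know of for inspecting the affected PGs are roughly the following, using pg 3.1e as an example (I have not run the destructive last step):

    # list the unfound objects and which OSDs have been probed for them
    ceph pg 3.1e list_unfound
    # full PG state, including the up/acting sets and might_have_unfound OSDs
    ceph pg 3.1e query
    # last resort per the docs: give up on the unfound objects (revert or delete)
    # ceph pg 3.1e mark_unfound_lost revert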
If I run "ceph pg repair 3.1e", nothing changes, and I do not understand why these PGs are undersized. All OSDs are up.

ceph.conf:

[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.10.144.0/24
     filestore_xattr_use_omap = true
     fsid = 5349724e-fa96-4fd6-8e44-8da2a39253f7
     mon_allow_pool_delete = true
     mon_cluster_log_file_level = info
     mon_host = 172.26.8.151,172.26.8.153,172.26.8.156
     osd_journal_size = 5120
     osd_pool_default_min_size = 1
     public_network = 172.26.8.128/26

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.scvirt03]
     host = scvirt03
     mds_standby_for_rank = 0
     mds_standby_replay = true

[mds.scvirt04]
     host = scvirt04
     mds standby for name = pve

[mds.scvirt06]
     host = scvirt06
     mds_standby_for_rank = 0
     mds_standby_replay = true

[mon.scvirt01]
     public_addr = 172.26.8.151

[mon.scvirt03]
     public_addr = 172.26.8.153

[mon.scvirt06]
     public_addr = 172.26.8.156

ceph health detail:

HEALTH_ERR 7/15893333 objects unfound (0.000%); Possible data damage: 7 pgs recovery_unfound; Degraded data redundancy: 21/47679999 objects degraded (0.000%), 7 pgs degraded, 7 pgs undersized; client is using insecure global_id reclaim; mons are allowing insecure global_id reclaim
OBJECT_UNFOUND 7/15893333 objects unfound (0.000%)
    pg 3.1e has 1 unfound objects
    pg 3.1f has 1 unfound objects
    pg 3.1b has 1 unfound objects
    pg 3.15 has 1 unfound objects
    pg 3.16 has 1 unfound objects
    pg 3.b has 1 unfound objects
    pg 3.9 has 1 unfound objects
PG_DAMAGED Possible data damage: 7 pgs recovery_unfound
    pg 3.9 is active+recovery_unfound+undersized+degraded+remapped, acting [49,52], 1 unfound
    pg 3.b is active+recovery_unfound+undersized+degraded+remapped, acting [43,52], 1 unfound
    pg 3.15 is active+recovery_unfound+undersized+degraded+remapped, acting [44,52], 1 unfound
    pg 3.16 is active+recovery_unfound+undersized+degraded+remapped, acting [43,51], 1 unfound
    pg 3.1b is active+recovery_unfound+undersized+degraded+remapped, acting [43,52], 1 unfound
    pg 3.1e is active+recovery_unfound+undersized+degraded+remapped, acting [49,51], 1 unfound
    pg 3.1f is active+recovery_unfound+undersized+degraded+remapped, acting [48,51], 1 unfound
PG_DEGRADED Degraded data redundancy: 21/47679999 objects degraded (0.000%), 7 pgs degraded, 7 pgs undersized
    pg 3.9 is stuck undersized for 64516.343966, current state active+recovery_unfound+undersized+degraded+remapped, last acting [49,52]
    pg 3.b is stuck undersized for 64516.351507, current state active+recovery_unfound+undersized+degraded+remapped, last acting [43,52]
    pg 3.15 is stuck undersized for 64521.368841, current state active+recovery_unfound+undersized+degraded+remapped, last acting [44,52]
    pg 3.16 is stuck undersized for 64516.351599, current state active+recovery_unfound+undersized+degraded+remapped, last acting [43,51]
    pg 3.1b is stuck undersized for 64517.427120, current state active+recovery_unfound+undersized+degraded+remapped, last acting [43,52]
    pg 3.1e is stuck undersized for 64521.369635, current state active+recovery_unfound+undersized+degraded+remapped, last acting [49,51]
    pg 3.1f is stuck undersized for 64517.426392, current state active+recovery_unfound+undersized+degraded+remapped, last acting [48,51]
AUTH_INSECURE_GLOBAL_ID_RECLAIM client is using insecure global_id reclaim
    client.admin at 172.26.8.154:0/3925203408 is using insecure global_id reclaim
    mds.scvirt04 at [v2:172.26.8.154:6836/3778505565,v1:172.26.8.154:6837/3778505565] is using insecure global_id reclaim
AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED mons are allowing insecure global_id reclaim
    mon.scvirt03 has auth_allow_insecure_global_id_reclaim set to true
    mon.scvirt06 has auth_allow_insecure_global_id_reclaim set to true
    mon.scvirt01 has auth_allow_insecure_global_id_reclaim set to true

ceph osd tree:

ID  CLASS  WEIGHT    TYPE NAME          STATUS REWEIGHT PRI-AFF
 -1        284.51312 root default
 -2         48.75215     host scvirt01
  0   hdd    9.09560         osd.0          up  1.00000 1.00000
  3   hdd    9.09560         osd.3          up  1.00000 1.00000
  6   hdd    9.09560         osd.6          up  1.00000 1.00000
  9   hdd    9.09560         osd.9          up  1.00000 1.00000
 12   hdd    9.09560         osd.12         up  1.00000 1.00000
 42  nvme    0.97029         osd.42         up  1.00000 1.00000
 43  nvme    0.97029         osd.43         up  1.00000 1.00000
 44  nvme    0.97029         osd.44         up  1.00000 1.00000
 37   ssd    0.36330         osd.37         up  1.00000 1.00000
 -3         48.75215     host scvirt02
  1   hdd    9.09560         osd.1          up  1.00000 1.00000
  4   hdd    9.09560         osd.4          up  1.00000 1.00000
  7   hdd    9.09560         osd.7          up  1.00000 1.00000
 10   hdd    9.09560         osd.10         up  1.00000 1.00000
 13   hdd    9.09560         osd.13         up  1.00000 1.00000
 45  nvme    0.97029         osd.45         up  1.00000 1.00000
 46  nvme    0.97029         osd.46         up  1.00000 1.00000
 47  nvme    0.97029         osd.47         up  1.00000 1.00000
 38   ssd    0.36330         osd.38         up  1.00000 1.00000
 -4         48.75224     host scvirt03
  2   hdd    9.09569         osd.2          up  1.00000 1.00000
  5   hdd    9.09560         osd.5          up  1.00000 1.00000
  8   hdd    9.09560         osd.8          up  1.00000 1.00000
 11   hdd    9.09560         osd.11         up  1.00000 1.00000
 14   hdd    9.09560         osd.14         up  1.00000 1.00000
 48  nvme    0.97029         osd.48         up  1.00000 1.00000
 49  nvme    0.97029         osd.49         up  1.00000 1.00000
 50  nvme    0.97029         osd.50         up  1.00000 1.00000
 39   ssd    0.36330         osd.39         up  1.00000 1.00000
 -9         56.75706     host scvirt04
 15   hdd    9.09560         osd.15         up  1.00000 1.00000
 17   hdd    9.09560         osd.17         up  1.00000 1.00000
 20   hdd    9.09560         osd.20         up  1.00000 1.00000
 22   hdd    9.09560         osd.22         up  1.00000 1.00000
 23   hdd    9.09560         osd.23         up  1.00000 1.00000
 25   hdd    3.63860         osd.25         up  1.00000 1.00000
 26   hdd    3.63860         osd.26         up  1.00000 1.00000
 27   hdd    3.63860         osd.27         up  1.00000 1.00000
 40   ssd    0.36330         osd.40         up  1.00000 1.00000
-11         56.75706     host scvirt05
 16   hdd    9.09560         osd.16         up  1.00000 1.00000
 18   hdd    9.09560         osd.18         up  1.00000 1.00000
 19   hdd    9.09560         osd.19         up  1.00000 1.00000
 21   hdd    9.09560         osd.21         up  1.00000 1.00000
 24   hdd    9.09560         osd.24         up  1.00000 1.00000
 28   hdd    3.63860         osd.28         up  1.00000 1.00000
 29   hdd    3.63860         osd.29         up  1.00000 1.00000
 30   hdd    3.63860         osd.30         up  1.00000 1.00000
 41   ssd    0.36330         osd.41         up  1.00000 1.00000
-13         24.74245     host scvirt06
 31   hdd    3.63860         osd.31         up  1.00000 1.00000
 32   hdd    3.63860         osd.32         up  1.00000 1.00000
 33   hdd    3.63860         osd.33         up  1.00000 1.00000
 34   hdd    3.63860         osd.34         up  1.00000 1.00000
 35   hdd    3.63860         osd.35         up  1.00000 1.00000
 36   hdd    3.63860         osd.36         up  1.00000 1.00000
 51  nvme    0.97029         osd.51         up  1.00000 1.00000
 52  nvme    0.97029         osd.52         up  1.00000 1.00000
 53  nvme    0.97029         osd.53         up  1.00000 1.00000

Regards,

Vadim

--
Vadim Bulst
Universität Leipzig / URZ
04109 Leipzig, Augustusplatz 10

phone: +49-341-97-33380
mail:  vadim.bulst@xxxxxxxxxxxxxx