Hello Cephers,
it is a mystery: my cluster is out of the error state, and I don't really know how. I initiated deep scrubbing on the affected PGs yesterday; maybe that fixed it.
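For reference, this is roughly how per-PG deep scrubs are kicked off (PG IDs taken from the health detail quoted below):

# request a deep scrub on each affected placement group
ceph pg deep-scrub 3.9
ceph pg deep-scrub 3.b
ceph pg deep-scrub 3.15
ceph pg deep-scrub 3.16
ceph pg deep-scrub 3.1b
ceph pg deep-scrub 3.1e
ceph pg deep-scrub 3.1f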
Cheers,
Vadim
On 6/24/21 1:15 PM, Vadim Bulst wrote:
Dear List,
since my update yesterday from 14.2.18 to 14.2.20 I have an unhealthy cluster. If I remember right, it appeared after rebooting the second server. There are 7 missing objects in PGs of a cache pool (pool 3). I have since changed the pool's cache mode from writeback to proxy, but I'm not able to flush all objects.
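For reference, the mode change and the flush attempt follow the usual cache-tiering commands, roughly like this (the cache pool name is a placeholder here):

# switch the cache tier from writeback to proxy mode
ceph osd tier cache-mode <cache-pool> proxy
# then try to flush and evict everything still held in the cache pool
rados -p <cache-pool> cache-flush-evict-all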
root@scvirt06:/home/urzadmin/ceph_issue# ceph -s
  cluster:
    id:     5349724e-fa96-4fd6-8e44-8da2a39253f7
    health: HEALTH_ERR
            7/15893342 objects unfound (0.000%)
            Possible data damage: 7 pgs recovery_unfound
            Degraded data redundancy: 21/47680026 objects degraded (0.000%), 7 pgs degraded, 7 pgs undersized
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum scvirt03,scvirt06,scvirt01 (age 19h)
    mgr: scvirt04(active, since 21m), standbys: scvirt03, scvirt02
    mds: scfs:1 {0=scvirt04=up:active} 1 up:standby-replay 1 up:standby
    osd: 54 osds: 54 up (since 17m), 54 in (since 10w); 7 remapped pgs

  task status:
    scrub status:
        mds.scvirt03: idle

  data:
    pools:   5 pools, 704 pgs
    objects: 15.89M objects, 49 TiB
    usage:   139 TiB used, 145 TiB / 285 TiB avail
    pgs:     21/47680026 objects degraded (0.000%)
             7/15893342 objects unfound (0.000%)
             694 active+clean
             7   active+recovery_unfound+undersized+degraded+remapped
             3   active+clean+scrubbing+deep

  io:
    client: 3.7 MiB/s rd, 6.6 MiB/s wr, 40 op/s rd, 31 op/s wr
my cluster:
scvirt01 - mon,osds
scvirt02 - mgr,osds
scvirt03 - mon,mgr,mds,osds
scvirt04 - mgr,mds,osds
scvirt05 - osds
scvirt06 - mon,mds,osds
log of osd.49:
root@scvirt03:/home/urzadmin# tail -f /var/log/ceph/ceph-osd.49.log
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.64 GB write, 0.01 MB/s write, 0.54 GB read, 0.01 MB/s read, 6.5 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** File Read Latency Histogram By Level [default] **
2021-06-24 08:53:08.865 7f88ab86c700 -1 log_channel(cluster) log [ERR] : 3.9 has 1 objects unfound and apparently lost
2021-06-24 08:53:08.865 7f88a505f700 -1 log_channel(cluster) log [ERR] : 3.1e has 1 objects unfound and apparently lost
2021-06-24 08:53:40.570 7f88ab86c700 -1 log_channel(cluster) log [ERR] : 3.9 has 1 objects unfound and apparently lost
2021-06-24 08:53:40.570 7f88a9067700 -1 log_channel(cluster) log [ERR] : 3.1e has 1 objects unfound and apparently lost
2021-06-24 08:54:45.042 7f88b487e700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2021-06-24 08:54:45.042 7f88b487e700 4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 85202.3 total, 600.0 interval
Cumulative writes: 1148K writes, 8640K keys, 1148K commit groups, 1.0 writes per commit group, ingest: 1.24 GB, 0.01 MB/s
Cumulative WAL: 1148K writes, 546K syncs, 2.10 writes per sync, written: 1.24 GB, 0.01 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 369 writes, 1758 keys, 369 commit groups, 1.0 writes per commit group, ingest: 0.41 MB, 0.00 MB/s
Interval WAL: 369 writes, 155 syncs, 2.37 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 3/0 104.40 MB 0.8 0.0 0.0 0.0 0.2 0.2 0.0 1.0 0.0 67.8 2.89 2.70 6 0.482 0 0
L1 2/0 131.98 MB 0.5 0.2 0.1 0.1 0.2 0.1 0.0 1.8 149.9 120.9 1.53 1.41 1 1.527 2293K 140K
L2 16/0 871.57 MB 0.3 0.3 0.1 0.3 0.3 -0.0 0.0 5.2 158.1 132.3 2.05 1.93 1 2.052 3997K 1089K
Sum 21/0 1.08 GB 0.0 0.5 0.2 0.4 0.6 0.2 0.0 3.3 85.5 100.8 6.47 6.03 8 0.809 6290K 1229K
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
If I run
ceph pg repair 3.1e
it doesn't change anything. I also do not understand why these PGs are undersized; all OSDs are up.
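As far as I understand the documentation on unfound objects, the next diagnostic steps would be something like this (the mark_unfound_lost commands are a last resort, so I have left them commented out):

# show which OSDs were probed for the missing objects and why recovery is blocked
ceph pg 3.1e query
# list the names of the unfound objects in this PG
ceph pg 3.1e list_unfound
# last resort only: revert the objects to an older version, or forget them entirely
# ceph pg 3.1e mark_unfound_lost revert
# ceph pg 3.1e mark_unfound_lost delete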
ceph.conf:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.144.0/24
filestore_xattr_use_omap = true
fsid = 5349724e-fa96-4fd6-8e44-8da2a39253f7
mon_allow_pool_delete = true
mon_cluster_log_file_level = info
mon_host = 172.26.8.151,172.26.8.153,172.26.8.156
osd_journal_size = 5120
osd_pool_default_min_size = 1
public_network = 172.26.8.128/26
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.scvirt03]
host = scvirt03
mds_standby_for_rank = 0
mds_standby_replay = true
[mds.scvirt04]
host = scvirt04
mds standby for name = pve
[mds.scvirt06]
host = scvirt06
mds_standby_for_rank = 0
mds_standby_replay = true
[mon.scvirt01]
public_addr = 172.26.8.151
[mon.scvirt03]
public_addr = 172.26.8.153
[mon.scvirt06]
public_addr = 172.26.8.156
ceph health detail:
HEALTH_ERR 7/15893333 objects unfound (0.000%); Possible data damage: 7 pgs recovery_unfound; Degraded data redundancy: 21/47679999 objects degraded (0.000%), 7 pgs degraded, 7 pgs undersized; client is using insecure global_id reclaim; mons are allowing insecure global_id reclaim
OBJECT_UNFOUND 7/15893333 objects unfound (0.000%)
    pg 3.1e has 1 unfound objects
    pg 3.1f has 1 unfound objects
    pg 3.1b has 1 unfound objects
    pg 3.15 has 1 unfound objects
    pg 3.16 has 1 unfound objects
    pg 3.b has 1 unfound objects
    pg 3.9 has 1 unfound objects
PG_DAMAGED Possible data damage: 7 pgs recovery_unfound
    pg 3.9 is active+recovery_unfound+undersized+degraded+remapped, acting [49,52], 1 unfound
    pg 3.b is active+recovery_unfound+undersized+degraded+remapped, acting [43,52], 1 unfound
    pg 3.15 is active+recovery_unfound+undersized+degraded+remapped, acting [44,52], 1 unfound
    pg 3.16 is active+recovery_unfound+undersized+degraded+remapped, acting [43,51], 1 unfound
    pg 3.1b is active+recovery_unfound+undersized+degraded+remapped, acting [43,52], 1 unfound
    pg 3.1e is active+recovery_unfound+undersized+degraded+remapped, acting [49,51], 1 unfound
    pg 3.1f is active+recovery_unfound+undersized+degraded+remapped, acting [48,51], 1 unfound
PG_DEGRADED Degraded data redundancy: 21/47679999 objects degraded (0.000%), 7 pgs degraded, 7 pgs undersized
    pg 3.9 is stuck undersized for 64516.343966, current state active+recovery_unfound+undersized+degraded+remapped, last acting [49,52]
    pg 3.b is stuck undersized for 64516.351507, current state active+recovery_unfound+undersized+degraded+remapped, last acting [43,52]
    pg 3.15 is stuck undersized for 64521.368841, current state active+recovery_unfound+undersized+degraded+remapped, last acting [44,52]
    pg 3.16 is stuck undersized for 64516.351599, current state active+recovery_unfound+undersized+degraded+remapped, last acting [43,51]
    pg 3.1b is stuck undersized for 64517.427120, current state active+recovery_unfound+undersized+degraded+remapped, last acting [43,52]
    pg 3.1e is stuck undersized for 64521.369635, current state active+recovery_unfound+undersized+degraded+remapped, last acting [49,51]
    pg 3.1f is stuck undersized for 64517.426392, current state active+recovery_unfound+undersized+degraded+remapped, last acting [48,51]
AUTH_INSECURE_GLOBAL_ID_RECLAIM client is using insecure global_id reclaim
    client.admin at 172.26.8.154:0/3925203408 is using insecure global_id reclaim
    mds.scvirt04 at [v2:172.26.8.154:6836/3778505565,v1:172.26.8.154:6837/3778505565] is using insecure global_id reclaim
AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED mons are allowing insecure global_id reclaim
    mon.scvirt03 has auth_allow_insecure_global_id_reclaim set to true
    mon.scvirt06 has auth_allow_insecure_global_id_reclaim set to true
    mon.scvirt01 has auth_allow_insecure_global_id_reclaim set to true
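If I read the 14.2.20 release notes correctly, the global_id warnings are expected after this upgrade; once every client and daemon has been updated, the insecure behaviour can be disallowed with:

# only run this after all clients are upgraded, or old clients will be locked out
ceph config set mon auth_allow_insecure_global_id_reclaim false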
ceph osd tree:
ID  CLASS WEIGHT    TYPE NAME         STATUS REWEIGHT PRI-AFF
 -1       284.51312 root default
 -2        48.75215     host scvirt01
  0   hdd   9.09560         osd.0         up  1.00000 1.00000
  3   hdd   9.09560         osd.3         up  1.00000 1.00000
  6   hdd   9.09560         osd.6         up  1.00000 1.00000
  9   hdd   9.09560         osd.9         up  1.00000 1.00000
 12   hdd   9.09560         osd.12        up  1.00000 1.00000
 42  nvme   0.97029         osd.42        up  1.00000 1.00000
 43  nvme   0.97029         osd.43        up  1.00000 1.00000
 44  nvme   0.97029         osd.44        up  1.00000 1.00000
 37   ssd   0.36330         osd.37        up  1.00000 1.00000
 -3        48.75215     host scvirt02
  1   hdd   9.09560         osd.1         up  1.00000 1.00000
  4   hdd   9.09560         osd.4         up  1.00000 1.00000
  7   hdd   9.09560         osd.7         up  1.00000 1.00000
 10   hdd   9.09560         osd.10        up  1.00000 1.00000
 13   hdd   9.09560         osd.13        up  1.00000 1.00000
 45  nvme   0.97029         osd.45        up  1.00000 1.00000
 46  nvme   0.97029         osd.46        up  1.00000 1.00000
 47  nvme   0.97029         osd.47        up  1.00000 1.00000
 38   ssd   0.36330         osd.38        up  1.00000 1.00000
 -4        48.75224     host scvirt03
  2   hdd   9.09569         osd.2         up  1.00000 1.00000
  5   hdd   9.09560         osd.5         up  1.00000 1.00000
  8   hdd   9.09560         osd.8         up  1.00000 1.00000
 11   hdd   9.09560         osd.11        up  1.00000 1.00000
 14   hdd   9.09560         osd.14        up  1.00000 1.00000
 48  nvme   0.97029         osd.48        up  1.00000 1.00000
 49  nvme   0.97029         osd.49        up  1.00000 1.00000
 50  nvme   0.97029         osd.50        up  1.00000 1.00000
 39   ssd   0.36330         osd.39        up  1.00000 1.00000
 -9        56.75706     host scvirt04
 15   hdd   9.09560         osd.15        up  1.00000 1.00000
 17   hdd   9.09560         osd.17        up  1.00000 1.00000
 20   hdd   9.09560         osd.20        up  1.00000 1.00000
 22   hdd   9.09560         osd.22        up  1.00000 1.00000
 23   hdd   9.09560         osd.23        up  1.00000 1.00000
 25   hdd   3.63860         osd.25        up  1.00000 1.00000
 26   hdd   3.63860         osd.26        up  1.00000 1.00000
 27   hdd   3.63860         osd.27        up  1.00000 1.00000
 40   ssd   0.36330         osd.40        up  1.00000 1.00000
-11        56.75706     host scvirt05
 16   hdd   9.09560         osd.16        up  1.00000 1.00000
 18   hdd   9.09560         osd.18        up  1.00000 1.00000
 19   hdd   9.09560         osd.19        up  1.00000 1.00000
 21   hdd   9.09560         osd.21        up  1.00000 1.00000
 24   hdd   9.09560         osd.24        up  1.00000 1.00000
 28   hdd   3.63860         osd.28        up  1.00000 1.00000
 29   hdd   3.63860         osd.29        up  1.00000 1.00000
 30   hdd   3.63860         osd.30        up  1.00000 1.00000
 41   ssd   0.36330         osd.41        up  1.00000 1.00000
-13        24.74245     host scvirt06
 31   hdd   3.63860         osd.31        up  1.00000 1.00000
 32   hdd   3.63860         osd.32        up  1.00000 1.00000
 33   hdd   3.63860         osd.33        up  1.00000 1.00000
 34   hdd   3.63860         osd.34        up  1.00000 1.00000
 35   hdd   3.63860         osd.35        up  1.00000 1.00000
 36   hdd   3.63860         osd.36        up  1.00000 1.00000
 51  nvme   0.97029         osd.51        up  1.00000 1.00000
 52  nvme   0.97029         osd.52        up  1.00000 1.00000
 53  nvme   0.97029         osd.53        up  1.00000 1.00000
Regards,
Vadim
--
Vadim Bulst
Universität Leipzig / URZ
04109 Leipzig, Augustusplatz 10
phone: +49-341-97-33380
mail: vadim.bulst@xxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx