Hello all,
I have a Ceph Luminous setup with both filestore and bluestore OSDs. The cluster was initially deployed as Hammer, then upgraded to Jewel and eventually to Luminous. It is heterogeneous: we have SSDs, SAS 15K HDDs and 7.2K HDDs in it (see the attached crush map). Earlier I converted the 7.2K HDD OSDs from filestore to bluestore without any problems (the per-OSD procedure is sketched after the pg dump_stuck output below). After converting two SSDs from filestore to bluestore, I ended up with the following warning:
cluster:
id: 089d3673-5607-404d-9351-2d4004043966
health: HEALTH_WARN
Degraded data redundancy: 12566/4361616 objects degraded (0.288%), 6 pgs unclean,
6 pgs degraded, 6 pgs undersized
10 slow requests are blocked > 32 sec
services:
mon: 3 daemons, quorum 2,1,0
mgr: tw-dwt-prx-03(active), standbys: tw-dwt-prx-05, tw-dwt-prx-07
osd: 92 osds: 92 up, 92 in; 6 remapped pgs
data:
pools: 3 pools, 1024 pgs
objects: 1419k objects, 5676 GB
usage: 17077 GB used, 264 TB / 280 TB avail
pgs: 12566/4361616 objects degraded (0.288%)
1018 active+clean
4 active+undersized+degraded+remapped+backfill_wait
2 active+undersized+degraded+remapped+backfilling
io:
client: 1567 kB/s rd, 2274 kB/s wr, 67 op/s rd, 186 op/s wr
# rados df
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
sas_sata 556G 142574 0 427722 0 0 0 48972431 478G 207803733 3035G
sata_only 1939M 491 0 1473 0 0 0 3302 5003k 17170 2108M
ssd_sata 5119G 1311028 0 3933084 0 0 12549 46982011 2474G 620926839 24962G
total_objects 1454093
total_used 17080G
total_avail 264T
total_space 280T
# ceph pg dump_stuck
ok
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
22.ac active+undersized+degraded+remapped+backfilling [6,28,62] 6 [28,62] 28
22.85 active+undersized+degraded+remapped+backfilling [7,43,62] 7 [43,62] 43
22.146 active+undersized+degraded+remapped+backfill_wait [7,48,46] 7 [46,48] 46
22.4f active+undersized+degraded+remapped+backfill_wait [7,59,58] 7 [58,59] 58
22.d8 active+undersized+degraded+remapped+backfill_wait [7,48,46] 7 [46,48] 46
22.60 active+undersized+degraded+remapped+backfill_wait [7,50,34] 7 [34,50] 34
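For reference, the per-OSD conversion I mention above is roughly the bluestore migration flow documented for Luminous; the osd id and device below are placeholders, not actual members of this cluster:

ceph osd out 12                                  # stop mapping new data to the OSD
systemctl stop ceph-osd@12                       # stop the daemon once it is safe to do so
umount /var/lib/ceph/osd/ceph-12                 # unmount the old filestore data directory
ceph osd destroy 12 --yes-i-really-mean-it       # mark it destroyed but keep the osd id
ceph-volume lvm zap /dev/sdX                     # wipe the old filestore contents
ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 12   # re-create the same id as bluestore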
The pool I have a problem with has replicas on SSDs and 7.2K HDDs, with primary affinity set to 1 for the SSD OSDs and 0 for the HDD OSDs.
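Primary affinity was set with the standard command, e.g. (the osd ids here are only illustrative):

ceph osd primary-affinity osd.6 1.0              # SSD OSDs keep the primary role
ceph osd primary-affinity osd.62 0               # avoid choosing this HDD OSD as primary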
All clients eventually ceased to operate; recovery speed is 1-2 objects per minute (at 1 object per minute, the ~12,500 degraded objects would take more than a week to recover). The other pools work fine.
How can I speed up the recovery process?
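Would bumping the usual backfill/recovery throttles be the right approach here, e.g. something like the following (example values only, injected at runtime):

ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 8'   # defaults in Luminous are 1 and 3
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0'                          # Luminous defaults this to 0.1 s

or is there something else that makes recovery this slow?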
Thank you,