Hi,
The flush from the overlay cache (cache tier) in front of my EC-based CephFS data pool is very slow, as are all other operations on the CephFS. The flush speeds up as soon as the MDS is stopped.
I suspect this is caused by a large number of files that were deleted all at once, but I'm not sure how to verify that. Are there any counters I can look at that show something like "pending deletions"? How else can I debug this?
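For reference, these are the kinds of counters I have been trying to read; I'm guessing that stray/purge-related counters in the MDS perf dump and the MDS objecter queue are the right place to look, but I'm not sure about the exact names:

# on the MDS host: dump perf counters and grep for stray/purge-queue stats
ceph daemon mds.lxt-prod-ceph-mds01 perf dump | grep -Ei 'stray|purge'

# also on the MDS host: RADOS requests the MDS currently has in flight
ceph daemon mds.lxt-prod-ceph-mds01 objecter_requests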
Any insight is very much appreciated.
Philip
(potentially helpful debug output follows)
status:
root@lxt-prod-ceph-mon02:~# ceph -s
  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            102 slow requests are blocked > 32 sec

  services:
    mon: 2 daemons, quorum lxt-prod-ceph-mon01,lxt-prod-ceph-mon02
    mgr: lxt-prod-ceph-mon02(active), standbys: lxt-prod-ceph-mon01
    mds: plexfs-1/1/1 up {0=lxt-prod-ceph-mds01=up:active}
    osd: 13 osds: 7 up, 7 in
         flags noscrub,nodeep-scrub

  data:
    pools:   3 pools, 536 pgs
    objects: 5431k objects, 21056 GB
    usage:   28442 GB used, 5319 GB / 33761 GB avail
    pgs:     536 active+clean

  io:
    client: 687 kB/s wr, 0 op/s rd, 9 op/s wr
    cache:  345 kB/s flush
(Throughput is currently in the kilobyte to low-megabyte range, but can reach around 100 MB/s under healthy conditions.)
health:
root@lxt-prod-ceph-mon02:~# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 105 slow requests are blocked > 32 sec
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
REQUEST_SLOW 105 slow requests are blocked > 32 sec
    45 ops are blocked > 262.144 sec
    29 ops are blocked > 131.072 sec
    20 ops are blocked > 65.536 sec
    11 ops are blocked > 32.768 sec
    osds 1,7 have blocked requests > 262.144 sec
(All OSDs show high system load but little iowait; CephFS and the cache flush usually perform much better under the same conditions.)
pool configuration:
root@lxt-prod-ceph-mon02:~# ceph osd pool ls detail
pool 6 'cephfs-metadata' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 12515 lfor 0/12412 flags hashpspool stripe_width 0 application cephfs
pool 9 'cephfs-data' erasure size 4 min_size 3 crush_rule 4 object_hash rjenkins pg_num 512 pgp_num 512 last_change 12482 lfor 12481/12481 flags hashpspool crash_replay_interval 45 tiers 17 read_tier 17 write_tier 17 stripe_width 4128 application cephfs
pool 17 'cephfs-cache' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 12553 lfor 12481/12481 flags hashpspool,incomplete_clones,noscrub,nodeep-scrub tier_of 9 cache_mode writeback target_bytes 200000000000 target_objects 150000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 180s x1 decay_rate 20 search_last_n 1 min_write_recency_for_promote 1 stripe_width 0
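In case it helps, I have also been watching the object and dirty-object counts of the cache pool; if I read the output correctly, ceph df detail has a DIRTY column for tiered pools and rados df gives per-pool object counts:

# dirty object count of the cache pool (DIRTY column)
ceph df detail

# per-pool object counts and client ops
rados df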
Metadata and cache pools are both on the same SSD OSD:
root@lxt-prod-ceph-mon02:~# ceph osd crush tree
ID CLASS WEIGHT TYPE NAME
-5 0.24399 root ssd
7 ssd 0.24399 osd.7
-4 5.74399 host sinnlich
6 hdd 5.50000 osd.6
7 ssd 0.24399 osd.7
-1 38.40399 root hdd
-2 16.45799 host hn-lxt-ceph01
1 hdd 5.50000 osd.1
9 hdd 5.50000 osd.9
12 hdd 5.50000 osd.12
-3 16.44600 host hn-lxt-ceph02
2 hdd 5.50000 osd.2
3 hdd 5.50000 osd.3
4 hdd 2.72299 osd.4
5 hdd 2.72299 osd.5
6 hdd 5.50000 osd.6
cache tier settings:
root@lxt-prod-ceph-mon02:~# ceph osd pool get cephfs-cache all
size: 1
min_size: 1
crash_replay_interval: 0
pg_num: 8
pgp_num: 8
crush_rule: replicated_ruleset
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: true
nodeep-scrub: true
hit_set_type: bloom
hit_set_period: 180
hit_set_count: 1
hit_set_fpp: 0.05
use_gmt_hitset: 1
auid: 0
target_max_objects: 150000
target_max_bytes: 200000000000
cache_target_dirty_ratio: 0.01
cache_target_dirty_high_ratio: 0.1
cache_target_full_ratio: 0.8
cache_min_flush_age: 60
cache_min_evict_age: 0
min_read_recency_for_promote: 0
min_write_recency_for_promote: 1
fast_read: 0
hit_set_grade_decay_rate: 20
hit_set_search_last_n: 1
(I'm not sure the values make much sense; I copied them from online examples and adapted them only minimally, if at all.)
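For completeness, the tier was set up with commands roughly along these lines (reconstructed from the values above rather than from my shell history, so the exact invocations may differ):

ceph osd tier add cephfs-data cephfs-cache
ceph osd tier cache-mode cephfs-cache writeback
ceph osd tier set-overlay cephfs-data cephfs-cache
ceph osd pool set cephfs-cache hit_set_type bloom
ceph osd pool set cephfs-cache hit_set_period 180
ceph osd pool set cephfs-cache target_max_bytes 200000000000
ceph osd pool set cephfs-cache target_max_objects 150000
ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.01
ceph osd pool set cephfs-cache cache_target_dirty_high_ratio 0.1
ceph osd pool set cephfs-cache cache_target_full_ratio 0.8
ceph osd pool set cephfs-cache cache_min_flush_age 60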
The MDS shows no ops in flight, but the SSD OSD shows a lot of operations that appear to be stuck (all with the same event timeline, stopping at reached_pg):
root@sinnlich:~# ceph daemon osd.7 dump_ops_in_flight| head -30
{
    "ops": [
        {
            "description": "osd_op(mds.0.3479:170284 17.1 17:98fc84de:::100002a830d.00000000:head [delete] snapc 1=[] ondisk+write+known_if_redirected+full_force e12553)",
            "initiated_at": "2018-05-23 21:27:00.140552",
            "age": 47.611064,
            "duration": 47.611077,
            "type_data": {
                "flag_point": "reached pg",
                "events": [
                    {
                        "time": "2018-05-23 21:27:00.140552",
                        "event": "initiated"
                    },
                    {
                        "time": "2018-05-23 21:27:00.140600",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2018-05-23 21:27:00.140624",
                        "event": "reached_pg"
                    }
                ]
            }
        },
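To get a rough count of how many of the in-flight ops on osd.7 are deletes, I filter the dump like this (assuming jq is available; the filter just matches "[delete]" in the op description):

# count in-flight ops on osd.7 whose description contains "[delete]" (run on the OSD host)
ceph daemon osd.7 dump_ops_in_flight \
  | jq '[.ops[] | select(.description | contains("[delete]"))] | length'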