Hi,
The flush from the overlay cache (cache tier) in front of my EC-based CephFS data pool is very slow, as are all other operations on the CephFS. The flush speeds up as soon as the MDS is stopped.
I suspect this is caused by a large number of files that were deleted all at once, but I'm not sure how to verify that. Are there any counters I can look at that show something like "pending deletions"? How else can I debug this?
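For reference, these are the kinds of counters I have been trying to read; I'm guessing that stray/purge-related counters in the MDS perf dump and the MDS objecter queue are the right place to look, but I'm not sure about the exact names:

# on the MDS host: dump perf counters and grep for stray/purge-queue stats
ceph daemon mds.lxt-prod-ceph-mds01 perf dump | grep -Ei 'stray|purge'

# also on the MDS host: RADOS requests the MDS currently has in flight
ceph daemon mds.lxt-prod-ceph-mds01 objecter_requests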
Any insight is very much appreciated.
Philip
(potentially helpful debug output follows)
status:
root@lxt-prod-ceph-mon02:~# ceph -s
  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            102 slow requests are blocked > 32 sec

  services:
    mon: 2 daemons, quorum lxt-prod-ceph-mon01,lxt-prod-ceph-mon02
    mgr: lxt-prod-ceph-mon02(active), standbys: lxt-prod-ceph-mon01
    mds: plexfs-1/1/1 up {0=lxt-prod-ceph-mds01=up:active}
    osd: 13 osds: 7 up, 7 in
         flags noscrub,nodeep-scrub

  data:
    pools:   3 pools, 536 pgs
    objects: 5431k objects, 21056 GB
    usage:   28442 GB used, 5319 GB / 33761 GB avail
    pgs:     536 active+clean

  io:
    client: 687 kB/s wr, 0 op/s rd, 9 op/s wr
    cache:  345 kB/s flush
(Throughput is currently in the kilobyte to low-megabyte range, but can reach around 100 MB/s under healthy conditions.)
health:
root@lxt-prod-ceph-mon02:~# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 105 slow requests are blocked > 32 sec
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
REQUEST_SLOW 105 slow requests are blocked > 32 sec
    45 ops are blocked > 262.144 sec
    29 ops are blocked > 131.072 sec
    20 ops are blocked > 65.536 sec
    11 ops are blocked > 32.768 sec
    osds 1,7 have blocked requests > 262.144 sec
(All OSDs show high system load but little iowait; CephFS and the cache flush usually perform much better under the same conditions.)
pool configuration:
root@lxt-prod-ceph-mon02:~# ceph osd pool ls detail
pool 6 'cephfs-metadata' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 12515 lfor 0/12412 flags hashpspool stripe_width 0 application cephfs
pool 9 'cephfs-data' erasure size 4 min_size 3 crush_rule 4 object_hash rjenkins pg_num 512 pgp_num 512 last_change 12482 lfor 12481/12481 flags hashpspool crash_replay_interval 45 tiers 17 read_tier 17 write_tier 17 stripe_width 4128 application cephfs
pool 17 'cephfs-cache' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 12553 lfor 12481/12481 flags hashpspool,incomplete_clones,noscrub,nodeep-scrub tier_of 9 cache_mode writeback target_bytes 200000000000 target_objects 150000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 180s x1 decay_rate 20 search_last_n 1 min_write_recency_for_promote 1 stripe_width 0
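In case it helps, I have also been watching the object and dirty-object counts of the cache pool; if I read the output correctly, ceph df detail has a DIRTY column for tiered pools and rados df gives per-pool object counts:

# dirty object count of the cache pool (DIRTY column)
ceph df detail

# per-pool object counts and client ops
rados df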
Metadata and cache pools are both on the same SSD OSD:
root@lxt-prod-ceph-mon02:~# ceph osd crush tree
ID CLASS WEIGHT TYPE NAME
-5 0.24399 root ssd
7 ssd 0.24399 osd.7
-4 5.74399 host sinnlich
6 hdd 5.50000 osd.6
7 ssd 0.24399 osd.7
-1 38.40399 root hdd
-2 16.45799 host hn-lxt-ceph01
1 hdd 5.50000 osd.1
9 hdd 5.50000 osd.9
12 hdd 5.50000 osd.12
-3 16.44600 host hn-lxt-ceph02
2 hdd 5.50000 osd.2
3 hdd 5.50000 osd.3
4 hdd 2.72299 osd.4
5 hdd 2.72299 osd.5
6 hdd 5.50000 osd.6
cache tier settings:
root@lxt-prod-ceph-mon02:~# ceph osd pool get cephfs-cache all
size: 1
min_size: 1
crash_replay_interval: 0
pg_num: 8
pgp_num: 8
crush_rule: replicated_ruleset
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: true
nodeep-scrub: true
hit_set_type: bloom
hit_set_period: 180
hit_set_count: 1
hit_set_fpp: 0.05
use_gmt_hitset: 1
auid: 0
target_max_objects: 150000
target_max_bytes: 200000000000
cache_target_dirty_ratio: 0.01
cache_target_dirty_high_ratio: 0.1
cache_target_full_ratio: 0.8
cache_min_flush_age: 60
cache_min_evict_age: 0
min_read_recency_for_promote: 0
min_write_recency_for_promote: 1
fast_read: 0
hit_set_grade_decay_rate: 20
hit_set_search_last_n: 1
(I'm not sure the values make much sense; I copied them from online examples and adapted them only minimally, if at all.)
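For completeness, the tier was set up with commands roughly along these lines (reconstructed from the values above rather than from my shell history, so the exact invocations may differ):

ceph osd tier add cephfs-data cephfs-cache
ceph osd tier cache-mode cephfs-cache writeback
ceph osd tier set-overlay cephfs-data cephfs-cache
ceph osd pool set cephfs-cache hit_set_type bloom
ceph osd pool set cephfs-cache hit_set_period 180
ceph osd pool set cephfs-cache target_max_bytes 200000000000
ceph osd pool set cephfs-cache target_max_objects 150000
ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.01
ceph osd pool set cephfs-cache cache_target_dirty_high_ratio 0.1
ceph osd pool set cephfs-cache cache_target_full_ratio 0.8
ceph osd pool set cephfs-cache cache_min_flush_age 60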
The MDS shows no ops in flight, but the SSD OSD shows a lot of operations that appear to be stuck (all with the same event timeline, stopping at reached_pg):
root@sinnlich:~# ceph daemon osd.7 dump_ops_in_flight| head -30
{
    "ops": [
        {
            "description": "osd_op(mds.0.3479:170284 17.1 17:98fc84de:::100002a830d.00000000:head [delete] snapc 1=[] ondisk+write+known_if_redirected+full_force e12553)",
            "initiated_at": "2018-05-23 21:27:00.140552",
            "age": 47.611064,
            "duration": 47.611077,
            "type_data": {
                "flag_point": "reached pg",
                "events": [
                    {
                        "time": "2018-05-23 21:27:00.140552",
                        "event": "initiated"
                    },
                    {
                        "time": "2018-05-23 21:27:00.140600",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2018-05-23 21:27:00.140624",
                        "event": "reached_pg"
                    }
                ]
            }
        },
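To get a rough count of how many of the in-flight ops on osd.7 are deletes, I filter the dump like this (assuming jq is available; the filter just matches "[delete]" in the op description):

# count in-flight ops on osd.7 whose description contains "[delete]" (run on the OSD host)
ceph daemon osd.7 dump_ops_in_flight \
  | jq '[.ops[] | select(.description | contains("[delete]"))] | length'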