Hi Piotr,
I'm wondering what the device layout behind the OSD/BlueStore is. Do you
have a fast (SSD or NVMe) standalone device for DB/WAL in front of the slow OSDs?
Were the OSDs in question deployed under Nautilus, or do they originate
from earlier Ceph releases?
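If you're not sure, the OSD metadata should tell you. A quick check (just a sketch, assuming osd.62 and that the field names haven't changed in your release) would be something like:

ceph osd metadata 62 | grep -E 'rotational|bluefs|devices'

which should show whether the OSD has a dedicated DB device and whether that device is rotational.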
We've received multiple reports from our customers about slow omap
operations on different clusters. In every case there was rotational-only
(!) media behind the OSD and an RGW payload, i.e. a slow DB device and a large
amount of metadata going to RocksDB. The behavior was most likely
triggered by a prior massive data removal.
It looks like RocksDB behind BlueStore has some performance gaps when
dealing with a highly fragmented DB on slow media. Nautilus has
some improvements in this respect, but chances are they aren't
enough.
Manual RocksDB compaction seems to be a short-term workaround if the above applies. The general advice would be to have a faster device for the DB.
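If you go the compaction route, one way to do it (a sketch only; take the OSD down first and adjust the id/path to your environment) is an offline compaction via ceph-kvstore-tool:

systemctl stop ceph-osd@62
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-62 compact
systemctl start ceph-osd@62

On a large, fragmented DB sitting on spinning media this can take quite a while, so expect the OSD to stay down for the duration.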
Thanks,
Igor
PS. The above doesn't mean that RGW is free of related issues
and/or is absolutely optimal. Most probably it has some
areas for improvement which might resolve the issue as well, which
would be an alternative approach to the fix. But I'm not
aware of any specifics.
PS 2. Potentially the same RocksDB slowdown might also occur if the OSD has a large DB data spillover from the fast DB device to the slow main one. Nautilus has corresponding alerts available via the ceph -s / ceph health commands.
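Ruling that out is cheap; if I remember the health code correctly it is BLUEFS_SPILLOVER, so something along these lines should surface it (no output would mean no spillover was detected):

ceph health detail | grep -i -A 2 spillover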
Hi All,
I am observing strange behaviour on one OSD, which is dedicated to the RGW index (pool: default.rgw.buckets.index).
My ceph is: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
The log files contain many messages like these:
2020-01-08 10:37:53.054 7f1fa22dd700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 22.4641s
...
2020-01-08 14:57:06.987 7f1fa22dd700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 28.9336s, after = <80>0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 15:17:39.972 7f1fa22dd700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 23.9168s, after = <80>0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 15:37:44.221 7f1fa62e5700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 25.8316s, after = <80>0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 15:58:12.279 7f1fa62e5700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 27.8207s, after = <80>0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
I discovered that these slow operations are associated with bilog GC (it runs every 30 minutes).
# /etc/ceph/ceph.conf
rgw sync log trim interval = 1200
ID of bucket bucket_1: 12307ed8-2342-4293-9f14-12234220a234.13221775.224244
Shard no. 31 of the bucket_1 index: .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
[root@mon-1 ~]# ceph osd map default.rgw.buckets.index .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
osdmap e23417 pool 'default.rgw.buckets.index' (230) object '.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31' -> pg 230.9ffcaf6c (230.2c) -> up ([62,56,60], p62) acting ([62,56,60], p62)
OSD: 62,56,60
Problematic OSD: 62
Problematic PG: 230.2c
Digging deeper, I checked what happens when I run the command:
[root@mon-1 ~]# radosgw-admin bilog list --bucket bucket_1
[]
Even when there are no bilogs, the same symptoms appear:
1) In the logs:
2020-01-08 16:31:24.728 7f1fa22dd700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 28.653s, after = <80>0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2) Any other radosgw-admin command is very slow while another process is iterating through the bilogs of this bucket:
[dc-2 root@mon-1 ~]# time radosgw-admin bucket stats --bucket=bucket234
...
real 0m25.641s
user 0m0.148s
sys 0m0.075s
When I try to run a deep scrub on PG 230.2c, the following appears in the logs:
[root@mon-1 ~]# ceph pg deep-scrub 230.2c
2020-01-08 16:38:32.551 7f1fc50f0700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
...
2020-01-08 16:38:46.735 7f1fc60f2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
2020-01-08 16:38:46.735 7f1fc60f2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
2020-01-08 16:38:46.735 7f1fc60f2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
2020-01-08 16:38:46.911 7f1fa22dd700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 29.8655s, after = <80>0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 16:38:46.911 7f1fa22dd700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
2020-01-08 16:38:46.911 7f1fa62e5700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
When I iterate over the omap keys (listomapkeys) for .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:
[root@mon-1 ~]# rados -p default.rgw.buckets.index listomapkeys .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
In the logs I observe:
2020-01-09 11:08:58.847 7f1fa62e5700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 24.1949s
The same happens for listomapvals:
[root@mon-1 ~]# rados -p default.rgw.buckets.index listomapvals .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
2020-01-09 11:11:36.508 7f1fa62e5700 0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 28.2137s
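For reference, one way to gauge how large this index shard actually is would be to count its omap keys, e.g. (I'm not pasting a number here, and on this OSD even this takes minutes):

[root@mon-1 ~]# rados -p default.rgw.buckets.index listomapkeys .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31 | wc -l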
I tried to purge and recreate this OSD, and I also trimmed the bilogs for this bucket (radosgw-admin bilog trim --bucket bucket_1), but this didn't help.
I suspect that any operation on PG 230.2c on OSD 62 can cause slow operations on my cluster.
I have no idea how to debug/resolve this problem. Any help/hint is appreciated!
Thanks,
Piotr
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx