Re: Slow Operations observed on one OSD (dedicated for RGW indexes), caused by problematic Placement Group

Hi Casey, Igor,
thank you for your responses.


Casey:

This cluster is the secondary zone in a multi-site setup (the placement pools have been left at the defaults).

[root@mon-1 ~]# radosgw-admin zone get
{
    "id": "96b9adcb-fdc0-4f36-a310-89347b74ea67",
    "name": "secondary_zone",
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "default.rgw.buckets.index",


Igor:

All of the OSDs are backed by BlueStore. There are NVMe disks dedicated to the index pool. I don't use separate DB and WAL devices.
Creation command: ceph-volume --cluster ceph lvm create --bluestore --data /dev/nvme0n1p9

These OSDs were deployed on Mimic (v13.2.5).
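(Side note: to double-check the BlueStore layout of a given OSD, its metadata can be dumped; this is a rough sketch and the exact field names may vary between releases:)

[root@mon-1 ~]# ceph osd metadata 62 | grep -E 'bluefs_dedicated|bluestore_bdev|rotational|devices'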

Last Friday I upgraded the cluster to 14.2.6, and I purged and recreated this OSD. I don't see this problem anymore.
I hope this helped.
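(For reference, roughly the sequence I used, sketched from memory and assuming osd.62 backed by /dev/nvme0n1p9; adjust IDs and devices for your environment:)

[root@mon-1 ~]# ceph osd out 62
[root@osd-n57 ~]# systemctl stop ceph-osd@62
[root@mon-1 ~]# ceph osd purge 62 --yes-i-really-mean-it
[root@osd-n57 ~]# ceph-volume lvm zap --destroy /dev/nvme0n1p9
[root@osd-n57 ~]# ceph-volume --cluster ceph lvm create --bluestore --data /dev/nvme0n1p9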




Below I paste the results of debugging the problem:


1) listing omap keys
[root@mon-1 ~]# time rados -p default.rgw.buckets.index listomapkeys .dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31 > omapkeys.list

real 0m29.475s
user 0m0.123s
sys 0m0.131s
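(As a quick follow-up check on whether this shard is oversized, the dumped keys can simply be counted:)

[root@mon-1 ~]# wc -l omapkeys.list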

[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log |grep "slow operation "
2020-01-10 09:50:36.274 7f1fa22dd700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 29.0901s


2) listing omap vals
[dc-2 root@mon-1 tmp]# time rados -p default.rgw.buckets.index listomapvals .dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31 > omapvals.list

real 0m29.080s
user 0m2.801s
sys 0m0.270s

[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log |grep "slow operation "
2020-01-10 09:52:37.164 7f1fa22dd700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 25.6445s


3) listing bilogs for this bucket (ID: 99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244)
[root@mon-1 ~]# radosgw-admin bilog list --bucket bucket_1  #(the same effect)

[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log |grep "slow operation "
2020-01-10 10:01:27.896 7f1fa62e5700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 26.7958s, after = �0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31:head#)
2020-01-10 10:01:28.981 7f1fbfb18700 -1 osd.62 23457 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.200549558.0:308265358 230.2c 230:35a96e8b:::.dir.99a07ed8-2112-429b-9f94-81383220a95b.49439529.1.85:head [call rgw.bi_log_trim] snapc 0=[] ondisk+write+known_if_redirected e23457)

4) bilog GC, while trimming bucket index logs from this bucket shard: .dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31 (the same effect as above)

I verified that many operations in the cluster are blocked while some other process is touching the omap of shard .dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31.
For example, an s3cmd listing:


First scenario: list objects from a random bucket using s3cmd.

[root@po:~ ]$ time s3cmd -c .s3cfg.n123 --no-ssl ls s3://bucket_n123/ --limit 1 -r
2019-09-10 10:38    511600   s3://bucket_n123/BnxKLbY2mN7Jw5b4pdkiU.jpg

real 0m0,244s
user 0m0,142s
sys 0m0,041s



Second scenario: list objects from a random bucket using s3cmd while, at the same time, listing the omap keys of .dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31:

[root@mon-1 ~]# time rados -p default.rgw.buckets.index listomapkeys .dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31

[root@po:~ ]$ time s3cmd -c .s3cfg.n123 --no-ssl ls s3://bucket_n123/ --limit 1 -r
2019-09-10 10:38    511600   s3://bucket_n123/BnxKLbY2mN7Jw5b4pdkiU.jpg

real 0m25,802s
user 0m0,211s
sys 0m0,040s

[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log |grep "slow operation "
2020-01-10 10:08:40.942 7f1fbfb18700 -1 osd.62 23480 get_health_metrics reporting 60 slow ops, oldest is osd_op(client.200549558.0:313502542 230.2c 230:34aaa3d7:::.dir.99a07ed8-2112-429b-9f94-81383220a95b.51788790.1.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_prepare_op] snapc 0=[] ondisk+write+known_if_redirected e23480)

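(To watch these blocked requests cluster-wide while reproducing, the standard health commands can be used, for example:)

[root@mon-1 ~]# ceph -s
[root@mon-1 ~]# ceph health detail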

Thanks,
Piotr


On Fri, 10 Jan 2020 at 15:30, Igor Fedotov <ifedotov@xxxxxxx> wrote:

Hi Piotr,

I'm wondering what the device layout behind the OSD/BlueStore is. Do you have a fast (SSD or NVMe) standalone device for DB/WAL on the slow OSDs? Were the OSDs in question deployed on Nautilus, or do they originate from earlier Ceph releases?

We've had multiple reports of slow omap operations on different clusters from our customers. In every case the OSDs had rotational-only (!) media behind them and an RGW payload, i.e. a slow DB device and a large amount of metadata going to RocksDB, and the behavior was most likely triggered by a massive prior data removal. It looks like RocksDB behind BlueStore has some performance gaps when dealing with a highly fragmented DB on slow media. Nautilus has some improvements in this respect, but chances are they aren't enough.

Manual RocksDB compaction seems to be a short-term workaround if the above applies. The general advice would be to use a faster device for the DB.
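(As a sketch, assuming osd.62 is the affected OSD: compaction can be triggered online via the admin socket on the OSD host, or offline with the OSD stopped:)

ceph daemon osd.62 compact
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-62 compact   # offline, OSD must be stopped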


Thanks,

Igor

PS. The above doesn't mean that RGW is free of related issues and/or is absolutely optimal. Most probably it has some areas for improvement that might resolve the issue as well, which would be an alternative approach to a fix, but I'm not aware of them.

PS 2. Potentially the same RocksDB slowdown might also occur if an OSD has a large DB data spillover from the fast DB device to the slow main one. Nautilus has corresponding alerts available via the ceph -s / ceph health commands.
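(If spillover is present it should show up in the health output, for example something along the lines of:)

ceph health detail | grep -i spillover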


On 1/9/2020 6:33 PM, P. O. wrote:
Hi All,

I am observing strange behaviour on one OSD, which is dedicated to the RGW index (pool: default.rgw.buckets.index).
My Ceph version is: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)

The log files contain many messages like the ones below:
2020-01-08 10:37:53.054 7f1fa22dd700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 22.4641s
...
2020-01-08 14:57:06.987 7f1fa22dd700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 28.9336s, after = �0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 15:17:39.972 7f1fa22dd700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 23.9168s, after = �0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 15:37:44.221 7f1fa62e5700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 25.8316s, after = �0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 15:58:12.279 7f1fa62e5700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 27.8207s, after = �0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)

I discovered that these latencies are associated with bilog GC (it runs every 30 minutes).

# /etc/ceph/ceph.conf
rgw sync log trim interval = 1200
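(To confirm the effective value on a running RGW instance, the admin socket can be queried; the socket name below is only an example and varies per host:)

[root@rgw-1 ~]# ceph daemon /var/run/ceph/ceph-client.rgw.rgw-1.asok config get rgw_sync_log_trim_interval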


ID of bucket bucket_1: 12307ed8-2342-4293-9f14-12234220a234.13221775.224244
Shard no. 31 of the bucket_1 index: .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31

[root@mon-1 ~]# ceph osd map default.rgw.buckets.index .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
osdmap e23417 pool 'default.rgw.buckets.index' (230) object '.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31' -> pg 230.9ffcaf6c (230.2c) -> up ([62,56,60], p62) acting ([62,56,60], p62)

OSD: 62,56,60
Problematic OSD: 62
Problematic PG: 230.2c
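(Two further checks that seem useful here, as a sketch: whether the cluster reports a large omap object for this shard, and the per-PG details:)

[root@mon-1 ~]# ceph health detail | grep -i 'large omap'
[root@mon-1 ~]# ceph pg 230.2c query | less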


Digging deeper...

I checked what happens when I run the command:
[root@mon-1 ~]# radosgw-admin bilog list --bucket bucket_1
[]

Even though there are no bilogs, the same symptoms appear:

1) In logs:
2020-01-08 16:31:24.728 7f1fa22dd700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 28.653s, after = �0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)

2) Any other radosgw-admin command is very slow while another process is iterating through the bilogs of this bucket.

[dc-2 root@mon-1 ~]# time radosgw-admin bucket stats --bucket=bucket234
...
real 0m25.641s
user 0m0.148s
sys 0m0.075s

When I try to run a deep scrub on this PG (230.2c), the following appears in the logs:

[root@mon-1 ~]# ceph pg deep-scrub 230.2c

2020-01-08 16:38:32.551 7f1fc50f0700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
...
2020-01-08 16:38:46.735 7f1fc60f2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
2020-01-08 16:38:46.735 7f1fc60f2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
2020-01-08 16:38:46.735 7f1fc60f2700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
2020-01-08 16:38:46.911 7f1fa22dd700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed for upper_bound, latency = 29.8655s, after = <80>0_ omap_iterator(cid = 230.2c_head, oid = #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
2020-01-08 16:38:46.911 7f1fa22dd700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
2020-01-08 16:38:46.911 7f1fa62e5700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15



When I try to iterate over the omap keys (listomapkeys) of .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:
[root@mon-1 ~]# rados -p default.rgw.buckets.index listomapkeys .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
I observe in the logs:

2020-01-09 11:08:58.847 7f1fa62e5700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 24.1949s

The same happens for listomapvals:

[root@mon-1 ~]# rados -p default.rgw.buckets.index listomapvals .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31

2020-01-09 11:11:36.508 7f1fa62e5700  0 bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed for next, latency = 28.2137s


I tried purging and recreating this OSD, and I also trimmed the bilogs for this bucket (radosgw-admin bilog trim --bucket bucket_1), but this didn't help.

I suspect that any operation on PG 230.2c on OSD 62 can cause slow operations in my cluster.
I have no idea how to debug/resolve this problem. Any help/hint would be appreciated!
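(What I can still provide, as a sketch: dumping the in-flight and recent ops from the OSD's admin socket while the problem is occurring, assuming osd.62:)

[root@osd-n57 ~]# ceph daemon osd.62 dump_ops_in_flight
[root@osd-n57 ~]# ceph daemon osd.62 dump_historic_ops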


Thanks,
Piotr

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
