Hi,

we are operating a Ceph/OpenStack cluster at ScaleUp and see slow requests on one PG every day, but only between 0:00 and 2:00 UTC. The rest of the time the cluster operates without any issues. I'm new to Ceph and this is my first post to this ML, so please be kind.

ceph pg map 5.40
osdmap e30892 pg 5.40 (5.40) -> up [29,20,32] acting [29,20,32]

ceph --version
ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)

We created a script which calls

/usr/bin/ceph daemon osd.29 dump_historic_ops >> /home/osd29-ops
and
/usr/bin/ceph daemon osd.20 dump_historic_ops >> /home/osd20-ops

every 20 seconds (a simplified sketch of the script is attached below as a P.S.). Here is one example of a slow op it captured:

-- snip --
"description": "osd_op(client.3144268599.0:7872481 5.40 5:03b6209d:::rbd_data.7577bc66334873.000000000000000e:head [stat,write 1064960~4096] snapc ba=[] ondisk+write+known_if_redirected e30892)",
"initiated_at": "2021-07-14 22:00:14.196286",
"age": 27.823148335999999,
"duration": 18.782248133,
"type_data": {
    "flag_point": "commit sent; apply or cleanup",
    "client_info": {
        "client": "client.3144268599",
        "client_addr": "v1:10.23.181.48:0/768948222",
        "tid": 7872481
    },
    "events": [
        {
            "time": "2021-07-14 22:00:14.196286",
            "event": "initiated"
        },
        {
            "time": "2021-07-14 22:00:14.196286",
            "event": "header_read"
        },
        {
            "time": "2021-07-14 22:00:14.196287",
            "event": "throttled"
        },
        {
            "time": "2021-07-14 22:00:14.196302",
            "event": "all_read"
        },
        {
            "time": "2021-07-14 22:00:14.196303",
            "event": "dispatched"
        },
        {
            "time": "2021-07-14 22:00:14.196305",
            "event": "queued_for_pg"
        },
        {
            "time": "2021-07-14 22:00:14.196460",
            "event": "reached_pg"
        },
        {
            "time": "2021-07-14 22:00:14.196470",
            "event": "waiting for rw locks"
        },
        {
            "time": "2021-07-14 22:00:17.999868",
            "event": "reached_pg"
        },
        ... (more "reached_pg" and "waiting for rw locks" events) ...
        {
            "time": "2021-07-14 22:00:32.662470",
            "event": "waiting for rw locks"
        },
        {
            "time": "2021-07-14 22:00:32.885560",
            "event": "reached_pg"
        },
        {
            "time": "2021-07-14 22:00:32.885588",
            "event": "started"
        },
        {
            "time": "2021-07-14 22:00:32.885661",
            "event": "waiting for subops from 20,32"
        },
        {
            "time": "2021-07-14 22:00:32.886279",
            "event": "op_commit"
        },
        {
            "time": "2021-07-14 22:00:32.973286",
            "event": "sub_op_commit_rec"
        },
        {
            "time": "2021-07-14 22:00:32.978367",
            "event": "sub_op_commit_rec"
        },
        {
            "time": "2021-07-14 22:00:32.978411",
            "event": "commit_sent"
        },
        {
            "time": "2021-07-14 22:00:32.978534",
            "event": "done"
        }
    ]
}
},
-- snap --

grep "waiting for subops" /home/osd29-ops | grep 20 | sort | uniq -c
    930     "event": "waiting for subops from 20,30"
   7355     "event": "waiting for subops from 20,32"
   4862     "event": "waiting for subops from 20,58"

I found a similar question on the ceph-users mailing list from 2019,
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/E6L3MSTC74S5HUZVQ7XG4CFPKVNJDTQI/
which was answered by Wesley Peng:

-- snip --
There are too many logs for "waiting for rw locks" that indicates the system is busy. Maybe you want to scaling more OSDs to improve the performance.
-- snap --

Is there a way to find out which part of the system is too busy? Is it disk I/O? And we wonder why it is always only this one PG and no other OSD that is affected, and how we should deal with this. If we add another OSD, the PG will perhaps move to a different OSD, but it will still exist and presumably still show this issue.

Any help is appreciated.

Best regards,
Sven
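
P.S. For reference, a simplified sketch of the collection script mentioned above. The while/sleep loop is only illustrative (a cron or systemd timer would work just as well); paths and OSD ids are the ones from our setup:

-- snip --
#!/bin/bash
# Poll the slow-op history of the two OSDs every 20 seconds and append
# the JSON output to per-OSD files for later analysis.
while true; do
    /usr/bin/ceph daemon osd.29 dump_historic_ops >> /home/osd29-ops
    /usr/bin/ceph daemon osd.20 dump_historic_ops >> /home/osd20-ops
    sleep 20
done
-- snap --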
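
P.P.S. To double-check in which hours the "waiting for rw locks" events pile up, something along these lines should work on the collected files. This assumes GNU grep/cut and the usual dump_historic_ops layout, i.e. that the "time" line sits directly above the "event" line as in the excerpt above:

-- snip --
# Count "waiting for rw locks" events per hour of day (timestamp chars 12-13).
grep -B1 '"event": "waiting for rw locks"' /home/osd29-ops \
  | grep '"time"' \
  | cut -d'"' -f4 \
  | cut -c12-13 \
  | sort | uniq -c
-- snap --

That prints one line per hour with the number of such events, which should make the nightly window visible directly from the data.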