Re: Slow Request on only one PG, every day between 0:00 and 2:00 UTC

Hi Sven,

We had the same problem in our cluster. In our case it was the bucket
lifecycle processing, which runs at 00:00 every day, and two hours after
that the garbage collector runs to delete the objects. We were able to
figure this out by monitoring the number of objects in RADOS and in RGW.
Hope it helps.
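
For reference, these are roughly the numbers to watch (commands off the top
of my head, adjust pool/bucket names to your setup):

  rados df                             # object counts per pool on the RADOS side
  radosgw-admin bucket stats           # num_objects per bucket as RGW sees it
  radosgw-admin lc list                # lifecycle status per bucket
  radosgw-admin gc list --include-all  # objects queued for garbage collection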


On Tue, Jul 27, 2021 at 12:49 PM Sven Anders <sanders@xxxxxxxxxxxxxxx>
wrote:

> Hi,
>
> we are operating a Ceph / OpenStack cluster at ScaleUp and see slow
> requests on one PG every day, but only between 0:00 and 2:00 UTC. The
> rest of the time the cluster operates without any issue.
>
>
> I'm new to Ceph and this is my first post to this ML, so please be kind.
>
>
> ceph pg map 5.40
> osdmap e30892 pg 5.40 (5.40) -> up [29,20,32] acting [29,20,32]
>
> ceph --version
> ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus
> (stable)
>
>
> We created a script which calls
>   /usr/bin/ceph daemon osd.29 dump_historic_ops >> /home/osd29-ops
> and
>   /usr/bin/ceph daemon osd.20 dump_historic_ops >> /home/osd20-ops
> every 20 seconds.
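>
> In essence it is just a loop like this, run on the host of the respective
> OSD ("ceph daemon" talks to the local admin socket); OSD_ID here is just a
> placeholder for 29 or 20:
>
>   while true; do
>       /usr/bin/ceph daemon osd.${OSD_ID} dump_historic_ops >> /home/osd${OSD_ID}-ops
>       sleep 20
>   done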
>
> Here is one example:
>
> -- snip --
>            "description": "osd_op(client.3144268599.0:7872481 5.40
> 5:03b6209d:::rbd_data.7577bc66334873.000000000000000e:head
>  [stat,write 1064960~4096] snapc ba=[] ondisk+write+known_if_redirected
> e30892)",
>             "initiated_at": "2021-07-14 22:00:14.196286",
>             "age": 27.823148335999999,
>             "duration": 18.782248133,
>             "type_data": {
>                 "flag_point": "commit sent; apply or cleanup",
>                 "client_info": {
>                     "client": "client.3144268599",
>                     "client_addr": "v1:10.23.181.48:0/768948222",
>                     "tid": 7872481
>                 },
>                 "events": [
>                     {
>                         "time": "2021-07-14 22:00:14.196286",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:14.196286",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:14.196287",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:14.196302",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:14.196303",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:14.196305",
>                         "event": "queued_for_pg"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:14.196460",
>                         "event": "reached_pg"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:14.196470",
>                         "event": "waiting for rw locks"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:17.999868",
>                         "event": "reached_pg"
>                     },
> ...
> (more "reached_pg" and "waiting for rw locks" events)
> ...
>                         "time": "2021-07-14 22:00:32.662470",
>                         "event": "waiting for rw locks"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.885560",
>                         "event": "reached_pg"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.885588",
>                         "event": "started"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.885661",
>                         "event": "waiting for subops from 20,32"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.886279",
>                         "event": "op_commit"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.973286",
>                         "event": "sub_op_commit_rec"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.978367",
>                         "event": "sub_op_commit_rec"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.978411",
>                         "event": "commit_sent"
>                     },
>                     {
>                         "time": "2021-07-14 22:00:32.978534",
>                         "event": "done"
>                     }
>                 ]
>             }
>         },
> -- snap --
>
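> To rank the captured ops by duration, something like this should work
> (assuming the dumps have a top-level "ops" array; jq copes with the
> concatenated JSON documents in the appended file):
>
>   jq -r '.ops[]? | [.duration, .description] | @tsv' /home/osd29-ops | sort -rn | head
>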
> grep "waiting for subops"  /home/osd29-ops |grep 20 |sort |uniq -c
>     930                         "event": "waiting for subops from 20,30"
>    7355                         "event": "waiting for subops from 20,32"
>    4862                         "event": "waiting for subops from 20,58"
>
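> A rough histogram of the start times per hour, to see in which hours the
> captured ops pile up:
>
>   grep '"initiated_at"' /home/osd29-ops | awk '{print $3}' | cut -d: -f1 | sort | uniq -c
>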
> I found a similar question on the ceph mailing list from 2019,
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/E6L3MSTC74S5HUZVQ7XG4CFPKVNJDTQI/
>
> which was answered by Wesley Peng:
>
> -- snip --
> There are too many logs for "waiting for rw locks" that indicates the
> system is busy. Maybe you want to scaling more OSDs to improve the
> performance.
> -- snap --
>
> Is there a way to find out which part of the system is too busy? Is it
> disk I/O?
> And we wonder why it is always only this one PG and not any other, and
> how to deal with this. If we add another OSD, the PG will perhaps move
> to a different OSD, but it will still exist and still have this issue.
>
> Any help appreciated
>
> Best Regards
>
> Sven
> --
>
> =================================================
> ScaleUp Technologies GmbH & Co. KG
> Suederstrasse 198
> 20537 Hamburg
> Germany
>
> Tel.: +49 40 59380500
> Fax: +49 40 59380260
>
> Registered Office: Hamburg
> Commercial Register Hamburg, HRA 90445
>
> General Partner: ScaleUp Management GmbH
> Registered Office: Hamburg
> Commercial Register Hamburg, HRB 91902
> Directors: Christoph Streit, Gihan Behrmann
>
> www.scaleuptech.com
> =================================================
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>