Hi,

we are operating a Ceph/OpenStack cluster at ScaleUp and see slow requests on one PG every day, but only between 0:00 and 2:00 UTC. The rest of the time the cluster operates without any issues. I'm new to Ceph and this is my first post to this ML, so please be kind.

ceph pg map 5.40
osdmap e30892 pg 5.40 (5.40) -> up [29,20,32] acting [29,20,32]

ceph --version
ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)

We created a script which calls

/usr/bin/ceph daemon osd.29 dump_historic_ops >> /home/osd29-ops
and
/usr/bin/ceph daemon osd.20 dump_historic_ops >> /home/osd20-ops

every 20 seconds (a simplified sketch of the script is attached below as a P.S.). Here is one example of a slow op it captured:

-- snip --
"description": "osd_op(client.3144268599.0:7872481 5.40 5:03b6209d:::rbd_data.7577bc66334873.000000000000000e:head [stat,write 1064960~4096] snapc ba=[] ondisk+write+known_if_redirected e30892)",
"initiated_at": "2021-07-14 22:00:14.196286",
"age": 27.823148335999999,
"duration": 18.782248133,
"type_data": {
    "flag_point": "commit sent; apply or cleanup",
    "client_info": {
        "client": "client.3144268599",
        "client_addr": "v1:10.23.181.48:0/768948222",
        "tid": 7872481
    },
    "events": [
        {
            "time": "2021-07-14 22:00:14.196286",
            "event": "initiated"
        },
        {
            "time": "2021-07-14 22:00:14.196286",
            "event": "header_read"
        },
        {
            "time": "2021-07-14 22:00:14.196287",
            "event": "throttled"
        },
        {
            "time": "2021-07-14 22:00:14.196302",
            "event": "all_read"
        },
        {
            "time": "2021-07-14 22:00:14.196303",
            "event": "dispatched"
        },
        {
            "time": "2021-07-14 22:00:14.196305",
            "event": "queued_for_pg"
        },
        {
            "time": "2021-07-14 22:00:14.196460",
            "event": "reached_pg"
        },
        {
            "time": "2021-07-14 22:00:14.196470",
            "event": "waiting for rw locks"
        },
        {
            "time": "2021-07-14 22:00:17.999868",
            "event": "reached_pg"
        },
        ... (more "reached_pg" and "waiting for rw locks" events) ...
        {
            "time": "2021-07-14 22:00:32.662470",
            "event": "waiting for rw locks"
        },
        {
            "time": "2021-07-14 22:00:32.885560",
            "event": "reached_pg"
        },
        {
            "time": "2021-07-14 22:00:32.885588",
            "event": "started"
        },
        {
            "time": "2021-07-14 22:00:32.885661",
            "event": "waiting for subops from 20,32"
        },
        {
            "time": "2021-07-14 22:00:32.886279",
            "event": "op_commit"
        },
        {
            "time": "2021-07-14 22:00:32.973286",
            "event": "sub_op_commit_rec"
        },
        {
            "time": "2021-07-14 22:00:32.978367",
            "event": "sub_op_commit_rec"
        },
        {
            "time": "2021-07-14 22:00:32.978411",
            "event": "commit_sent"
        },
        {
            "time": "2021-07-14 22:00:32.978534",
            "event": "done"
        }
    ]
}
},
-- snap --

grep "waiting for subops" /home/osd29-ops | grep 20 | sort | uniq -c
    930     "event": "waiting for subops from 20,30"
   7355     "event": "waiting for subops from 20,32"
   4862     "event": "waiting for subops from 20,58"

I found a similar question on the ceph-users mailing list from 2019,
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/E6L3MSTC74S5HUZVQ7XG4CFPKVNJDTQI/
which was answered by Wesley Peng:

-- snip --
There are too many logs for "waiting for rw locks" that indicates the system is busy. Maybe you want to scaling more OSDs to improve the performance.
-- snap --

Is there a way to find out which part of the system is too busy? Is it disk I/O? And we wonder why it is always only this one PG and no other OSD that is affected, and how we should deal with this. If we add another OSD, the PG will perhaps move to a different OSD, but it will still exist and presumably still show this issue.

Any help is appreciated.

Best regards,
Sven
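
P.S. For reference, a simplified sketch of the collection script mentioned above. The while/sleep loop is only illustrative (a cron or systemd timer would work just as well); paths and OSD ids are the ones from our setup:

-- snip --
#!/bin/bash
# Poll the slow-op history of the two OSDs every 20 seconds and append
# the JSON output to per-OSD files for later analysis.
while true; do
    /usr/bin/ceph daemon osd.29 dump_historic_ops >> /home/osd29-ops
    /usr/bin/ceph daemon osd.20 dump_historic_ops >> /home/osd20-ops
    sleep 20
done
-- snap --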
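
P.P.S. To double-check in which hours the "waiting for rw locks" events pile up, something along these lines should work on the collected files. This assumes GNU grep/cut and the usual dump_historic_ops layout, i.e. that the "time" line sits directly above the "event" line as in the excerpt above:

-- snip --
# Count "waiting for rw locks" events per hour of day (timestamp chars 12-13).
grep -B1 '"event": "waiting for rw locks"' /home/osd29-ops \
  | grep '"time"' \
  | cut -d'"' -f4 \
  | cut -c12-13 \
  | sort | uniq -c
-- snap --

That prints one line per hour with the number of such events, which should make the nightly window visible directly from the data.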