Hi,

My Harbor registry uses Ceph object storage to store its images, but a few moments ago I couldn't pull or push images from Harbor. At the same time, Ceph was in warning health status: the cluster reported that osd.24 had slow ops. I checked ceph-osd.24.log, which showed the following:

2020-07-28 19:01:40.599 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)
2020-07-28 19:01:41.558 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)
2020-07-28 19:01:42.579 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)
2020-07-28 19:01:43.566 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)
2020-07-28 19:01:44.588 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)
2020-07-28 19:01:45.627 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)
2020-07-28 19:01:46.674 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)
2020-07-28 19:01:47.701 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144852 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.bucket_list] snapc 0=[] ondisk+read+known_if_redirected e4324)
2020-07-28 19:01:48.729 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144852 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.bucket_list] snapc 0=[] ondisk+read+known_if_redirected e4324)
2020-07-28 19:01:49.729 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.166289.0:34144852 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.bucket_list] snapc 0=[] ondisk+read+known_if_redirected e4324)
2020-07-28 19:01:50.889 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.166289.0:34144852 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.bucket_list] snapc 0=[] ondisk+read+known_if_redirected e4324)
......
......
2020-07-28 21:03:35.053 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 46 slow ops, oldest is osd_op(client.166298.0:34904067 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.bucket_list] snapc 0=[]
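In case it helps: a small throwaway script I used to double-check which PG and object the blocked ops were actually hitting (the regex and the grouping key are my own, not Ceph tooling; the sample lines are copied from the excerpt above):

```python
import re
from collections import Counter

# Two lines copied from the ceph-osd.24.log excerpt above (one write op, one read op).
log_lines = [
    "2020-07-28 19:01:40.599 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144787 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.guard_bucket_resharding,call rgw.bucket_complete_op] snapc 0=[] ondisk+write+known_if_redirected e4324)",
    "2020-07-28 19:01:47.701 7f907a39c700 -1 osd.24 4324 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.166289.0:34144852 17.4f 17:f29d8b20:::.dir.313c8244-fe4d-4d46-bf9b-0e33e46be041.157033.2:head [call rgw.bucket_list] snapc 0=[] ondisk+read+known_if_redirected e4324)",
]

# osd_op(<client>.<inst>:<tid> <pg> <object>:head [<ops>] ...
pattern = re.compile(r"osd_op\((client\.\d+)\.\d+:\d+ (\S+) (\S+):head \[([^\]]+)\]")

per_object = Counter()
for line in log_lines:
    m = pattern.search(line)
    if m:
        client, pg, obj, ops = m.groups()
        # Everything after ":::" is the object name; ".dir.<bucket-id>..." is
        # how RGW names bucket index shard objects.
        per_object[(pg, obj.split(":::")[-1])] += 1

for (pg, obj), count in per_object.items():
    print(f"pg={pg} object={obj} ops={count}")
```

For my log it prints a single line for PG 17.4f and the one .dir.* index object, so every slow op really was stuck on the same bucket's index shard.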
ondisk+read+known_if_redirected e4324)

After restarting osd.24, the cluster became healthy again, and so did Harbor. What confuses me is why Harbor couldn't get data from its own bucket while the log indicates the blocked client ops were on a different bucket. I wouldn't expect some slow ops on a single OSD to have a bad effect on all buckets.

Any ideas are appreciated. Thanks.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx