ceph status intermittently outputs "0 slow ops"

Hi,

The "ceph status" intermittently shows "0 slow ops" . Could you tell
me how should
I handle this problem and what does "0 slow ops" mean?

I investigated by referring to the following documents, but had no luck.

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#debugging-slow-requests
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/troubleshooting_guide/index#slow-requests-or-requests-are-blocked_diag
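
The checks those documents suggest boil down to something like the following (a rough sketch; the OSD id comes from the warning shown below):

```
# Identify which daemon is reporting slow ops
ceph health detail

# Query the reporting OSD's admin socket for in-flight, blocked, and recent ops
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_blocked_ops
ceph daemon osd.0 dump_historic_ops
```

The results of these checks are shown below.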

Here is the result of my investigation.

The outputs of "ceph status" are HEALTH_OK or the following HEALTH_WARN.

```
$ ceph status
  cluster:
    id:     b52d5f3d-ba14-442e-a089-0bca47b83758
    health: HEALTH_WARN
            0 slow ops, oldest one blocked for 36 sec, osd.0 has slow ops

  services:
    mon: 3 daemons, quorum a,b,c (age 8h)
    mgr: a(active, since 17h), standbys: b
    osd: 24 osds: 24 up (since 8h), 24 in (since 8h)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   8 pools, 592 pgs
    objects: 8.87M objects, 270 GiB
    usage:   948 GiB used, 132 TiB / 132 TiB avail
    pgs:     592 active+clean

  io:
    client:   49 KiB/s rd, 44 KiB/s wr, 47 op/s rd, 22 op/s wr
```
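
Since the warning appears only intermittently, the most context I can get is by polling the health checks while the warning is active; a sketch of what that could look like:

```
# Poll the SLOW_OPS health check to catch the transient warning with more detail
watch -n 5 "ceph health detail | grep -A 2 SLOW_OPS"
```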

The device for osd.0 doesn't show any problems from a SMART perspective.

```
$ sudo nsenter -t 3216173 -m nvme smart-log /host/dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 18 C
available_spare                     : 99%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 2538919
data_units_written                  : 34395997
host_read_commands                  : 62385481
host_write_commands                 : 938527838
controller_busy_time                : 90
power_cycles                        : 27
power_on_hours                      : 30430
unsafe_shutdowns                    : 23
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
```

Detailed OSD log entries about the slow ops:

```
debug 2021-12-23T05:24:18.035+0000 7f85f2cfe700 -1 osd.0 6106
get_health_metrics reporting 1 slow ops, oldest is
osd_op(client.7493316.0:443601 5.2
5:59ed161b:::.dir.67ade7e7-b37b-486e-a969-5d39c507d06d.3619433.1.80:head
[call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=301b] snapc 0=[] ondisk+write+known_if_redirected e6106)
```
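
The slow op above targets a bucket index shard object in the ceph-object-store-0.rgw.buckets.index pool (pool id 5) and calls rgw.guard_bucket_resharding / rgw.bucket_prepare_op. To rule out an oversized index shard, I could count the omap entries of that object with something like this (just a sketch, reusing the pool and object names from the log above):

```
# Count bucket index entries (omap keys) in the shard named in the slow op
$ kubectl exec -n ceph-object-store deployment/rook-ceph-tools -- \
    rados -p ceph-object-store-0.rgw.buckets.index listomapkeys \
    .dir.67ade7e7-b37b-486e-a969-5d39c507d06d.3619433.1.80 | wc -l
```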

All "ceph daemon osd.0
dump_{dump_ops_in_flight,blocked_ops,historic_ops} have empty "ops".

```
$ kubectl exec -n ceph-object-store deploy/rook-ceph-osd-0 -- ceph daemon osd.0 dump_ops_in_flight
Defaulted container "osd" out of: osd, blkdevmapper (init), activate
(init), expand-bluefs (init), chown-container-data-dir (init)
{
    "ops": [],
    "num_ops": 0
}
```

```
$ kubectl exec -n ceph-object-store deploy/rook-ceph-osd-0 -- ceph daemon osd.0 dump_blocked_ops
Defaulted container "osd" out of: osd, blkdevmapper (init), activate
(init), expand-bluefs (init), chown-container-data-dir (init)
{
    "ops": [],
    "complaint_time": 30,
    "num_blocked_ops": 0
}
```

```
$ kubectl exec -n ceph-object-store deploy/rook-ceph-osd-0 -- ceph daemon osd.0 dump_historic_ops
Defaulted container "osd" out of: osd, blkdevmapper (init), activate
(init), expand-bluefs (init), chown-container-data-dir (init)
{
    "size": 20,
    "duration": 600,
    "ops": []
}
```
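
I could also try the admin socket's historic slow ops dump, in case it retains an entry after the fact (assuming that admin socket command is available in Pacific):

```
$ kubectl exec -n ceph-object-store deploy/rook-ceph-osd-0 -- ceph daemon osd.0 dump_historic_slow_ops
```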

No bucket index resharding is in progress.

```
$ kubectl exec -n ceph-object-store deployment/rook-ceph-tools -- radosgw-admin reshard list
[]
```
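
To check whether any bucket index shard is filling up (i.e. whether a reshard would be needed soon), something like this could also be relevant:

```
$ kubectl exec -n ceph-object-store deployment/rook-ceph-tools -- radosgw-admin bucket limit check
```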

The I/O performance of both the pools and the above-mentioned device seems
to be fine.

```
$ kubectl exec -n ceph-object-store deploy/rook-ceph-tools -- ceph osd pool stats
pool device_health_metrics id 1
  nothing is going on

pool ceph-object-store-0.rgw.control id 2
  nothing is going on

pool ceph-object-store-0.rgw.meta id 3
  client io 255 B/s rd, 85 B/s wr, 0 op/s rd, 0 op/s wr

pool ceph-object-store-0.rgw.log id 4
  client io 14 KiB/s rd, 170 B/s wr, 13 op/s rd, 9 op/s wr

pool ceph-object-store-0.rgw.buckets.index id 5
  client io 32 KiB/s rd, 6.3 KiB/s wr, 32 op/s rd, 9 op/s wr

pool ceph-object-store-0.rgw.buckets.non-ec id 6
  nothing is going on

pool .rgw.root id 7
  nothing is going on

pool ceph-object-store-0.rgw.buckets.data id 8
  client io 10 KiB/s rd, 44 KiB/s wr, 8 op/s rd, 34 op/s wr
```
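
For completeness, per-OSD commit/apply latency could also be checked with something like:

```
$ kubectl exec -n ceph-object-store deploy/rook-ceph-tools -- ceph osd perf
```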

Software versions:
- Rook: 1.7.7
- Ceph: 16.2.6

Thanks,
Yuma
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


