Re: OSD SLOW_OPS is filling MONs disk space

Eugen Block <eblock@xxxxxx> · Wed, 23 Feb 2022 10:41:09 +0000

Hi,

How can I identify which operation this OSD is trying to achieve as
osd_op() is a bit large ^^ ?

I would start by querying the OSD for its historic slow ops to see which
operation it is:

ceph daemon osd.<OSD> dump_historic_slow_ops
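
The dump can get long, so here is a rough sketch for pulling just the op descriptions out of it. It assumes each op in the JSON carries a "description": "..." string field without embedded quotes, which may not hold on every release; the function name slow_op_descriptions is made up for illustration.

```shell
# Hypothetical helper: read the JSON dump on stdin and print one op
# description per line.
# Assumption: every op has a "description": "..." field and the value
# contains no escaped double quotes.
slow_op_descriptions() {
    grep -o '"description": *"[^"]*"' \
        | sed -e 's/^"description": *"//' -e 's/"$//'
}

# Usage (on the host where the OSD's admin socket lives):
#   ceph daemon osd.<OSD> dump_historic_slow_ops | slow_op_descriptions
```

Each printed line should then be one osd_op(...) string, which is easier to scan than the full dump.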

How can I identify the related images to this data chunk?

You could go through all RBD images and check the block_name_prefix line
in the output of rbd info. This could take some time depending on how many
images you have:

        block_name_prefix: rbd_data.ca69416b8b4567

I sometimes do that with this for loop:

for i in `rbd -p <POOL> ls`; do if [ $(rbd info <POOL>/$i | grep -c <PREFIX>) -gt 0 ]; then echo "image: $i"; break; fi; done

So in your case it would look something like this:

for i in `rbd -p <POOL> ls`; do if [ $(rbd info <POOL>/$i | grep -c 89a4a940aba90b) -gt 0 ]; then echo "image: $i"; break; fi; done
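
If you need this more than once, the loop above can be wrapped into small functions. This is only a sketch: the names prefix_of_object and find_image_by_prefix are made up, and it assumes the data object name is the block_name_prefix followed by a dot and the object number.

```shell
# Strip the trailing object number from an RBD data object name,
# leaving the part that rbd info shows as block_name_prefix.
prefix_of_object() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: print the first image in pool $1 whose rbd info
# output mentions prefix $2, then stop. Returns 1 if nothing matches.
find_image_by_prefix() {
    pool=$1
    prefix=$2
    for img in $(rbd -p "$pool" ls); do
        if rbd info "$pool/$img" | grep -qF "$prefix"; then
            echo "image: $img"
            return 0
        fi
    done
    return 1
}
```

It would be called like this:

find_image_by_prefix <POOL> "$(prefix_of_object rbd_data.5.89a4a940aba90b.00000000000000a0)"

As far as I know, with an EC data pool the images themselves are listed in the replicated pool that holds the image headers, so <POOL> would be that pool and not the EC pool.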

To see which clients are connected you can check the mon daemon:

ceph daemon mon.<MON> sessions
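
The session list can be long on a busy cluster. A small filter like the following might help to narrow it down to the client ID from the slow op; it only assumes that the session entries contain the string client.<ID> somewhere, and the function name sessions_for_client is made up.

```shell
# Hypothetical helper: keep only the session lines mentioning client.<ID>.
# -F treats the dot literally, -w keeps client.123 from matching client.1234.
sessions_for_client() {
    grep -wF "client.$1"
}

# Usage:
#   ceph daemon mon.<MON> sessions | sessions_for_client <ID>
```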

The mon daemon also has a history of slow ops:

ceph daemon mon.<MON> dump_historic_slow_ops

Regards,
Eugen

Quoting Gaël THEROND <gael.therond@xxxxxxxxxxxx>:

Hi everyone, I've had a really nasty issue for about two days now: our
cluster reports a bunch of SLOW_OPS on one of our OSDs, as shown here:

https://paste.openstack.org/show/b3DkgnJDVx05vL5o4OmY/

Here is the cluster specification:
  * Used to store OpenStack related data (VMs/Snapshots/Volumes/Swift).
  * Based on Ceph Nautilus 14.2.8 installed using ceph-ansible.
  * Uses an EC-based storage profile.
  * We have separate, dedicated frontend and backend 10Gbps networks.
  * We don't have any network issues observed or reported by our monitoring
system.

Here is our current cluster status:
https://paste.openstack.org/show/biVnkm9Yyog3lmSUn0UK/
Here is a detailed view of our cluster status:
https://paste.openstack.org/show/bgKCSVuow0JUZITo2Ndj/

My main issue here is that this health alert is starting to fill the
Monitor's disk and so triggers a MON_DISK_BIG alert.

I'm worried as I'm having a hard time identifying which OSD operation is
actually slow and, especially, which image it concerns and which client
is using it.

So far I've tried:
  * To match this client ID with any watcher of our stored
volumes/VMs/snapshots by extracting the whole list and then using the
following command: rbd status <pool>/<image>
     Unfortunately none of the watchers matches the client reported by
the OSD, on any pool.

  * To map this reported chunk of data to any of our stored images using:
ceph osd map <pool>/rbd_data.5.89a4a940aba90b.00000000000000a0
     Unfortunately, every pool name existing within our cluster gives me
back an answer with no image information and a different watcher client ID.

So my questions are:

How can I identify which operation this OSD is trying to achieve as
osd_op() is a bit large ^^ ?
Does the snapc part of the log relate to snapshots, or is that something
totally different?
How can I identify the related images to this data chunk?
Is there official documentation about the SLOW_OPS operation codes that
explains how to read the logs, i.e. which block is the PG number, which is
the ID of something, etc.?

Thanks a lot everyone and feel free to ask for additional information!
G.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx