So! Here is a really mysterious resolution: the issue vanished the moment I
queried the OSD for its slow_ops history. I didn't have time to do anything
except look at the OSD ops history, which was actually empty :-) I'll keep
all your suggestions in case it ever comes back :-) Thanks a lot!

On Wed, 23 Feb 2022 at 12:51, Gaël THEROND <gael.therond@xxxxxxxxxxxx> wrote:

> Thanks a lot Eugen, I dumbly forgot about the rbd block prefix!
>
> I'll try that this afternoon and let you know how it goes.
>
> On Wed, 23 Feb 2022 at 11:41, Eugen Block <eblock@xxxxxx> wrote:
>
>> Hi,
>>
>> > How can I identify which operation this OSD is trying to achieve, as
>> > osd_op() is a bit large ^^ ?
>>
>> I would start by querying the OSD for its historic slow ops to see which
>> operation it is:
>>
>> ceph daemon osd.<OSD> dump_historic_slow_ops
>>
>> > How can I identify the images related to this data chunk?
>>
>> You could go through all rbd images and check for the line containing
>> block_name_prefix; this could take some time depending on how many
>> images you have:
>>
>> block_name_prefix: rbd_data.ca69416b8b4567
>>
>> I sometimes do that with this for loop:
>>
>> for i in `rbd -p <POOL> ls`; do if [ $(rbd info <POOL>/$i | grep -c <PREFIX>) -gt 0 ]; then echo "image: $i"; break; fi; done
>>
>> So in your case it would look something like this:
>>
>> for i in `rbd -p <POOL> ls`; do if [ $(rbd info <POOL>/$i | grep -c 89a4a940aba90b) -gt 0 ]; then echo "image: $i"; break; fi; done
>>
>> To see which clients are connected you can check the mon daemon:
>>
>> ceph daemon mon.<MON> sessions
>>
>> The mon daemon also has a history of slow ops:
>>
>> ceph daemon mon.<MON> dump_historic_slow_ops
>>
>> Regards,
>> Eugen
>>
>>
>> Quoting Gaël THEROND <gael.therond@xxxxxxxxxxxx>:
>>
>> > Hi everyone, I've been having a really nasty issue for around two days:
>> > our cluster reports a bunch of SLOW_OPS on one of our OSDs, as shown
>> > here:
>> >
>> > https://paste.openstack.org/show/b3DkgnJDVx05vL5o4OmY/
>> >
>> > Here is the cluster specification:
>> > * Used to store Openstack-related data (VMs/Snapshots/Volumes/Swift).
>> > * Based on CEPH Nautilus 14.2.8 installed using ceph-ansible.
>> > * Uses an EC-based storage profile.
>> > * We have separate, dedicated 10Gbps frontend and backend networks.
>> > * We don't have any network issues observed or reported by our
>> > monitoring system.
>> >
>> > Here is our current cluster status:
>> > https://paste.openstack.org/show/biVnkm9Yyog3lmSUn0UK/
>> > Here is a detailed view of our cluster status:
>> > https://paste.openstack.org/show/bgKCSVuow0JUZITo2Ndj/
>> >
>> > My main issue here is that this health alert is starting to fill the
>> > monitors' disks and so triggers a MON_DISK_BIG alert.
>> >
>> > I'm worried as I'm having a hard time identifying which OSD operation
>> > is actually slow and, especially, which image it concerns and which
>> > client is using it.
>> >
>> > So far I've tried:
>> > * To match this client ID with any watcher of our stored
>> > volumes/VMs/snapshots by extracting the whole list and then using the
>> > following command: rbd status <pool>/<image>
>> > Unfortunately none of the watchers matches the client reported by
>> > the OSD, on any pool.
>> > * To map this reported chunk of data to one of our stored images using:
>> > ceph osd map <pool>/rbd_data.5.89a4a940aba90b.00000000000000a0
>> > Unfortunately every pool name existing within our cluster gives me
>> > back an answer with no image information and a different watcher
>> > client ID.
>> >
>> > So my questions are:
>> >
>> > How can I identify which operation this OSD is trying to achieve, as
>> > osd_op() is a bit large ^^ ?
>> > Does the snapc information within the log relate to snapshots, or is
>> > that something totally different?
>> > How can I identify the images related to this data chunk?
>> > Is there official documentation about SLOW_OPS operation codes
>> > explaining how to read the logs, i.e. something that explains which
>> > block is the PG number, which is the ID of something, etc.?
>> >
>> > Thanks a lot everyone and feel free to ask for additional information!
>> > G.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
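
Putting Eugen's steps together, a rough end-to-end lookup could be scripted
as sketched below. This is an untested sketch, not something taken from the
thread itself: it assumes jq is available, that it runs on (or is adapted to)
the node hosting the OSD with rbd client access, that OSD_ID and POOLS are
placeholders you fill in, and that the .ops[].description JSON path for
dump_historic_slow_ops output is an assumption that may need adjusting for
your Ceph release.

  #!/usr/bin/env bash
  # Sketch: dump the slow op descriptions from one OSD, extract the
  # rbd_data prefix(es), then scan a list of pools for the image(s)
  # whose block_name_prefix matches, and print their watchers.
  # OSD_ID and POOLS are placeholders -- adjust them for your cluster.

  OSD_ID=12
  POOLS="volumes vms images"

  # 1) Dump historic slow ops; the .ops[].description path is an assumed
  #    layout, check the actual output of your release first.
  ceph daemon "osd.${OSD_ID}" dump_historic_slow_ops \
    | jq -r '.ops[].description' > /tmp/slow_ops.txt

  # 2) Extract the rbd_data prefixes, e.g.
  #    "rbd_data.5.89a4a940aba90b.00000000000000a0" -> "89a4a940aba90b".
  prefixes=$(grep -oE 'rbd_data\.([0-9]+\.)?[0-9a-f]+' /tmp/slow_ops.txt \
    | awk -F. '{print $NF}' | sort -u)

  # 3) Find the image owning each prefix and show its watchers.
  for prefix in $prefixes; do
    echo "== prefix $prefix =="
    for pool in $POOLS; do
      for img in $(rbd -p "$pool" ls); do
        if rbd info "$pool/$img" | grep -q "block_name_prefix.*${prefix}"; then
          echo "image: $pool/$img"
          rbd status "$pool/$img"   # watchers = connected client IDs
        fi
      done
    done
  done

From there, the reported watcher client IDs can be cross-checked against
ceph daemon mon.<MON> sessions on a monitor to see which clients are
currently connected, as Eugen suggested.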