Re: Watcher Issue

Hi Reid,

Yep, it should definitely help if the client node (kernel) is not accessing the image anymore.
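If in doubt, a quick way to confirm is to check directly on the client node (a rough sketch, assuming the rbd CLI is available on the node and reusing the image name from your earlier output):

# rbd device list | grep csi-vol-945c6a66

If nothing shows up there, marking the primary OSD down should clear the stale watcher.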

Thanks for sharing the tracker. It's good to know that a fix is on the way.

Cheers,
Frédéric.

----- On 23 Jan 25, at 15:02, Reid Guyett reid.guyett@xxxxxxxxx wrote:

> Hi,
> 
> I've had a similar issue, but outside of ceph-csi. Running a CRUD test
> (create, map, write, read, unmap, and delete) against an RBD image in a
> short amount of time can leave it with a stuck watcher. I assume it comes
> from mapping and unmapping very quickly (under 30 seconds).
> What I have found is that if you restart the primary OSD for the header
> object, the watcher will go away, assuming nothing is actually watching it.
> 
>> rbd info -p pool-name rbd-name
>> # note the image id in the output, e.g. 1234
>> ceph osd map pool-name rbd_header.1234
>> # note the primary OSD, shown as pNNN in the acting set, e.g. p43 -> OSD 43
>> ceph osd down 43
>>
> 
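> Once the OSD reports back up, a quick check (same pool/image name
> placeholders as above) to confirm the stale watcher is gone:
> 
>> rbd status pool-name/rbd-name
>> # "Watchers: none" means the stuck watcher has been dropped
>>
> 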
> This is the tracker <https://tracker.ceph.com/issues/58120> I'm watching,
> and the backport indicates it should be fixed in 18.2.5.
> 
> Hope this helps,
> Reid
> 
> 
> On Wed, Jan 22, 2025 at 4:14 PM Devender Singh <devender@xxxxxxxxxx> wrote:
> 
>> Hello Frederic
>>
>> Thanks for your email.
>> We already verified those and tried killing them, and upgrading the k8s
>> and csi-plugin, but nothing helps.
>> Below is the output; it did not report any volume.
>>
>> # for pod in $(kubectl -n $namespace get pods | grep -E
>> 'rbdplugin|nodeplugin' | grep -v provisioner | awk '{print $1}'); do echo
>> $pod; kubectl exec -it -n $namespace $pod -c csi-rbdplugin -- rbd device
>> list | grep $image ; done
>> ceph-csi-rbd-nodeplugin-48vs2
>> ceph-csi-rbd-nodeplugin-6zmjj
>> ceph-csi-rbd-nodeplugin-7g6r5
>> ceph-csi-rbd-nodeplugin-bp84x
>> ceph-csi-rbd-nodeplugin-bt6hh
>> ceph-csi-rbd-nodeplugin-d4tww
>> ceph-csi-rbd-nodeplugin-rtb68
>> ceph-csi-rbd-nodeplugin-t87db
>>
>> But still getting the error:
>> # date;kubectl -n elastic describe pod/es-es-default-3 |grep -i warning
>> Wed 22 Jan 2025 01:12:09 PM PST
>>   Warning  FailedMount  2s (x13 over 21m)  kubelet
>> MountVolume.MountDevice failed for volume "pvc-3a2048f1" : rpc error: code
>> = Internal desc = rbd image k8s-rgnl-disks/csi-vol-945c6a66 is still being
>> used
>>
>>
>> Regards
>> Dev
>>
>> > On Jan 21, 2025, at 11:50 PM, Frédéric Nass <
>> frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>> >
>> > Hi Dev,
>> >
>> > Can you run the command below to check whether this image is still
>> considered mapped by any ceph-csi nodeplugin?
>> >
>> > $ namespace=ceph-csi-rbd
>> > $ image=csi-vol-945c6a66-9129
>> > $ for pod in $(kubectl -n $namespace get pods | grep -E
>> 'rbdplugin|nodeplugin' | grep -v provisioner | awk '{print $1}'); do echo
>> $pod; kubectl exec -it -n $namespace $pod -c csi-rbdplugin -- rbd device
>> list | grep $image ; done
>> >
>> > If it shows up in the output, get into the csi-rbdplugin container of
>> the nodeplugin pod that listed the image and unmount/unmap it:
>> >
>> > $ kubectl -n $namespace exec -ti ceph-csi-rbd-nodeplugin-xxxxx -c
>> csi-rbdplugin -- sh           <---- please adjust the nodeplugin pod name here
>> > sh-4.4#
>> > sh-4.4# rbd device list
>> > id  pool           namespace  image                  snap  device
>> > 0   k8s-rgnl-disks            csi-vol-945c6a66-9129  -     /dev/rbd0
>> > sh-4.4# umount /dev/rbd/k8s-rgnl-disks/csi-vol-945c6a66-9129
>> > sh-4.4# rbd unmap /dev/rbd/k8s-rgnl-disks/csi-vol-945c6a66-9129
>> > sh-4.4# rbd device list
>> > sh-4.4#
>> >
>> > Hope there's no typo.
>> >
>> > Regards,
>> > Frédéric.
>> >
>> > ----- On 21 Jan 25, at 23:33, Devender Singh devender@xxxxxxxxxx wrote:
>> >
>> >> Hello Eugen
>> >>
>> >> Thanks for your reply.
>> >> I have the image available and it’s not in the trash.
>> >>
>> >> When scaling a pod to a different node using the StatefulSet, the pod
>> hits a mount issue.
>> >>
>> >> I was looking for a command to kill the client.id from Ceph. Ceph must
>> >> have a command to kill its clients, etc…
>> >> I don’t understand why the pod complains that the same volume is in use
>> >> by a k8s host when it is mapped nowhere. Not sure what to do in this
>> >> situation. We tried upgrading the CSI plugin and the k8s cluster, renamed
>> >> the image and blocklisted the host, then renamed the image back to its
>> >> original name, but rbd status still shows the same client host.
>> >>
>> >>
>> >> Regards
>> >> Dev
>> >>
>> >>> On Jan 21, 2025, at 12:16 PM, Eugen Block <eblock@xxxxxx> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> have you checked if the image is in the trash?
>> >>>
>> >>> rbd -p {pool} trash ls
>> >>>
>> >>> You can try to restore the image if there is one, then blocklist the
>> client to
>> >>> release the watcher, then delete the image again.
>> >>>
>> >>> I have to do that from time to time on a customer’s openstack cluster.
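>> >>>
>> >>> Roughly (a sketch; take the image id from 'rbd trash ls' and the
>> >>> watcher address from 'rbd status', and adjust the pool/image names):
>> >>>
>> >>> rbd -p {pool} trash restore {image-id}
>> >>> ceph osd blocklist add {watcher-address}   # e.g. 10.160.0.245:0/2076588905
>> >>> rbd -p {pool} rm {image}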
>> >>>
>> >>> Quoting Devender Singh <devender@xxxxxxxxxx>:
>> >>>
>> >>>> Hello
>> >>>>
>> >>>> Seeking some help: can I clean up the client mounting my volume?
>> >>>>
>> >>>> rbd status pool/image
>> >>>>
>> >>>> Watchers:
>> >>>>    watcher=10.160.0.245:0/2076588905 client.12541259
>> cookie=140446370329088
>> >>>>
>> >>>> Issue: the pod is failing in the Init state.
>> >>>> Events:
>> >>>> Type     Reason       Age                  From     Message
>> >>>> ----     ------       ----                 ----     -------
>> >>>> Warning  FailedMount  96s (x508 over 24h)  kubelet
>> MountVolume.MountDevice
>> >>>> failed for volume "pvc-3a2048f1" : rpc error: code = Internal desc =
>> rbd image
>> >>>> k8s-rgnl-disks/csi-vol-945c6a66-9129 is still being used
>> >>>>
>> >>>> It shows above client, but there is no such volume…
>> >>>>
>> >>>> Another similar issue… on the dashboard…
>> >>>>
>> >>>> CephNodeDiskspaceWarning
>> >>>> Mountpoint /mnt/dst-volume on sea-prod-host01 will be full in less
>> than 5 days
>> >>>> based on the 48 hour trailing fill rate.
>> >>>>
>> >>>> Whereas nothing is mounted: I mapped one image yesterday using rbd map
>> >>>> and then unmapped and unmounted everything, but it has been more than
>> >>>> 12 hours now and it is still showing the message.
>> >>>>
>> >>>>
>> >>>> CEPH version: 18.2.4
>> >>>>
>> >>>> Regards
>> >>>> Dev
>> >>>>
>> >>>>
>> >>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



