Re: rbd snap create not working and just hangs forever

On Fri, Apr 23, 2021 at 1:12 PM Boris Behrens <bb@xxxxxxxxx> wrote:
>
>
>
> Am Fr., 23. Apr. 2021 um 13:00 Uhr schrieb Ilya Dryomov <idryomov@xxxxxxxxx>:
>>
>> On Fri, Apr 23, 2021 at 12:46 PM Boris Behrens <bb@xxxxxxxxx> wrote:
>> >
>> >
>> >
>> > Am Fr., 23. Apr. 2021 um 12:16 Uhr schrieb Ilya Dryomov <idryomov@xxxxxxxxx>:
>> >>
>> >> On Fri, Apr 23, 2021 at 12:03 PM Boris Behrens <bb@xxxxxxxxx> wrote:
>> >> >
>> >> >
>> >> >
>> >> > Am Fr., 23. Apr. 2021 um 11:52 Uhr schrieb Ilya Dryomov <idryomov@xxxxxxxxx>:
>> >> >>
>> >> >>
>> >> >> This snippet confirms my suspicion.  Unfortunately without a verbose
>> >> >> log from that VM from three days ago (i.e. when it got into this state)
>> >> >> it's hard to tell what exactly went wrong.
>> >> >>
>> >> >> The problem is that the VM doesn't consider itself to be the rightful
>> >> >> owner of the lock and so when "rbd snap create" requests the lock from
>> >> >> it in order to make a snapshot, the VM just ignores the request because
>> >> >> even though it owns the lock, its record appears to be out of sync.
>> >> >>
>> >> >> I'd suggest kicking it by restarting osd36.  If the VM is active, it
>> >> >> should reacquire the lock and hopefully update its internal record as
>> >> >> expected.  If "rbd snap create" still hangs after that, it would mean
>> >> >> that we have a reproducer and can gather logs on the VM side.
>> >> >>
>> >> >> What version of qemu/librbd and ceph is in use (both on the VM side and
>> >> >> on the side where you are running "rbd snap create")?
>> >> >>
>> >> > I just stopped the OSD, waited a few seconds, and started it again.
>> >> > I still can't create snapshots.
>> >> >
>> >> > Ceph version is 14.2.18 across the board
>> >> > qemu is 4.1.0-1
>> >> > as we use krbd, the kernel version is 5.2.9-arch1-1-ARCH
>> >> >
>> >> > How can I gather more logs to debug it?
>> >>
>> >> Are you saying that this image is mapped and the lock is held by the
>> >> kernel client?  It doesn't look that way from the logs you shared.
>> >
>> > We use krbd instead of librbd (at least that's my understanding), but qemu is doing the kvm/rbd stuff.
>>
>> I'm going to assume that by "qemu is doing the kvm/rbd stuff", you
>> mean that you are using the librbd driver inside qemu and that this
>> image is opened by qemu (i.e. that driver).  If you don't know what
>> access method is being used, debugging this might be challenging ;)
>>
>> Let's start with the same output: "rbd lock ls", "rbd status" and "rbd
>> snap create --debug-ms=1 --debug-rbd=20".  It should be different after
>> osd36 was restarted.
>
> Here is the new one: https://pastebin.com/6qTsJK6W
> Ah OK, this compute node still has the old setup and uses librbd to access rbd instead of krbd.
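
For reference, the diagnostics requested above boil down to commands
like these (a minimal sketch; "rbd/vm-disk" and the snapshot name are
illustrative placeholders, not taken from the thread):

  $ rbd lock ls rbd/vm-disk
  $ rbd status rbd/vm-disk
  $ rbd snap create --debug-ms=1 --debug-rbd=20 rbd/vm-disk@test-snap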

Sorry, I forgot that simply restarting the OSD doesn't trigger the
code path that I'm hoping would cause librbd inside the VM to update
its state.  I took a look at the code and I think there are a few
ways to do it (listed in order of preference):

- cut the network between the VM and the cluster for more than 30
  seconds; it should be done externally so that to the VM it looks
  like a long network blip (see the sketch after this list)

- stop the VM process for more than 30 seconds

  $ PID=<pid of qemu-system-x86_64 process>
  $ kill -STOP $PID && sleep 40 && kill -CONT $PID

- stop the osd36 process for more than 30 seconds with "nodown" flag
  set

  $ ceph osd set nodown
  $ PID=<pid of osd36 process>
  $ kill -STOP $PID && sleep 40 && kill -CONT $PID
  $ ceph osd unset nodown
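
For the first option, a minimal sketch of cutting the network from the
hypervisor with iptables (the 10.0.0.0/24 subnet is an assumed example
for the Ceph public network; adjust to your addresses, and note that
this blocks Ceph traffic for the whole host, not just this one VM):

  $ # drop all traffic to and from the Ceph public network (assumed subnet)
  $ iptables -I OUTPUT -d 10.0.0.0/24 -j DROP
  $ iptables -I INPUT -s 10.0.0.0/24 -j DROP
  $ sleep 40
  $ # remove the rules to restore connectivity
  $ iptables -D OUTPUT -d 10.0.0.0/24 -j DROP
  $ iptables -D INPUT -s 10.0.0.0/24 -j DROP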

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


