Hello Jason,

I got some further hints. Please see below.

On 15.05.2017 at 22:25, Jason Dillaman wrote:
> On Mon, May 15, 2017 at 3:54 PM, Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>> Would it be possible that the problem is the same you fixed?
>
> No, I would not expect it to be related to the other issues you are
> seeing. The issue I just posted a fix against only occurs when a
> client requests the lock from the current owner, which will only occur
> under the following scenarios: (1) attempt to write to the image
> locked by another client, (2) attempt to disable image features on an
> image locked by another client, (3) demote a primary mirrored image
> when locked by another client, or (4) the rbd CLI attempted to perform
> an operation not supported by the currently running lock owner client
> due to version mismatch.

Ah OK. Hmm, none of those is anything I would expect here.

> I am assuming you are not running two VMs concurrently using the same
> backing RBD image, so that would eliminate possibility (1).

No, I do not.

I spent a lot of time analyzing the log files. What I can tell so far
is:

1.) It happens very often when we issue an fstrim command on the root
device of a VM. We're using the Qemu virtio-scsi backend with:
cache=writeback,aio=threads,detect-zeroes=unmap,discard=on
(a sketch of the full invocation is in the P.S. below)

2.) It also happens on other, so far unidentified operations - but
fstrim seems to trigger it most reliably.

3.) It happens once or twice a night while doing around 1500-2000
backups, so it looks like a race to me.

4.) It still happens on pre-Jewel images, even after they were
restarted / killed and reinitialized. In that case they now have the
asok socket available. Should I issue any command against the socket
to get logs out of the hanging VM? (My first guess is in the P.P.S.
below.) Qemu is still responding; just the Ceph / disk I/O is stalled.

Greets,
Stefan
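
P.S.: For reference, a minimal sketch of how such a disk is attached on
our side and how we trigger the trim. Pool, image and device names are
placeholders, not our actual configuration:

  # virtio-scsi controller plus an RBD-backed SCSI disk using the
  # cache/aio/discard/detect-zeroes options quoted above
  qemu-system-x86_64 ... \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=rbd:rbd/vm-disk-1,format=raw,if=none,id=drive0,cache=writeback,aio=threads,discard=on,detect-zeroes=unmap \
    -device scsi-hd,bus=scsi0.0,drive=drive0

  # inside the guest, the command that most reliably reproduces the hang
  fstrim -v /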
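P.P.S.: Regarding the socket: unless you suggest something else, my
first guess would be something like the following (the .asok path is
just an example; the real name contains the client id and pid):

  # list in-flight OSD requests of the librbd client - a stuck request
  # should show up here
  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests

  # raise client-side rbd logging on the live process
  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config set debug_rbd 20

  # dump the perf counters of the client
  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok perf dump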