Re: flock is held after ceph-osd daemon being stopped

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 18 Feb 2020 07:11:33 -0500

On Fri, 2020-02-14 at 07:13 -0800, Yiming Zhang wrote:
> > On Feb 13, 2020, at 3:52 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > If the OSD daemon dies, then it will have closed all of its fd's and
> > there should be no more lock. Therefore you almost certainly have some
> > other process running that is holding the lock.
> > 
> > You may have to do a bit of digging in /proc/locks. Determine the
> > dev+inode number of the file on which the lock is being set and find it
> > in /proc/locks. Then you can track down the PID that's holding that
> > lock.
> > 
> I have checked the locks with lslocks. Here are the locks after I started ceph with vstart (bluestore block = /dev/sdc, where sdc is a raw device):
> COMMAND           PID  TYPE SIZE MODE  M START END PATH
> ceph-mgr        19852 POSIX      WRITE 0     0   0 /...
> iscsid           1061 POSIX      WRITE 0     0   0 /run...
> ceph-mgr        14889 POSIX      WRITE 0     0   0 /...
> rpcbind           990 FLOCK      WRITE 0     0   0 /run...
> ceph-mon        16430 POSIX      WRITE 0     0   0 /...
> ceph-mon        16430 POSIX      WRITE 0     0   0 /...
> ceph-mon        18107 POSIX      WRITE 0     0   0 /...
> ceph-mon        18107 POSIX      WRITE 0     0   0 /...
> ceph-mon        19711 POSIX      WRITE 0     0   0 /...
> ceph-mon        19711 POSIX      WRITE 0     0   0 /...
> ceph-mon        10495 POSIX      WRITE 0     0   0 /...
> ceph-mon        10495 POSIX      WRITE 0     0   0 /...
> ceph-mon        14748 POSIX      WRITE 0     0   0 /...
> ceph-mon        14748 POSIX      WRITE 0     0   0 /...
> cron             1085 FLOCK      WRITE 0     0   0 /run...
> ceph-mgr        18247 POSIX      WRITE 0     0   0 /...
> atd              1111 POSIX      WRITE 0     0   0 /run...
> lvmetad           807 POSIX      WRITE 0     0   0 /run...
> ceph-mgr        10635 POSIX      WRITE 0     0   0 /...
> ceph-mgr        16571 POSIX      WRITE 0     0   0 /...
> 
> Then I killed all related processes and restarted the cluster, but the error “_lock flock failed on /users/xxx/ceph/build/dev/osd0/block” persists.
> 
> After the kill, the locks are:
> COMMAND           PID  TYPE SIZE MODE  M START END PATH
> rpcbind         20267 FLOCK      WRITE 0     0   0 /run...
> lvmetad         20266 POSIX      WRITE 0     0   0 /run...
> 
> The error happens in KernelDevice.cc:
> int r = ::flock(fd_directs[WRITE_LIFE_NOT_SET], LOCK_EX | LOCK_NB);
> Here r is -1, fd_directs[WRITE_LIFE_NOT_SET] is 11, and WRITE_LIFE_NOT_SET is 0.
> 
> Any suggestions on how to proceed with this issue?
> 

Sorry, no. Any lock set on a block device should show up in /proc/locks
(as it uses the kernel's generic flock lock mechanism for local
filesystems).
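
For what it's worth, the /proc/locks lookup can be scripted. What follows
is only a rough sketch, not anything from the Ceph tree: the helper names
parse_proc_locks and find_lock_holders are made up for illustration. It
assumes the usual /proc/locks line layout (ordinal, lock type,
advisory/mandatory, mode, PID, then MAJOR:MINOR:INODE with the device
numbers in hex), and that the entry for a device node is keyed on the
st_dev and st_ino of the node itself rather than on its st_rdev.

```python
import os


def parse_proc_locks(text):
    """Parse the text of /proc/locks into a list of dicts (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # A waiter blocked on a lock is printed with "->" after the ordinal.
        if len(fields) > 1 and fields[1] == "->":
            fields.pop(1)
        if len(fields) < 8:
            continue  # not a line in the expected layout
        major, minor, inode = fields[5].split(":")
        entries.append({
            "type": fields[1],                        # e.g. FLOCK or POSIX
            "mode": fields[3],                        # READ or WRITE
            "pid": int(fields[4]),                    # PID recorded for the lock
            "dev": (int(major, 16), int(minor, 16)),  # printed in hex
            "inode": int(inode),                      # printed in decimal
        })
    return entries


def find_lock_holders(path, locks_text):
    """Return the /proc/locks entries that refer to the inode behind path."""
    st = os.stat(path)  # follows symlinks, like open() does
    dev = (os.major(st.st_dev), os.minor(st.st_dev))
    return [e for e in parse_proc_locks(locks_text)
            if e["dev"] == dev and e["inode"] == st.st_ino]
```

You would feed it the contents of /proc/locks and the path the OSD is
trying to lock, e.g. find_lock_holders(path, open("/proc/locks").read()).
An empty list would mean the kernel has no lock recorded on that inode.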

You may want to play with strace and verify that the error is coming
from the kernel and that the program is attempting to set the lock on
the file you think it is.
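
Alongside strace, a small standalone test can tell you whether the kernel
itself refuses the lock on that path. This is a minimal sketch with a
made-up helper name (try_flock); it makes the same kind of non-blocking
exclusive flock() request as the line you quoted, but it is not the Ceph
code. Keep in mind that flock locks belong to the open file description,
so a second open() of the same file in the same process is refused just
as another process would be.

```python
import errno
import fcntl
import os


def try_flock(path, flags=os.O_RDONLY):
    """Open path and attempt a non-blocking exclusive flock (hypothetical helper).

    Returns (fd, 0) on success; the caller closes fd to drop the lock.
    Returns (None, errno) if the lock could not be taken.
    """
    fd = os.open(path, flags)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        return None, e.errno
    return fd, 0
```

If this returns errno.EAGAIN (the same value as EWOULDBLOCK on Linux)
while the OSD is stopped, something else still holds the lock. If it
succeeds, the conflict is more likely coming from within the OSD process
itself.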

What kernel is this running on?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>