> On Feb 18, 2020, at 4:11 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Fri, 2020-02-14 at 07:13 -0800, Yiming Zhang wrote:
>>> On Feb 13, 2020, at 3:52 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>
>>> If the OSD daemon dies, then it will have closed all of its fd's and
>>> there should be no more lock. Therefore you almost certainly have some
>>> other process running that is holding the lock.
>>>
>>> You may have to do a bit of digging in /proc/locks. Determine the
>>> dev+inode number of the file on which the lock is being set and find it
>>> in /proc/locks. Then you can track down the PID that's holding that
>>> lock.
>>>
>> I have checked the locks with lslocks, here is the locks when I vstarted ceph (bluestore block = /dev/sdc where sdc is a raw device):
>> COMMAND    PID TYPE  SIZE MODE  M START END PATH
>> ceph-mgr 19852 POSIX      WRITE 0     0   0 /...
>> iscsid    1061 POSIX      WRITE 0     0   0 /run...
>> ceph-mgr 14889 POSIX      WRITE 0     0   0 /...
>> rpcbind    990 FLOCK      WRITE 0     0   0 /run...
>> ceph-mon 16430 POSIX      WRITE 0     0   0 /...
>> ceph-mon 16430 POSIX      WRITE 0     0   0 /...
>> ceph-mon 18107 POSIX      WRITE 0     0   0 /...
>> ceph-mon 18107 POSIX      WRITE 0     0   0 /...
>> ceph-mon 19711 POSIX      WRITE 0     0   0 /...
>> ceph-mon 19711 POSIX      WRITE 0     0   0 /...
>> ceph-mon 10495 POSIX      WRITE 0     0   0 /...
>> ceph-mon 10495 POSIX      WRITE 0     0   0 /...
>> ceph-mon 14748 POSIX      WRITE 0     0   0 /...
>> ceph-mon 14748 POSIX      WRITE 0     0   0 /...
>> cron      1085 FLOCK      WRITE 0     0   0 /run...
>> ceph-mgr 18247 POSIX      WRITE 0     0   0 /...
>> atd       1111 POSIX      WRITE 0     0   0 /run...
>> lvmetad    807 POSIX      WRITE 0     0   0 /run...
>> ceph-mgr 10635 POSIX      WRITE 0     0   0 /...
>> ceph-mgr 16571 POSIX      WRITE 0     0   0 /…
>>
>> Then I kill all related processes and restart cluster, the error “_lock flock failed on /users/xxx/ceph/build/dev/osd0/block” persists.
>>
>> After the kill, locks are:
>> COMMAND    PID TYPE  SIZE MODE  M START END PATH
>> rpcbind  20267 FLOCK      WRITE 0     0   0 /run...
>> lvmetad  20266 POSIX      WRITE 0     0   0 /run…
>>
>> The error happens in KernelDevice.cc:
>> int r = ::flock(fd_directs[WRITE_LIFE_NOT_SET], LOCK_EX | LOCK_NB);
>> Where r gives -1, and fd_directs[WRITE_LIFE_NOT_SET] will give 11, and WRITE_LIFE_NOT_SET is 0.
>>
>> Any suggestions how to proceed with the issue?
>>
>
> Sorry, no. Any lock set on a block device should show up in /proc/locks
> (as it uses the kernel's generic flock lock mechanism for local
> filesystems).
>
> You may want to play with strace and verify that the error is coming
> from the kernel and that the program is attempting to set the lock on
> the file you think it is.
>
> What kernel is this running on?

The kernel is 4.15.0-70-generic (I also have the same issue on another kernel, 4.15.18-041518-generic).

I used strace to track the issue, and it led to this particular function, _lock, in KernelDevice (`r = _lock();` in the KernelDevice::open function). If I comment it out, the error goes away, but that is not a fix. Maybe there is a bug here. I’ll keep digging into this.

Thanks,
-ym

> --
> Jeff Layton <jlayton@xxxxxxxxxx>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
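
The lookup Jeff describes earlier in the thread (determine the dev+inode of the file being locked, find it in /proc/locks, then read off the PID) could be sketched in Python roughly as follows. This is only a minimal sketch and not something posted in the thread: the helper names lock_key, parse_locks and holders are hypothetical, and it assumes the MAJ:MIN:INODE field of /proc/locks matches st_dev and st_ino as returned by stat(2), which may not hold on every filesystem.

```python
#!/usr/bin/env python3
"""Sketch: list the PIDs that /proc/locks reports as holding a lock on a path."""
import os
import sys


def lock_key(path):
    """Build the MAJ:MIN:INODE string for a path (hex major/minor, decimal inode).

    Assumption: st_dev/st_ino from stat(2) identify the same inode that the
    kernel prints in /proc/locks. os.stat follows symlinks, so a link such as
    osd0/block resolves to the device node it points at.
    """
    st = os.stat(path)
    return "%02x:%02x:%d" % (os.major(st.st_dev), os.minor(st.st_dev), st.st_ino)


def parse_locks(text):
    """Yield (lock_type, mode, pid, key) for each entry in /proc/locks text."""
    for line in text.splitlines():
        fields = line.split()
        # Blocked waiters are printed with an extra "->" after the entry id.
        if len(fields) > 1 and fields[1] == "->":
            del fields[1]
        if len(fields) < 8:
            continue  # not a lock entry we understand
        # Layout: id, type, ADVISORY/MANDATORY, mode, pid, MAJ:MIN:INODE, start, end
        yield fields[1], fields[3], int(fields[4]), fields[5]


def holders(path, text):
    """Return [(lock_type, mode, pid)] for entries whose key matches the path."""
    key = lock_key(path)
    return [(t, m, pid) for t, m, pid, k in parse_locks(text) if k == key]


if __name__ == "__main__" and len(sys.argv) > 1:
    with open("/proc/locks") as f:
        locks_text = f.read()
    for lock_type, mode, pid in holders(sys.argv[1], locks_text):
        print(lock_type, mode, pid)
```

Run with the path from the error message as its only argument, for example /users/xxx/ceph/build/dev/osd0/block. If nothing is printed, /proc/locks lists no lock on that inode, which would match the lslocks output above.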