On Fri, 1 Feb 2019, Sage Weil wrote:
> On Fri, 1 Feb 2019, Yury Z wrote:
> > On Thu, 31 Jan 2019 23:27:21 +0000 (UTC)
> > Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >
> > > On Thu, 31 Jan 2019, Sage Weil wrote:
> > > > On Thu, 31 Jan 2019, Yury Z wrote:
> > > > > Hi,
> > > > >
> > > > > We've experimented with running OSDs in docker containers, and
> > > > > hit a situation where two OSDs were started with the same block
> > > > > device. The file locks inside the mounted osd dir didn't catch
> > > > > the issue because the mounted osd dirs were inside the
> > > > > containers. As a result we got a corrupted osd_superblock on
> > > > > the osd bluestore drive, and now the OSD can't be started.
> > > >
> > > > AHA! Someone else ran into this and it was a mystery to me how
> > > > this happened. How did you identify locks as the culprit? And
> > > > can you describe the situation that led to two competing
> > > > containers running ceph-osd?
> > >
> > > I looked into this a bit and I'm not sure competing docker
> > > containers explain the issue. The bluestore code takes an fcntl
> > > lock on the block device when it opens it, before doing anything
> > > at all, and I *think* those should work just fine across container
> > > boundaries.
> >
> > As far as I can see, the bluestore code takes an fcntl lock on the
> > "fsid" file inside the osd dir, not on the block device (the
> > BlueStore::_lock_fsid method). In our case we have the same block
> > device but different osd dirs for each ceph-osd docker container,
> > so they can't detect each other and prevent simultaneous rw
> > operations on the same block device.
>
> KernelDevice.cc *also* takes a lock on the block device itself, which
> should be the same inode across any containers. I'm trying to figure
> out why that lock isn't working, though... :/

Okay, Jeff helped me figure out the problem... it's an annoying
property of POSIX locks that closing *any* fd on the file drops all
locks. Here's a PR that fixes the bluestore locking to use flock(2)
instead:

  https://github.com/ceph/ceph/pull/26245

Now, for your broken OSDs, I have a patch that adds an option,
bluestore_ignore_data_csum. There's a backport to luminous on top of
12.2.8 pushed to https://shaman.ceph.com/builds/ceph/ that should spit
out packages for you in about an hour.

Please do *NOT* try starting the OSD from this package (yet), as the
patch isn't tested and we don't know how severe the damage is.
Instead, take the ceph-bluestore-tool and use it to run an fsck on one
or more of your broken OSDs and collect a log, like so:

  CEPH_ARGS="--bluestore-ignore-data-csum" ceph-bluestore-tool fsck \
      --path /path/to/osd \
      -l log --log-level 20 --deep 1

and then please share the log.

Thanks!
sage
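
A minimal standalone sketch of the fcntl pitfall described above, in
case it helps illustrate the failure mode. This is not Ceph code; the
path and program are made up, and error checking is omitted:

/*
 * Closing *any* fd on a file releases all of the process's POSIX
 * (fcntl) locks on it, even locks taken through a different fd.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/lock-demo";   /* hypothetical path */

    int fd1 = open(path, O_RDWR | O_CREAT, 0600);
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    fcntl(fd1, F_SETLK, &fl);        /* take an exclusive lock via fd1 */

    int fd2 = open(path, O_RDWR);
    close(fd2);                      /* closing this unrelated fd silently
                                        releases the lock held via fd1 */

    if (fork() == 0) {               /* a second process can now "win" */
        int fd = open(path, O_RDWR);
        struct flock fl2 = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int r = fcntl(fd, F_SETLK, &fl2);
        printf("child lock attempt: %s\n", r == 0 ? "succeeded" : "blocked");
        _exit(0);
    }
    wait(NULL);
    return 0;
}

flock(2) locks, by contrast, belong to the open file description, so
closing an unrelated fd on the same file does not release them, which
is what the PR above switches the bluestore locking to.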