Re: Is it possible to fix corrupted osd superblock?

On Fri, 1 Feb 2019 17:56:37 +0000 (UTC)
Sage Weil <sage@xxxxxxxxxxxx> wrote:

> On Fri, 1 Feb 2019, Sage Weil wrote:
> > On Fri, 1 Feb 2019, Yury Z wrote:  
> > > On Thu, 31 Jan 2019 23:27:21 +0000 (UTC)
> > > Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >   
> > > > On Thu, 31 Jan 2019, Sage Weil wrote:  
> > > > > On Thu, 31 Jan 2019, Yury Z wrote:    
> > > > > > Hi,
> > > > > > 
> > > > > > We've experimented with running OSDs in docker containers
> > > > > > and hit a situation where two OSDs were started with the
> > > > > > same block device. The file locks inside the mounted osd
> > > > > > dir didn't catch the issue because the mounted osd dirs
> > > > > > were inside the containers. So we got a corrupted
> > > > > > osd_superblock on the OSD's bluestore drive, and now the
> > > > > > OSD can't be started.
> > > > > 
> > > > > AHA!  Someone else ran into this and it was a mystery to me
> > > > > how this happened.  How did you identify locks as the
> > > > > culprit?  And can you describe the situation that led to two
> > > > > competing containers running ceph-osd?    
> > > > 
> > > > I looked into this a bit and I'm not sure competing docker
> > > > containers explain the issue.  The bluestore code takes an
> > > > fcntl lock on the block device when it opens it, before doing
> > > > anything at all, and I *think* those locks should work just
> > > > fine across container boundaries.
> > > 
> > > As far as I can see, the bluestore code takes an fcntl lock on
> > > the "fsid" file inside the osd dir, not on the block device
> > > (see the BlueStore::_lock_fsid method). In our case we have the
> > > same block device but a different osd dir for each ceph-osd
> > > docker container, so they can't detect each other and prevent
> > > simultaneous read/write operations on the same block device.
> > 
> > KernelDevice.cc *also* takes a lock on the block device itself,
> > which should be the same inode across any containers.  I'm trying
> > to figure out why that lock isn't working, though... :/
> 
> Okay, Jeff helped me figure out the problem... it's an annoying
> property of POSIX locks that closing *any* fd on a file drops all
> of the locks the process holds on it.  Here's a PR that fixes the
> bluestore locking to use flock(2) instead:
> 
> 	https://github.com/ceph/ceph/pull/26245
> 
> Now, for your broken OSDs, I have a patch that adds an option 
> bluestore_ignore_data_csum.  There's a backport to luminous on top of 
> 12.2.8 that's pushed to https://shaman.ceph.com/builds/ceph/ that
> should spit out packages for you in about an hour.
> 
> Please do *NOT* try starting the OSD from this package (yet) as the
> patch isn't tested and we don't know how severe the damage is.
> 
> Instead, take the ceph-bluestore-tool and use it to run an fsck on
> one or more of your broken OSDs and collect a log, like so:
> 
> CEPH_ARGS="--bluestore-ignore-data-csum" ceph-bluestore-tool fsck \
> 	--path /path/to/osd \
> 	-l log --log-level 20 --deep 1
> 
> and then please share the log.  Thanks!
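
As a side note for anyone following along: the fsid-lock pattern
described above boils down to something like the minimal C sketch
below. This is illustrative only, not the actual Ceph code (the real
implementation is BlueStore::_lock_fsid; the lock_fsid helper here is
hypothetical). It shows why per-container osd dirs defeat the lock:
each daemon locks its own <osd_dir>/fsid inode, so two containers
with different osd dirs never contend, even when they share one
block device.

/*
 * Minimal illustrative sketch (not Ceph code) of the fsid-file
 * locking pattern described above.  Each daemon locks
 * <osd_dir>/fsid, so two containers with different osd dirs lock
 * different inodes and never see each other, even on the same
 * block device.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical helper: returns the held fd, or -1 on failure */
int lock_fsid(const char *osd_dir)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/fsid", osd_dir);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    if (fcntl(fd, F_SETLK, &fl) < 0) {  /* another process holds it */
        close(fd);
        return -1;
    }
    return fd;  /* keep the fd open to hold the lock */
}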
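
The drop-on-close pitfall itself is easy to reproduce standalone; a
minimal sketch (again illustrative only, unrelated to the actual
patch):

/*
 * Standalone demo of the POSIX-lock property described above:
 * closing *any* fd on a file drops every fcntl lock the process
 * holds on that file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *p = "/tmp/lockdemo";
    int fd = open(p, O_RDWR | O_CREAT, 0600);
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    fcntl(fd, F_SETLK, &fl);        /* take an exclusive lock */

    int fd2 = open(p, O_RDONLY);    /* open a second, unrelated fd... */
    close(fd2);                     /* ...closing it drops the lock! */

    if (fork() == 0) {              /* a separate process tries to lock */
        int cfd = open(p, O_RDWR);
        struct flock cfl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        if (fcntl(cfd, F_SETLK, &cfl) == 0)
            printf("got the lock -- the first lock was silently lost\n");
        _exit(0);
    }
    wait(NULL);
    return 0;
}

flock(2) locks, by contrast, are tied to the open file description
rather than the process, so closing an unrelated fd on the same file
does not release them -- which is why the PR above switches to flock.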

I got the packages with the bluestore_ignore_data_csum option
from /r/ceph/wip-bluestore-disable-csum-luminous/58cb3ecc36afd83048e595976bd733ac2b863a26/ubuntu/bionic/flavors/default/

1) Ran ceph-bluestore-tool. It finished successfully and produced
the same results/logs from run to run.

CEPH_ARGS="--bluestore-ignore-data-csum" ceph-bluestore-tool fsck \
	--path /var/lib/ceph/osd/ceph-74/ \
	-l ceph-blustore-tool-ignore-csum.log \
	--log-level 20 --deep 1 \
	> ceph-bluestore-tool-ignore-csum.stdout.log 2>&1

2) Ran ceph-osd. It crashed, with the same results/logs from run to
run.

CEPH_ARGS="--bluestore-ignore-data-csum" /usr/bin/ceph-osd -d \
	--debug_osd 20 --debug_bdev 20 \
	--debug_bluefs 20 --debug_bluestore 20 \
	--cluster ceph --id 74 \
	> ceph-osd-ignore-csum.log 2>&1

I've shared all the log files on my Google Drive:
https://drive.google.com/drive/folders/1qSf7DnIi0srDY-1IIX3KsmrgmV5Kwens?usp=sharing

The data on the broken OSDs is not very important to us; we don't
need to recover it at any cost.

Preventing such OSD failures is much more important. We are going to
test your new PR in our container setup.

I can provide any further information, logs, etc. from our setup if
you need it for the investigation.

Thank you for your help!


