Re: Is it possible to fix corrupted osd superblock?

On Fri, 1 Feb 2019, Sage Weil wrote:
> On Fri, 1 Feb 2019, Yury Z wrote:
> > On Thu, 31 Jan 2019 23:27:21 +0000 (UTC)
> > Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > 
> > > On Thu, 31 Jan 2019, Sage Weil wrote:
> > > > On Thu, 31 Jan 2019, Yury Z wrote:  
> > > > > Hi,
> > > > > 
> > > > > We've experimented with running OSDs in docker containers, and
> > > > > ran into a situation where two OSDs were started with the same
> > > > > block device. The file locks inside the mounted osd dir didn't
> > > > > catch the issue because the mounted osd dirs were inside
> > > > > containers. So we ended up with a corrupted osd_superblock on
> > > > > the osd bluestore drive, and now the OSD can't be started.  
> > > > 
> > > > AHA!  Someone else ran into this and it was a mystery to me how
> > > > this happened.  How did you identify locks as the culprit?  And can
> > > > you describe the situation that led to two competing containers
> > > > running ceph-osd?  
> > > 
> > > I looked into this a bit and I'm not sure competing docker containers 
> > > explains the issue.  The bluestore code takes a fcntl lock on the
> > > block device when it opens it before doing anything at all, and I
> > > *think* those should work just fine across the container boundaries.
> > 
> > As far as I can see, the bluestore code takes an fcntl lock on the
> > "fsid" file inside the osd dir, not on the block device (see the
> > BlueStore::_lock_fsid method). In our case, we have the same block
> > device but different osd dirs for each ceph-osd docker container, so
> > they can't detect each other and prevent simultaneous rw operations
> > on the same block device.
> 
> The KernelDevice.cc *also* takes a lock on the block device itself, 
> which should be the same inode across any containers.  I'm trying to 
> figure out why that lock isn't working, though... :/

Okay, Jeff helped me figure out the problem... it's an annoying property 
of POSIX locks that closing *any* fd referring to a file drops all of the 
process's locks on that file.  Here's a PR that fixes the bluestore 
locking to use flock(2) instead:

	https://github.com/ceph/ceph/pull/26245
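
For illustration, here is a minimal standalone sketch of that property 
(this is not Ceph code: the temp file path stands in for the OSD block 
device, and the fork()ed "second opener" stands in for the competing 
container's ceph-osd).  A POSIX lock held via one fd is silently dropped 
as soon as the same process closes any other fd for the same file:

// Minimal sketch of the POSIX-lock pitfall; build with g++ on Linux.
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Try to take an exclusive, non-blocking fcntl (POSIX) lock on the whole file.
static bool try_posix_lock(int fd) {
  struct flock fl{};
  fl.l_type = F_WRLCK;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;                        // 0 means "lock the whole file"
  return fcntl(fd, F_SETLK, &fl) == 0;
}

int main() {
  const char *path = "/tmp/fake-osd-block";   // stand-in for the block device
  int pipefd[2];
  pipe(pipefd);

  if (fork() == 0) {                   // stand-in for the second container
    char c;
    read(pipefd[0], &c, 1);            // wait until the parent says go
    int fd = open(path, O_RDWR);
    printf("second opener: lock %s\n",
           try_posix_lock(fd) ? "ACQUIRED (device is unprotected!)"
                              : "refused (device is protected)");
    _exit(0);
  }

  int locked_fd = open(path, O_RDWR | O_CREAT, 0600);
  try_posix_lock(locked_fd);           // the OSD claims the device

  // Some other code path in the same process opens and closes the device.
  // Per POSIX, this close() releases *all* of this process's fcntl locks
  // on the file, even though locked_fd is still open.
  int probe_fd = open(path, O_RDONLY);
  close(probe_fd);

  write(pipefd[1], "g", 1);            // let the "second container" try
  wait(nullptr);
  return 0;
}

With flock(2) instead, the lock belongs to the open file description 
behind locked_fd, so the probe close() leaves it in place and the second 
opener's flock(fd, LOCK_EX | LOCK_NB) is refused with EWOULDBLOCK.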

Now, for your broken OSDs, I have a patch that adds an option 
bluestore_ignore_data_csum.  There's a backport to luminous on top of 
12.2.8 that has been pushed to https://shaman.ceph.com/builds/ceph/ and 
should spit out packages for you in about an hour.

Please do *NOT* try starting the OSD from this package (yet) as the patch 
isn't tested and we don't know how severe the damage is.

Instead, take the ceph-bluestore-tool and use it to run an fsck on one or 
more of your broken OSDs and collect a log, like so:

CEPH_ARGS="--bluestore-ignore-data-csum" ceph-bluestore-tool fsck \
	--path /path/to/osd \
	-l log --log-level 20 --deep 1

and then please share the log.  Thanks!
sage


