Re: getting rid of incomplete pg errors

Hi, 

I had looked at the output of `ceph health detail`, which told me to
search for 'incomplete' in the docs.
Since those said to file a bug (and I was sure that filing a bug would
not help), I went on and purged the disks that we had overwritten. Ceph
then did some magic and told me that the PGs were again available on
three OSDs, but still incomplete.
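
For reference, the cleanup went roughly like this (the OSD ID below is
just an example, not one from our cluster):

    # see which PGs are stuck and why
    ceph health detail
    ceph pg ls incomplete
    # remove an overwritten OSD from the cluster entirely
    ceph osd purge 12 --yes-i-really-mean-it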

I have now gone ahead and marked all three OSDs that hold one of my
incomplete PGs (according to `ceph pg ls incomplete`) as lost, one by
one, waiting for `ceph status` to settle in between, and that led to
the PG now being incomplete on three different OSDs.
Also, `force-create-pg` tells me "already created".
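
In case the exact sequence matters, this is approximately what I ran
(the OSD and PG IDs here are placeholders):

    # find the acting OSDs of an incomplete PG
    ceph pg ls incomplete
    # mark one of them as lost, then wait for peering to settle
    ceph osd lost 4 --yes-i-really-mean-it
    ceph -s
    # finally, try to recreate the PG from scratch -- this is the step
    # that only answers "already created"
    ceph osd force-create-pg 1.2b --yes-i-really-mean-it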


On 29.01.2020, Gregory Farnum wrote:
> There should be docs on how to mark an OSD lost, which I would expect to be
> linked from the troubleshooting PGs page.
> 
> There is also a command to force create PGs but I don’t think that will
> help in this case since you already have at least one copy.
> 
> On Tue, Jan 28, 2020 at 5:15 PM Hartwig Hauschild <ml-ceph@xxxxxxxxxxxx>
> wrote:
> 
> > Hi.
> >
> > Before I descend into what happened and why: I'm talking about a
> > test cluster, so I don't really care about the data in this case.
> >
> > We've recently started upgrading from Luminous to Nautilus, and for us
> > that means we're retiring ceph-disk in favour of ceph-volume with lvm
> > and dmcrypt.
> >
> > Our setup is in containers and we've got the DBs separated from the data.
> > When testing our upgrade path we discovered that running the host on
> > ubuntu-xenial and the containers on centos-7.7 leads to LVM inside the
> > containers not using lvmetad because it's too old. That in turn means
> > that not running `vgscan --cache` on the host before adding an LV to a
> > VG essentially zeros the metadata for all LVs in that VG (see the
> > sketch below the quoted mails).
> >
> > That happened on two out of three hosts for a bunch of OSDs, and those
> > OSDs are gone. I have no way of getting them back; they've been
> > overwritten multiple times while we tried to figure out what went wrong.
> >
> > So now I have a cluster that's got 16 PGs in 'incomplete', 14 of them
> > with 0 objects and 2 with about 150 objects each.
> >
> > I have found a couple of howtos that tell me to use
> > ceph-objectstore-tool to find the PGs on the active OSDs and I've
> > given that a try, but ceph-objectstore-tool always tells me it can't
> > find the PG I am looking for (the calls are sketched below the
> > quoted mails).
> >
> > Can I tell Ceph to re-init the PGs, or do I have to delete the pools
> > and recreate them?
> >
> > There's no data in there that I can't get back; I just don't feel like
> > scrapping and redeploying the whole cluster.
> >
> >
> > --
> > Cheers,
> >         Hardy

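PS, for the archives: the lvmetad workaround we've settled on looks
roughly like this (the VG/LV names are made up):

    # on the xenial host: refresh lvmetad before any change to the VG;
    # otherwise the old LVM inside the centos-7.7 container operates on
    # stale metadata and clobbers the other LVs in the VG
    vgscan --cache
    # only then create the new LV (here: a separate DB device for an OSD)
    lvcreate -n osd-db-23 -L 30G ceph-db-vg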
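And the ceph-objectstore-tool calls I mentioned, roughly (data path and
PG ID are examples; the OSD has to be stopped for this):

    # list all PGs this OSD holds data for
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --op list-pgs
    # query one PG directly -- this is the call that never finds the PG
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --pgid 1.2b --op info
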
-- 
Cheers,
	Hardy
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx