Re: Fwd: PrimaryLogPG.cc: 11550: FAILED ceph_assert(head_obc)

Quick update in case anyone reads my previous post.

No ideas were forthcoming on how to fix the assert that was flapping the
OSD (caused by deleting unfound objects).
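
For reference, the crash module should also record the repeated aborts
(assuming the ceph-crash agent is running on the OSD host):

# ceph crash ls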

The affected pg was readable, so we decided to recycle the OSD...

destroy the flapping primary OSD
# ceph osd destroy 443 --force

purge the lvm entry for this disk
# lvremove /dev/ceph-64b0010b-e397-49c2-ab01-6e43e6e5b41a/osd-block-fb824e45-d35f-486c-a4ca-05e5937eceae

zap the disk; it's the only way to be sure...
# ceph-volume lvm zap /dev/sdab

reuse the drive & OSD number
# ceph-volume lvm prepare --osd-id 443 --data /dev/sdab

activate the OSD
# ceph-volume lvm activate 443 6e252371-d158-4d16-ac31-fed8f7d0cb1f
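
sanity check that the rebuilt OSD came back up and in (standard
commands, nothing exotic)
# ceph osd tree | grep osd.443
# ceph osd metadata 443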

Now watching to see if the cluster recovers...
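
i.e. polling the usual status commands:

# watch -n 10 ceph -s
# ceph health detail | grep -i unfound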

best,

Jake

On 2/10/20 3:31 PM, Jake Grimmett wrote:
> Dear All,
> 
> Following a clunky* cluster restart, we had
> 
> 23 "objects unfound"
> 14 pg recovery_unfound
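> 
> (for reference, the standard commands to enumerate these, e.g. for pg
> 5.f2f used below:
> # ceph health detail
> # ceph pg 5.f2f list_unfound
> )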
> 
> We could see no way to recover the unfound objects, so we decided to
> mark the unfound objects in one pg as lost...
> 
> [root@ceph1 bad_oid]# ceph pg 5.f2f mark_unfound_lost delete
> pg has 2 objects unfound and apparently lost marking
> 
> Unfortunately, this immediately crashed the primary OSD for this PG:
> 
> OSD log showing the OSD crashing 3 times here: <http://p.ip.fi/gV8r>
> 
> the assert was :>
> 
> 2020-02-10 13:38:45.003 7fa713ef3700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
> In function 'int PrimaryLogPG::recover_missing(const hobject_t&,
> eversion_t, int, PGBackend::RecoveryHandle*)' thread 7fa713ef3700 time
> 2020-02-10 13:38:45.000875
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
> 11550: FAILED ceph_assert(head_obc)
> 
> 
> Questions..
> 
> 1) Is it possible to recover the flapping OSD, or should we fail it
> out and hope the cluster recovers?
> 
> 2) We have 13 other pgs with unfound objects. Do we need to run
> mark_unfound_lost on these one pg at a time, and then fail out each
> primary OSD? (allowing the cluster to recover before marking the next
> pg & failing its primary OSD)
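> 
> (if we do have to iterate, an untested sketch to list the affected pgs
> and their mappings - the first OSD in each "acting" set being the
> primary:
> # ceph pg ls recovery_unfound | awk '/^[0-9]/ {print $1}' \
>     | while read pg; do ceph pg map $pg; done
> )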
> 
> 
> 
> * thread describing the bad restart :>
> <https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/IRKCDRRAH7YZEVXN5CH4JT2NH4EWYRGI/#IRKCDRRAH7YZEVXN5CH4JT2NH4EWYRGI>
> 
> many thanks!
> 
> Jake
> 


-- 
Dr Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


