Re: OSDs unable to mount BlueFS after reboot

On Wed, Sep 15, 2021 at 09:16:17PM +0200, Stefan Kooman wrote:
> On 9/15/21 21:02, Davíð Steinn Geirsson wrote:
> > Hi,
> > 
> > On Wed, Sep 15, 2021 at 08:39:11PM +0200, Stefan Kooman wrote:
> > > On 9/15/21 18:06, Davíð Steinn Geirsson wrote:
> > > > Just realised the debug paste I sent was for OSD 5 but the other info is for
> > > > OSD 0. They are both having the same issue, but for completeness sake here
> > > > is the debug output from OSD 0:
> > > > http://paste.debian.net/1211873/
> > > > 
> > > > All daemons in the cluster are running ceph pacific 16.2.5.
> > > 
> > > Can you increase debug level for the OSD, i.e. ceph config set osd.0
> > > debug_osd 20/20
> > > 
> > > And then restart the osd?
> > 
> > Sure, here is the output with 20/20:
> > https://paste.debian.net/1211886/
> > 
> > Only 3 lines added as far as I can tell:
> > 2021-09-15T18:44:03.289+0000 7fce2827af00  5 object store type is bluestore
> > [...]
> > 2021-09-15T18:44:05.673+0000 7fce2827af00  2 osd.0 0 init /var/lib/ceph/osd/ceph-0 (looks like hdd)
> > 2021-09-15T18:44:05.673+0000 7fce2827af00  2 osd.0 0 journal /var/lib/ceph/osd/ceph-0/journal
> > 
> > I tried again with debug_osd 99/99 (the maximum) and did not see any
> > additional messages.
> 
> Can you access this link: https://access.redhat.com/solutions/4939871 ?

Sadly no, my current employer has no active Red Hat subscription. But your
snippets gave me a good idea of the contents, thanks.

> 
> This would indicate rocksdb corruption. Can you fsck and repair the OSD?
> 
> # ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0/ --debug
> 
> # ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0/ --debug

Nope, both error out with:
```
root@janky:~# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0/ --debug
2021-09-15T21:42:54.616+0000 7faed8299240 -1 bluestore(/var/lib/ceph/osd/ceph-0) _open_db erroring opening db: 
repair failed: (5) Input/output error
```
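
In case it helps anyone hitting the same wall: the same fsck can be run with
the relevant debug subsystems turned up via CEPH_ARGS, which should at least
show where _open_db gives up. Untested sketch, same OSD path as above:
```
# Run fsck with BlueFS/RocksDB/BlueStore debugging raised; CEPH_ARGS is
# picked up by the ceph tools, the fsck invocation itself is unchanged.
CEPH_ARGS="--debug-bluefs 20 --debug-rocksdb 20 --debug-bluestore 20" \
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
```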

> 
> The post states:
> 
> If fsck and repair do not help to recover from rocksdb corruption and all
> PGs are active+clean, then the safest way is to re-deploy the affected OSD.
> In case some PGs are incomplete or down, kindly contact Red Hat Ceph Support.
> 
> And I agree with that. Better not try to fix things, but let Ceph do a clean
> recovery.

Absolutely. I have already destroyed and re-created all affected OSDs except
one, and that one is marked out. I don't intend to put that OSD back in the
cluster; I just want to keep it around in case better information about the
root cause can be extracted from it.

> 
> As for the Root Cause
> 
> The rocksdb corruption on bluestore could be due to a hard reboot of the OSD
> node or a block.db device medium error. The rocksdb in BlueStore contains not
> only OMAPs and metadata for ceph objects, but also the on-disk layout of ceph
> objects, entire delayed transactions, allocator free regions and more.
> 
> You might get more information with increasing debug for rocksdb / bluefs /
> bluestore
> 
> ceph config set osd.0 debug_rocksdb 20/20
> ceph config set osd.0 debug_bluefs 20/20
> ceph config set osd.0 debug_bluestore 20/20

These debug tunables give a lot more output, and it strongly supports this
being RocksDB corruption. The full OSD log is quite large, so I pasted only
the last part, from the BlueFS journal replay onwards:
https://paste.debian.net/1211916/
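
For the record, this is roughly the sequence I used to capture that log,
including the cleanup so the debug levels don't keep inflating the OSD log
(assumes a non-containerised deployment with systemd-managed OSDs):
```
ceph config set osd.0 debug_rocksdb 20/20
ceph config set osd.0 debug_bluefs 20/20
ceph config set osd.0 debug_bluestore 20/20
systemctl restart ceph-osd@0    # output lands in /var/log/ceph/ceph-osd.0.log
# Drop the overrides again afterwards:
ceph config rm osd.0 debug_rocksdb
ceph config rm osd.0 debug_bluefs
ceph config rm osd.0 debug_bluestore
```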

This incident concerns me, as I know I brought the machine down cleanly and
the logs suggest all OSDs were terminated gracefully. None of the drives are
reporting any uncorrectable sectors either (though I know not to put too much
stock in drive error reporting). In this case there is sufficient redundancy
to recover everything, but if the same had happened on other hosts at the same
time, that would not be the case. I'll put the affected drives under a
microscope, keep the OSD around for research just in case, and keep digging in
the hope of finding an explanation.
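
(By "under a microscope" I mean SMART-level checks along these lines; the
device name is a placeholder:)
```
# Full SMART report, including reallocated/pending/uncorrectable sector counts.
smartctl -a /dev/sdX
# Kick off a long surface self-test and check the result once it finishes.
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX
```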

Thank you very much for your assistance, Stefan; it has helped me a lot in
getting to know Ceph's debugging features better.

> 
> Gr. Stefan
> 

Regards,
Davíð
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
