Re: Mimic offline problem

Darius Kasparavičius <daznis@xxxxxxxxx> · Tue, 2 Oct 2018 19:16:44 +0300



Hello,

Currently you have 15 objects missing. I would recommend finding them
and making backups of them. Ditch all other OSDs that are failing to
start and concentrate on bringing online those that hold the missing
objects. Then slowly turn off nodown and noout on the cluster and see
if it stabilises. If it stabilises, leave the flags unset; if not, turn
them back on.
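
A minimal sketch of that "turn off one flag, wait, check, roll back"
loop, assuming an admin node where the ceph CLI works. The is_stable
callback and the settle_seconds default are hypothetical placeholders
for whatever check and waiting time you trust; only the
"ceph osd set/unset <flag>" commands themselves are standard.

```python
import subprocess
import time

FLAGS = ("nodown", "noout")  # the two flags named in the advice above


def ceph(*args):
    """Run a ceph CLI command and return its stdout (needs an admin keyring)."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def unset_flags_gradually(is_stable, run=ceph, flags=FLAGS,
                          settle_seconds=300, sleep=time.sleep):
    """Unset one flag at a time and set it back if the cluster does not settle.

    is_stable is a caller-supplied function (hypothetical here) that returns
    True when the cluster looks healthy enough to continue.
    Returns the list of flags that were left unset.
    """
    cleared = []
    for flag in flags:
        run("osd", "unset", flag)
        sleep(settle_seconds)        # give the OSDs time to react
        if is_stable():
            cleared.append(flag)     # leave this flag unset
        else:
            run("osd", "set", flag)  # turn it back on
            break                    # do not touch the remaining flags
    return cleared
```

For example, unset_flags_gradually(lambda: input("stable? [y/n] ") == "y")
lets a human make the call after looking at "ceph -s". Stopping at the
first failure and re-setting only that flag is a design choice of this
sketch, not something the advice above prescribes.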
Now get some of the PGs that are blocked and query them to check
why they are blocked. Try removing as many blocks as possible and then
remove the norebalance/norecover flags and see if it starts to fix
itself.

On Tue, Oct 2, 2018 at 5:14 PM morphin
<morphinwithyou@xxxxxxxxx> wrote:
>
> One of the Ceph experts indicated that BlueStore is somewhat preview tech
> (as far as Red Hat is concerned).
> So it might be best to check BlueStore and RocksDB. There are some
> tools to check health and also to repair, but the documentation is
> limited.
> Does anyone have experience with them?
> Any lead or help towards a proper check would be great.
>
> On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim
> <goktug.yildirim@xxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > We recently upgraded from Luminous to Mimic. The cluster has been offline for 6 days now. The long story in short is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >
> > I’ve also CC’ed the developers since I believe this is a bug. If this is not the correct way, I apologise; please let me know.
> >
> > Over these 6 days lots of things happened and there were some findings about the problem. Some of them were misjudged and some were not looked into deeply enough.
> > However, the most certain diagnosis is this: each OSD causes very high disk I/O on its BlueStore disk (WAL and DB are fine). After that the OSDs become unresponsive or far less responsive. For example, "ceph tell osd.x version" hangs seemingly forever.
> >
> > So, because of the unresponsive OSDs, the cluster does not settle. This is our problem!
> >
> > This is the one thing we are very sure of. But we are not sure of the reason.
> >
> > Here is the latest ceph status:
> > https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >
> > This is the status after we started all of the OSDs 24 hours ago.
> > Some of the OSDs have not started. However, it didn't make any difference when all of them were online.
> >
> > Here is the debug=20 log of one OSD; it looks the same for all the others:
> > https://paste.ubuntu.com/p/8n2kTvwnG6/
> > As far as we can figure out, there is a loop pattern. I am sure it won't escape your eye.
> >
> > This is the full log of the same OSD:
> > https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >
> > Here is the strace of the same OSD process:
> > https://paste.ubuntu.com/p/8n2kTvwnG6/
> >
> > Recently we have been hearing more and more recommendations to upgrade to Mimic. I hope nobody gets hurt the way we did. I am sure we made lots of mistakes to let this happen. This situation may serve as an example for other users and could point to a potential bug for the Ceph developers.
> >
> > Any help to figure out what is going on would be great.
> >
> > Best Regards,
> > Goktug Yildirim
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com