Re: Mimic offline problem

Hi Sage,

Thank you for your response. Now I am confident this incident will be resolved.

The problem started when 7 servers crashed at the same time; they came back after ~5 minutes.

Two of our 3 mon services were restarted in this crash. Since the mon services are enabled, they would have started at nearly the same time. I don't know if this makes any difference, but some people on IRC said the mons should be started in order, not all at once; otherwise it could break things badly.

After 9 days we still see only 3400-3500 active+clean PGs. In the end we have many stuck requests and the cluster cannot heal itself.

When we set the noup flag, the OSDs can catch up on epochs easily. But when we unset the flag we see many stuck and slow ops within an hour.
I/O on all of my OSD disks sits at around 95% utilization and never ends. CPU and RAM usage are OK.
The OSDs get so stuck that we can't even run "ceph pg <pgid> query".

Also, we tried changing the RBD pool replication size from 2 to 1. Our goal was to eliminate the older PGs and leave the cluster with the good ones.
With replication size=1 we saw "13% PGs not active". But it did not solve our problem.

Of course we have to save 100% of the data. But honestly, even saving 50% of our data would make us very happy right now.

This is what happens when the cluster starts. I believe it explains the whole story very nicely.
This is our ceph.conf:
https://paste.ubuntu.com/p/8sQhfPDXnW/

This is the output of "osd stat && osd epochs && ceph -s && ceph health":

This is pg dump:

This is iostat & perf top:

This is the strace output of ceph-osd:

This is the OSD log (default debug):

This is the leader MON log (default debug):

These are the OSDs that failed to start; 58 in total.


This is an OSD video with debug osd = 20, debug ms = 1 and debug_filestore = 20.

This is the OSD logfile with debug osd = 20, debug ms = 1 and debug_filestore = 20.

As far as I understand, the OSD catches up with the mon epoch and then somehow exceeds it??

2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180


On 3 Oct 2018, at 05:14, Sage Weil <sage@xxxxxxxxxxxx> wrote:

osd_find_best_info_ignore_history_les is a dangerous option and you should
only use it in very specific circumstances when directed by a developer.  
In such cases it will allow a stuck PG to peer.  But you're not getting to
that point...you're seeing some sort of resource exhaustion.

The noup trick works when OSDs are way behind on maps and all need to
catch up.  The way to tell if they are behind is by looking at the 'ceph
daemon osd.NNN status' output and comparing to the latest OSDMap epoch
that the mons have.  Were they really caught up when you unset noup?
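The epoch check Sage describes can be scripted. A minimal sketch, assuming the `oldest_map`/`newest_map` fields of `ceph daemon osd.NNN status` and the `epoch` field of `ceph osd stat -f json` (verify the field names on your release); the sample values below are illustrative, not from a live cluster:

```python
import json

# Stand-in outputs; on a live cluster these would come from
# `ceph daemon osd.150 status` and `ceph osd stat -f json`.
osd_status = json.loads('{"whoami": 150, "state": "active", '
                        '"oldest_map": 60196, "newest_map": 72642, "num_pgs": 312}')
mon_osd_stat = json.loads('{"epoch": 72650, "num_osds": 168, "num_up_osds": 168}')

# An OSD is "caught up" when its newest map matches the mons' OSDMap epoch.
behind = mon_osd_stat["epoch"] - osd_status["newest_map"]
if behind > 0:
    print(f"osd.{osd_status['whoami']} is {behind} epochs behind the mons")
else:
    print(f"osd.{osd_status['whoami']} is caught up")
```

Running this for every OSD before unsetting noup would confirm whether they were all genuinely caught up.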

I'm just catching up and haven't read the whole thread but I haven't seen
anything that explains why the OSDs are doing lots of disk IO.  Catching up
on maps could explain it but not why they wouldn't peer once they were all
marked up...

sage


On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:

Has anyone heard about osd_find_best_info_ignore_history_les = true?
Would it be useful here? There is very little information about it.

Goktug Yildirim <goktug.yildirim@xxxxxxxxx> wrote (2 Oct 2018, 22:11):

Hi,

Indeed, I left ceph-disk to decide the WAL and DB partition sizes when I read somewhere that it would do the proper sizing.
For the bluestore cache size I have plenty of RAM. I will increase it to 8GB for each OSD and settle on a more carefully calculated number after the cluster stabilizes.

For the OSD map loading, I've also figured it out, and it is in a loop. For that reason I started the cluster with the noup flag and waited for the OSDs to reach the up-to-date epoch number. After that I unset noup. But I did not pay attention to the manager logs. Let me check them, thank you!

I am not forcing jemalloc or anything else, really. I have a very standard installation with no tweaks or tunings. All we asked for was stability over speed from the beginning. And here we are :/

On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@xxxxxxxxx> wrote:

Hi,


I can see some issues from the OSD log file. You have extremely
small DB and WAL partitions: only 1GB for the DB and 576MB for the WAL.
I would recommend cranking up the rocksdb cache size as much as
possible. If you have RAM, you can also increase bluestore's cache size
for HDD; the default is 1GB, so be as liberal as you can without
getting OOM kills. You also have lots of OSD map loading and decoding
in the log. Are you sure all monitors/managers/OSDs are up to date?
Also make sure you aren't forcing jemalloc loading; I had a funny
interaction with it after upgrading to Mimic.
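The cache sizes mentioned above correspond to ceph.conf options roughly like the following. This is a sketch, not a tuning recommendation: double-check the option names against the Mimic documentation, and the 8 GiB value is only an example for a host with RAM to spare:

```ini
[osd]
# Total bluestore cache per OSD on HDD-backed OSDs (default is about 1 GiB).
bluestore_cache_size_hdd = 8589934592
# Portion of the bluestore cache handed to the rocksdb (KV) block cache.
bluestore_cache_kv_ratio = 0.5
```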
On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
<goktug.yildirim@xxxxxxxxx> wrote:

Hello Darius,

Thanks for reply!

The main problem is that we cannot query PGs. "ceph pg 67.54f query" hangs and waits forever since the OSD is unresponsive.
We are certain that an OSD becomes unresponsive as soon as it is up, and that it responds again after its disk utilization stops.

So we have a small test like that:
* Stop all OSDs (168 of them).
* Start OSD1. 95% disk utilization starts immediately and takes 8 minutes to finish. Only after that does "ceph pg 67.54f query" work!
* While OSD1 is up, start OSD2. As soon as OSD2 starts, both OSD1 and OSD2 go to 95% disk utilization. This takes 17 minutes to finish.
* Now start OSD3; it is the same. All OSDs start heavy I/O, and it takes 25 minutes to settle.
* If you start 5 of them at the same time, all of the OSDs start heavy I/O again, and it takes 1 hour to finish.
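For reference, utilization figures like the ones in the steps above come from iostat. A small sketch of pulling the %util column out of an `iostat -x` device line; the sample line and the 90% threshold are made up for illustration:

```python
# Extract %util (the last column of `iostat -x` extended device statistics).
sample = "sdb 0.00 3.50 12.00 340.00 1.20 42.00 245.60 9.80 27.90 1.10 95.20"

def util_pct(iostat_line: str) -> float:
    # %util is the final field on each device line of iostat -x output.
    return float(iostat_line.split()[-1])

saturated = util_pct(sample) > 90.0
print(f"%util={util_pct(sample)} saturated={saturated}")
```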

So in light of these findings we set the noup flag and started all OSDs. At first there was no I/O. After 10 minutes we unset noup, and all 168 OSDs started heavy I/O. We thought that if we waited long enough the I/O would finish and the OSDs would become responsive again. After 24 hours they had not, because the I/O never finished or even slowed down.
One might think there is a lot of data to scan, but it is just 33TB.

So, in short, we do not know which PG is stuck so that we can remove it.
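Narrowing down the stuck PGs can at least start from `ceph pg dump -f json`, filtering out everything that is active+clean. A sketch over a made-up, abbreviated dump (the exact JSON nesting of `pg_stats` varies between Ceph releases, so adjust the path accordingly):

```python
import json

# Abbreviated stand-in for `ceph pg dump -f json` output.
dump = json.loads("""{
  "pg_stats": [
    {"pgid": "66.d8", "state": "peering"},
    {"pgid": "66.f8", "state": "active+clean"},
    {"pgid": "67.54f", "state": "incomplete"}
  ]
}""")

# Anything not active+clean is a candidate for `ceph pg <pgid> query`.
blocked = [p["pgid"] for p in dump["pg_stats"] if p["state"] != "active+clean"]
print(f"PGs to query first: {blocked}")
```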

However, we encountered something weird half an hour ago. We exported the same PG from two different OSDs: one export was 4.2GB and the other was 500KB! So we decided to export all PGs for backup. Then we will delete the strangely sized ones and start the cluster over. Maybe then we can resolve the stuck or unfound PGs as you advise.
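Once per-OSD exports of a PG exist (e.g. from `ceph-objectstore-tool --op export`, taken with the OSD stopped), the kind of size mismatch described above can be flagged automatically. A sketch using stand-in files with scaled-down sizes; the `<osd>-<pgid>.export` naming scheme is an assumption for illustration:

```python
import os
import tempfile

# Stand-in export files for the same PG taken from two different OSDs,
# mimicking `ceph-objectstore-tool --op export` output (sizes scaled down).
tmpdir = tempfile.mkdtemp()
for osd, nbytes in (("osd.12", 4200), ("osd.37", 1)):
    with open(os.path.join(tmpdir, f"{osd}-66.d8.export"), "wb") as f:
        f.write(b"\0" * nbytes)

sizes = {name.split("-")[0]: os.path.getsize(os.path.join(tmpdir, name))
         for name in os.listdir(tmpdir)}

# Flag copies drastically smaller than the largest export of this PG.
largest = max(sizes.values())
suspect = sorted(osd for osd, s in sizes.items() if s < largest * 0.01)
print(f"suspect exports: {suspect}")
```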

Any thought would be greatly appreciated.


On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@xxxxxxxxx> wrote:

Hello,

Currently you have 15 objects missing. I would recommend finding them
and making backups of them. Ditch all other OSDs that are failing to
start and concentrate on bringing online those that hold the missing
objects. Then slowly unset nodown and noout on the cluster and see if
it stabilises. If it stabilises, leave those settings; if not, turn
them back on.
Now take some of the PGs that are blocked and query them to check why
they are blocked. Try removing as many blockers as possible, then
remove the norebalance/norecovery flags and see if the cluster starts
to fix itself.

On Tue, Oct 2, 2018 at 5:14 PM morphin
<morphinwithyou@xxxxxxxxx> wrote:

One of the Ceph experts indicated that bluestore is somewhat of a
preview technology (at least as far as Red Hat is concerned).
So it could be best to check bluestore and rocksdb. There are some
tools to check their health and also to repair them, but the
documentation is limited.
Does anyone have experience with this?
Any lead or help towards a proper check would be great.

Goktug Yildirim <goktug.yildirim@xxxxxxxxx> wrote on Mon, 1 Oct 2018 at 22:55:

Hi all,

We have recently upgraded from Luminous to Mimic. This cluster has been offline for 6 days now. The long and short of the story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html

I've also CC'ed the developers since I believe this is a bug. If this is not the correct way, I apologize; please let me know.

Over these 6 days a lot has happened and there were some findings about the problem. Some of them were misjudged and some were not investigated deeply.
However, the most certain diagnosis is this: each OSD causes very high disk I/O on its bluestore disk (WAL and DB are fine). After that, OSDs become unresponsive, or respond very, very slowly. For example, "ceph tell osd.x version" hangs seemingly forever.

So the cluster does not settle because of the unresponsive OSDs. This is our problem!

This is the one thing we are very sure of. But we are not sure of the reason.

Here is the latest ceph status:
https://paste.ubuntu.com/p/2DyZ5YqPjh/

This is the status after we started all of the OSDs 24 hours ago.
Some of the OSDs did not start; however, it did not make any difference when all of them were online.

Here is the debug=20 log of an OSD, which looks the same as all the others:
https://paste.ubuntu.com/p/8n2kTvwnG6/
As far as we can tell there is a loop pattern. I am sure it is hard to spot by eye.

This is the full log of the same OSD:
https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0

Here is the strace of the same OSD process:
https://paste.ubuntu.com/p/8n2kTvwnG6/

Recently we hear more and more advice to upgrade to Mimic. I hope no one else gets hurt as we did. I am sure we made many mistakes to let this happen. This situation may serve as an example for other users, and it could be a potential bug for the Ceph developers.

Any help to figure out what is going on would be great.

Best Regards,
Goktug Yildirim
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




