Re: OSDs cannot match up with fast OSD map changes (epochs) during recovery

Muthusamy Muthiah <muthiah.muthusamy@xxxxxxxxx> · Mon, 13 Feb 2017 17:27:10 +0530

Hi All,
We see the same issue on one of our platforms, which was upgraded from 11.0.2 to 11.2.0. It occurs on only one node, where the CPU hits 100% and the OSDs of that node are marked down. The issue is not seen on a cluster that was installed from scratch with 11.2.0.

[root@xxxxxxxxxx ~] # systemctl start ceph-osd@315.service

[root@xxxxxxxxxx ~] # cd /var/log/ceph/

[root@xxxxxxxxxx ceph] # tail -f *osd*315.log

2017-02-13 11:29:46.752897 7f995c79b940  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/cls/hello/cls_hello.cc:296: loading cls_hello
2017-02-13 11:29:46.753065 7f995c79b940  0 _get_class not permitted to load kvs
2017-02-13 11:29:46.757571 7f995c79b940  0 _get_class not permitted to load lua
2017-02-13 11:29:47.058720 7f995c79b940  0 osd.315 44703 crush map has features 288514119978713088, adjusting msgr requires for clients
2017-02-13 11:29:47.058728 7f995c79b940  0 osd.315 44703 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
2017-02-13 11:29:47.058732 7f995c79b940  0 osd.315 44703 crush map has features 288531987042664448, adjusting msgr requires for osds
2017-02-13 11:29:48.343979 7f995c79b940  0 osd.315 44703 load_pgs
2017-02-13 11:29:55.913550 7f995c79b940  0 osd.315 44703 load_pgs opened 130 pgs
2017-02-13 11:29:55.913604 7f995c79b940  0 osd.315 44703 using 1 op queue with priority op cut off at 64.
2017-02-13 11:29:55.914102 7f995c79b940 -1 osd.315 44703 log_to_monitors {default=true}
2017-02-13 11:30:19.384897 7f9939bbb700  1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073336 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073343 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073344 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073345 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073347 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073348 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:54.772516 7f995c79b940  0 osd.315 44703 done with init, starting boot process
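
To get a quick picture of how often the op thread pool misses its heartbeat in a log like this one, the "had timed out" entries can be tallied by the function that reported them. This is only a minimal sketch: count_heartbeat_timeouts is a hypothetical helper name, and it assumes the log file on disk holds one entry per line.

```shell
# Tally heartbeat_map timeout entries in an OSD log, grouped by the
# reporting function (is_healthy, reset_timeout).
# count_heartbeat_timeouts is a hypothetical helper, not a Ceph tool.
count_heartbeat_timeouts() {
    # $1: path to an OSD log file, one log entry per line
    awk '/heartbeat_map/ && /had timed out after/ {
             # the word after "heartbeat_map" names the reporting function
             for (i = 1; i < NF; i++)
                 if ($i == "heartbeat_map") count[$(i + 1)]++
         }
         END { for (k in count) print k, count[k] }' "$1" | sort
}
```

It could be called as `count_heartbeat_timeouts /var/log/ceph/ceph-osd.315.log`, where the file name is an assumption based on the default log naming; adjust it to whatever `*osd*315.log` matches on the node.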

Thanks,
Muthu

On 13 February 2017 at 10:50, Andreas Gerstmayr <andreas.gerstmayr@xxxxxxxxx> wrote:
Hi,

Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0, our test
cluster has been unhealthy for about two weeks and can't recover by
itself anymore (unfortunately I skipped the upgrade to 10.2.5 because I
missed the ".z" in "All clusters must first be upgraded to Jewel
10.2.z").

Immediately after the upgrade I saw the following in the OSD logs:
s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed

There are also missed heartbeats in the OSD logs, and the OSDs which
don't send heartbeats have the following in their logs:
2017-02-08 19:44:51.367828 7f9be8c37700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-08 19:44:54.271010 7f9bc4e96700  1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15

While investigating, we found that some OSDs were lagging about
100-20000 OSD map epochs behind. The monitor publishes new epochs
every few seconds, but the OSD daemons are pretty slow in applying
them (up to a few minutes for 100 epochs). During recovery of the 24
OSDs of a storage node the CPU is running at almost 100% (the nodes
have 16 real cores, or 32 with Hyper-Threading).
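
One way to measure this lag is to compare the cluster's current OSD map epoch with the newest epoch a given OSD has. The sketch below assumes that `ceph osd dump` prints an "epoch N" line and that `ceph daemon osd.N status` prints JSON with a "newest_map" field; the three helper names are hypothetical, and the admin socket of a heavily overloaded OSD may be slow to answer.

```shell
# Extract "newest_map" from the JSON printed by `ceph daemon osd.N status`
# (read from stdin).
osd_newest_map() {
    sed -n 's/.*"newest_map"[^0-9]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Extract the current epoch from the "epoch N" line of `ceph osd dump`
# (read from stdin).
cluster_epoch() {
    awk '$1 == "epoch" { print $2; exit }'
}

# Print how many epochs an OSD is behind.
# $1: cluster epoch, $2: newest epoch known to the OSD
epoch_lag() {
    echo $(( $1 - $2 ))
}

# Intended use on the OSD host (needs access to the admin socket):
#   epoch_lag "$(ceph osd dump | cluster_epoch)" \
#             "$(ceph daemon osd.315 status | osd_newest_map)"
```

Running this in a loop for every OSD on a node would show which daemons are catching up and which are falling further behind.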

At times we had servers where all 24 OSDs were up to date with the
latest OSD map, but somehow they fell behind again. During recovery
some OSDs used up to 25 GB of RAM, which led to out-of-memory
conditions and further lagging of the OSDs on the affected server.

We already set the nodown, noout, norebalance, nobackfill, norecover,
noscrub and nodeep-scrub flags to prevent OSD flapping and even more
new OSD epochs.
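
For reference, these flags can be set (and later cleared) in one loop with `ceph osd set` and `ceph osd unset`. A minimal sketch: CEPH_BIN and set_flags are hypothetical names introduced here, with CEPH_BIN only serving to dry-run the loop through `echo` before touching the cluster.

```shell
# CEPH_BIN is a hypothetical override; set CEPH_BIN=echo for a dry run.
CEPH_BIN=${CEPH_BIN:-ceph}

# The flags named in this mail.
RECOVERY_FLAGS="nodown noout norebalance nobackfill norecover noscrub nodeep-scrub"

# set_flags is a hypothetical helper.
# $1: "set" to apply the flags, "unset" to clear them again
set_flags() {
    for flag in $RECOVERY_FLAGS; do
        "$CEPH_BIN" osd "$1" "$flag" || return 1
    done
}
```

`set_flags set` applies all seven flags; `set_flags unset` removes them once the OSDs have caught up.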

Is there anything we can do to let the OSDs recover? It seems that the
servers don't have enough CPU resources for recovery. I already played
around with the osd map message max setting (when I increased it to
1000 to speed up recovery, the OSDs didn't get any updates at all?),
and with the osd heartbeat grace and osd thread timeout settings (to
give the overloaded server more time), but without success so far. I've
seen errors related to the AsyncMessenger in the logs, so I reverted
to the SimpleMessenger (which was working successfully with Jewel).
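
When experimenting with settings like these, it helps to confirm what a running OSD actually uses by querying its admin socket with `ceph daemon osd.N config get`. The sketch below assumes the option names osd_map_message_max, osd_heartbeat_grace and ms_type; the thread timeout is left out because the mail does not give its exact option name. CEPH_BIN and show_osd_settings are hypothetical names.

```shell
# CEPH_BIN is a hypothetical override; set CEPH_BIN=echo for a dry run.
CEPH_BIN=${CEPH_BIN:-ceph}

# Option names assumed to correspond to the settings discussed here.
OSD_SETTINGS="osd_map_message_max osd_heartbeat_grace ms_type"

# show_osd_settings is a hypothetical helper.
# $1: OSD id, e.g. 315
show_osd_settings() {
    for opt in $OSD_SETTINGS; do
        "$CEPH_BIN" daemon "osd.$1" config get "$opt" || return 1
    done
}
```

`show_osd_settings 315` has to run on the host where osd.315 lives, since the admin socket is local to the daemon.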

Cluster details:
6 storage nodes with 2x Intel Xeon E5-2630 v3 8x2.40GHz
256GB RAM
Each storage node has 24 HDDs attached, one OSD per disk, journal on same disk
3 monitors in total, co-located with the storage nodes
separate front and back network (10 Gbit)
OS: CentOS 7.2.1511
Kernel: 4.9.8-1.el7.elrepo.x86_64 from elrepo.org

Thanks,
Andreas

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
