Hey all,

I had a very similar issue years back. OSDs would take a long time to start when they had been out for a while (a few weeks or so), and the catch-up kept starting over because the OSD service would restart itself partway through.

In my case, the cause was that a new OSD map epoch was being generated every second, which is obviously wrong. I was able to observe this in the MON logs: I had forgotten to set require-osd-release after a version upgrade. After setting it, the number of new epochs decreased drastically, and "offline" OSDs were subsequently able to rejoin much more quickly.
Best Regards,
Alex Walender
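For reference, the check and the fix Alex describes look roughly like this; treat it as a sketch, and note that "octopus" is only an example release name, substitute whatever your cluster actually runs:

    # Is require-osd-release lagging behind the installed version?
    ceph osd dump | grep require_osd_release

    # Watch the osdmap epoch; a new epoch every second or two means
    # something is churning out maps far too fast
    watch -n 1 ceph osd stat

    # Set the flag only after confirming all daemons run that release
    # or newer ("octopus" is an example value)
    ceph osd require-osd-release octopus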
On 21.10.24 22:31, Vladimir Sigunov wrote:

Hi Dan and Frank,

From my experience, if an OSD was down for a long period of time, it could take more than one manual restart for this OSD to catch up to the current epoch. By manual restart I mean systemctl reset-failed && systemctl restart <systemd unit>. The "warm-up" time could be up to 15 minutes. I last saw (and fixed with the steps above) this issue about a month ago on an 18.2.2 cluster. Hope this workaround helps.

Sincerely,
Vladimir.

________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Monday, October 21, 2024 3:03:41 PM
To: Frank Schilder <frans@xxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: [ceph-users] Re: failed to load OSD map for epoch 2898146, got 0 bytes

Hi Frank,

Are you sure it's looping over the same epochs? It looks like that old OSD is trying to catch up on all the osdmaps it missed while it was down. (And those old maps have probably been trimmed from all the mons and OSDs, based on the "got 0 bytes" error.) Eventually it should catch up to the current epoch (e2971464 according to your log), and then the PGs can go active.

Cheers, Dan

--
Dan van der Ster
CTO @ CLYSO
https://clyso.com | dan.vanderster@xxxxxxxxx
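To check Dan's diagnosis, one can compare the epoch range the mons still hold against what the OSD has stored locally. A rough sketch; osd.1004 is taken from Frank's log below, and the daemon command has to run on the host where that OSD lives:

    # Oldest and newest osdmap epochs the mons still keep; epochs below
    # osdmap_first_committed have been trimmed away
    ceph report 2>/dev/null | grep -E '"osdmap_(first|last)_committed"'

    # oldest_map / newest_map show how far behind this OSD is
    # (admin socket, run on the OSD's host)
    ceph daemon osd.1004 status

The gap between the OSD's newest_map and the cluster's current epoch gives a feel for how much catch-up is left.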
On Mon, Oct 21, 2024 at 9:13 AM Frank Schilder <frans@xxxxxx> wrote:

Hi all,

I have a strange problem on an Octopus (latest) cluster. We had a couple of SSD OSDs down for a while and brought them back up today. For some reason, these OSDs don't come up and flood the log with messages like

osd.1004 2971464 failed to load OSD map for epoch 2898146, got 0 bytes

These messages cycle through the same epochs over and over again. I did not really find much help out there; there is an old thread about a similar (or the same) problem on a home-lab cluster, although with new OSDs, I believe. I couldn't really find useful information. The OSDs seem to boot fine and then end up in something like a death loop. Below are some snippets from the OSD log. Any hints appreciated.

Thanks and best regards,
Frank

After OSD start, everything looks normal up to here:

2024-10-21T17:41:39.136+0200 7fad73cf6f00  0 osd.1004 2971464 load_pgs opened 205 pgs
2024-10-21T17:41:39.140+0200 7fad73cf6f00 -1 osd.1004 2971464 log_to_monitors {default=true}
2024-10-21T17:41:39.150+0200 7fad73cf6f00 -1 osd.1004 2971464 mon_cmd_maybe_osd_create fail: 'osd.1004 has already bound to class 'fs_meta', can not reset class to 'ssd'; use 'ceph osd crush rm-device-class <id>' to remove old class first': (16) Device or resource busy
2024-10-21T17:41:39.155+0200 7fad519a3700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898132, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad511a2700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898132, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad511a2700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898133, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad511a2700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898134, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad511a2700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898135, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad511a2700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898136, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4f99f700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898132, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4f99f700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898133, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898132, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898133, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898134, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898135, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898136, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898137, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad73cf6f00  0 osd.1004 2971464 done with init, starting boot process
2024-10-21T17:41:39.155+0200 7fad73cf6f00  1 osd.1004 2971464 start_boot
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898138, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898139, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898140, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898141, got 0 bytes
2024-10-21T17:41:39.155+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898142, got 0 bytes
2024-10-21T17:41:39.156+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898143, got 0 bytes
2024-10-21T17:41:39.156+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898144, got 0 bytes
2024-10-21T17:41:39.156+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898145, got 0 bytes
2024-10-21T17:41:39.156+0200 7fad4b196700 -1 osd.1004 2971464 failed to load OSD map for epoch 2898146, got 0 bytes

These messages repeat over and over again, with some others of this form showing up every now and then:
2024-10-21T17:41:39.476+0200 7fad651ca700  4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 12] Generated table #82879: 76571 keys, 67866714 bytes
2024-10-21T17:41:39.688+0200 7fad651ca700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1729525299690000, "cf_name": "default", "job": 12, "event": "table_file_creation", "file_number": 82879, "file_size": 67866714, "table_properties": {"data_size": 67111697, "index_size": 562601, "filter_size": 191557, "raw_key_size": 4823973, "raw_average_key_size": 63, "raw_value_size": 62631087, "raw_average_value_size": 817, "num_data_blocks": 15644, "num_entries": 76571, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}

And on another occasion:

2024-10-21T17:41:40.520+0200 7fad651ca700  4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 12] Generated table #82880: 76774 keys, 67868330 bytes
2024-10-21T17:41:40.520+0200 7fad501a0700 -1 osd.1004 2971464 failed to load OSD map for epoch 2899234, got 0 bytes
2024-10-21T17:41:40.520+0200 7fad501a0700 -1 osd.1004 2971464 failed to load OSD map for epoch 2899235, got 0 bytes
2024-10-21T17:41:40.520+0200 7fad501a0700 -1 osd.1004 2971464 failed to load OSD map for epoch 2899236, got 0 bytes
2024-10-21T17:41:40.520+0200 7fad651ca700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1729525300521403, "cf_name": "default", "job": 12, "event": "table_file_creation", "file_number": 82880, "file_size": 67868330, "table_properties": {"data_size": 67113021, "index_size": 562509, "filter_size": 191941, "raw_key_size": 4836742, "raw_average_key_size": 62, "raw_value_size": 62623274, "raw_average_value_size": 815, "num_data_blocks": 15630, "num_entries": 76774, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
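Spelling out Vladimir's restart workaround for the OSD in Frank's log, as a sketch: the unit name assumes a package-based deployment (cephadm clusters use ceph-<fsid>@osd.<id>.service instead), and the sequence may need repeating if the OSD gets killed mid catch-up:

    # Clear systemd's failed state so the unit is allowed to start again,
    # then restart the OSD
    systemctl reset-failed ceph-osd@1004.service
    systemctl restart ceph-osd@1004.service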
--
M.Sc Alex Walender
Forschungszentrum Jülich
Institut für Bio- und Geowissenschaften
IBG 5 - Computergestützte Metagenomik / de.NBI Cloud Site Bielefeld
Büro: Universität Bielefeld (UHG), M3-118
Tel.: +49-521-106-2907
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx