Re: Forced upgrade OSD from Luminous to Pacific

Hi Alex,

That it shows that weird state for the OSDs when no OSDs are connected is normal. There is a setting in the MONs that prevents OSDs from being marked out if more than X% are down. I think X is 30. That's why you see 3 OSDs in but only 1 up. It's not the real state.

For recovery it might be useful to have the MONs report proper stats. I think you can change the threshold by adjusting "mon osd min up ratio"; see here: https://docs.ceph.com/en/octopus/rados/configuration/mon-osd-interaction/#osds-report-down-osds (pick your ceph version).
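For example, on a Pacific MON something like this should do it (untested from my side; note there is also the related "mon osd min in ratio", which governs marking OSDs out, so check the exact option names in the docs for your release):

ceph config set mon mon_osd_min_up_ratio 0.1
ceph config set mon mon_osd_min_in_ratio 0.1
ceph config get mon mon_osd_min_up_ratio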

You should check networking and the OSD logs. Maybe the OSDs are corrupted? Do they manage to read the rocksdb and get to the state where they try to join the cluster? Do they crash?
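With an OSD stopped, you can also ask bluestore directly whether its metadata is readable, for example (adjust the path for each OSD):

ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1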

You can start an OSD daemon manually to see the complete startup log live in a terminal.
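Something along these lines (stop the systemd unit first; -d keeps the daemon in the foreground and logs to stderr; flags may differ slightly between releases):

systemctl stop ceph-osd@1
ceph-osd -d --cluster ceph --id 1 --setuser ceph --setgroup ceph --debug-osd 10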

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Alex Rydzewski <rydzewski.al@xxxxxxxxx>
Sent: Wednesday, October 9, 2024 12:26 PM
To: Frédéric Nass
Cc: ceph-users
Subject:  Re: Forced upgrade OSD from Luminous to Pacific

Hello, Frédéric!

1.
First I repaired the MON when Ceph was Luminous, but it wouldn't start, with some error I don't remember. Then I upgraded Ceph and repeated the repair procedure, then upgraded Ceph again and repeated the restore procedure, and the MON started. Now I can query it.
root@helper:~# ceph --version
ceph version 16.2.15 (12fd9dfef6998ac41c93f56885264a7d43a51b03) pacific (stable)

root@helper:~# ceph -s
   cluster:
     id:     96b6ff1d-25bf-403f-be3d-78c2fb0ff747
     health: HEALTH_WARN
             mon is allowing insecure global_id reclaim
             2 osds down
             Reduced data availability: 351 pgs inactive
             2 pool(s) have non-power-of-two pg_num
             2 daemons have recently crashed

   services:
     mon: 1 daemons, quorum helper (age 21h)
     mgr: helper(active, since 21h)
     osd: 5 osds: 1 up, 3 in

   data:
     pools:   3 pools, 351 pgs
     objects: 0 objects, 0 B
     usage:   0 B used, 0 B / 0 B avail
     pgs:     100.000% pgs unknown
              351 unknown

root@helper:~# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-1         18.19298  root default
-3         18.19298      host helper
  0    hdd   3.63860          osd.0      down         0  1.00000
  1    hdd   3.63860          osd.1        up   1.00000  1.00000
  2    hdd   3.63860          osd.2      down         0  1.00000
  3    hdd   3.63860          osd.3      down   1.00000  1.00000
  4    hdd   3.63860          osd.4      down   1.00000  1.00000

Although it reports this state, there are actually no OSDs connected to it.
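One way to confirm this is to list sessions via the MON admin socket (mon id "helper" in my case), e.g.:

root@helper:~# ceph daemon mon.helper sessions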

root@helper:~# tail /var/log/ceph/ceph-osd.1.log
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [P] **

2024-10-09T13:17:09.716+0300 7eff91d8a700  1 osd.1 45887 tick checking mon for new map
2024-10-09T13:17:39.864+0300 7eff91d8a700  1 osd.1 45887 tick checking mon for new map

2. Yes, I upgraded MON and OSDs to Pacific

root@helper:~# ceph-osd --version
ceph version 16.2.15 (12fd9dfef6998ac41c93f56885264a7d43a51b03) pacific (stable)
root@helper:~# ceph-mon --version
ceph version 16.2.15 (12fd9dfef6998ac41c93f56885264a7d43a51b03) pacific (stable)

3.
Yes, now the MON starts and the OSDs start, but the OSDs cannot connect to the MON. At the same time, the MON log shows the message:
disallowing boot of octopus+ OSD osd.xx

And I tried rebuilding the MON with this Ceph (Pacific) version, and it is running now.
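For reference, the rebuild was roughly the standard "recover MON from OSDs" procedure (as in the proxmox link quoted below); the paths, keyring location, and the mon id "helper" are from my setup, so adjust as needed:

ms=/root/mon-store; mkdir -p $ms
# gather the cluster map from every OSD data dir
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --no-mon-config \
        --op update-mon-db --mon-store-path $ms
done
# rebuild the store and put it in place of the broken one
ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring --mon-ids helper
mv /var/lib/ceph/mon/ceph-helper/store.db /var/lib/ceph/mon/ceph-helper/store.db.bak
mv $ms/store.db /var/lib/ceph/mon/ceph-helper/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-helper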

On 09.10.24 12:35, Frédéric Nass wrote:
> ----- On 8 Oct 24, at 15:24, Alex Rydzewski <rydzewski.al@xxxxxxxxx> wrote:
>
>> Hello, dear community!
>>
>> I kindly ask for your help in resolving my issue.
>>
>> I have a server with a single-node CEPH setup with 5 OSDs. This server
>> has been powered off for about two years, and when I needed the data
>> from it, I found that the SSD where the system was installed had died.
>>
>> I tried to recover the cluster. First, assuming the old CEPH is there I
>> installed Debian 10 with CEPH 12.2.11, mounted the OSDs to
>> /var/lib/ceph/osd/ceph-xx and assembled the monitor, as described here
>> https://forum.proxmox.com/threads/recover-ceph-from-osds-only.113699/.
>>
>> However, the monitor wouldn't start, giving an error I don't remember.
>> Then I made a series of mistakes, upgrading the system and CEPH first to
>> Nautilus and then to Pacific. Eventually, I managed to start the
>> monitor, but a compatibility issue with the OSDs remains.
>>
>> When the OSDs start, I see the message: check_osdmap_features
>> require_osd_release unknown -> luminous
>> At the same time, the monitor log shows: disallowing boot of octopus+
>> OSD osd.xx.
>> After starting, the OSD remains in the state: tick checking mon for
>> new map
> Hi Alex,
>
> Correct me if I got this wrong:
>
> 1. You repaired the MON database while OSDs were still on Luminous
> 2. You upgraded MONs and OSDs to Pacific
> 3. MONs now start but won't allow Pacific OSDs to join the cluster
>
> Have you tried repairing the MON database again, now that the OSDs are running Pacific? (Make sure to back up the previously repaired MON database before attempting this.)
>
> Regards,
> Frédéric
>
>>
>> Then I enabled the msgr v2 protocol and tried enabling RocksDB sharding for
>> the OSD, as described here
>> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding,
>> but it didn’t help.
>>
>> Attempts to start the OSD with lower versions of Ceph, even with Octopus,
>> end with the error:
>> 2024-10-08 10:45:38.402975 7fba61b34ec0 -1 bluefs _replay 0x0: stop:
>> unrecognized op 12
>> 2024-10-08 10:45:38.402992 7fba61b34ec0 -1 bluefs mount failed to replay
>> log: (5) Input/output error
>>
>>
>> So, currently, I have CEPH 16.2.15, and the OSD is in the following state:
>>
>> /"/var/lib/ceph/osd/ceph-1/block": {
>>      "osd_uuid": "2bb56721-28c7-45cc-9344-6cc5c699a642",
>>      "size": 4000681103360,
>>      "btime": "2018-06-02 13:16:57.042205",
>>      "description": "main",
>>      "bfm_blocks": "61045632",
>>      "bfm_blocks_per_key": "128",
>>      "bfm_bytes_per_block": "65536",
>>      "bfm_size": "4000681099264",
>>      "bluefs": "1",
>>      "ceph_fsid": "96b6ff1d-25bf-403f-be3d-78c2fb0ff747",
>>      "kv_backend": "rocksdb",
>>      "magic": "ceph osd volume v026",
>>      "mkfs_done": "yes",
>>      "ready": "ready",
>>      "require_osd_release": "12",
>>      "whoami": "1"
>> }
>>
>> with modified RocksDB to enable sharding.
>>
>>
>> Please advise: is there a way to upgrade such OSDs so they can run
>> with this version of Ceph?
>>
>> If you need more information here, let me know and I will provide
>> whatever is needed.
>>
>> --
>> Alexander Rydzewski
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Alexander Rydzewski
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx