Re: just-rebuilt mon does not join the cluster

Is there a chance you might have seen this https://tracker.ceph.com/issues/49231 ?

Do you have network monitoring with packet reports? It is possible though that you have observed something new.

Does your cluster date from pre-Luminous times? Dropping support for LevelDB was discussed on the user list some time ago. There were instructions on how to upgrade the mon store, which should happen before starting a Ceph upgrade. It seems the info didn't make it into the upgrade instructions.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Jan Kasprzak <kas@xxxxxxxxxx>
Sent: 09 September 2022 10:43:37
To: ceph-users@xxxxxxx
Subject:  Re: just-rebuilt mon does not join the cluster

TL;DR: my cluster is working now. Details and further problems below:

Jan Kasprzak wrote:
: I did
:
: ceph tell mon.* config set mon_sync_max_payload_size 4096
: ceph config set mon mon_sync_max_payload_size 4096
:
: and added "mon_sync_max_payload_size = 4096" into the [global] section
: of the mon host to be (re-)added, and ran
:
: systemctl restart ceph-mon@mon1.service
:
: on that host. But it did not help - mon1 did not join the cluster.

        I let it settle overnight. About four hours after I did the above,
with ceph-mon left running and nothing else touched, the newly configured
mon successfully joined the cluster.
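
A rough sketch of how one could check from a script whether a given mon
has made it into the quorum, rather than watching ceph -s by hand. The
helper name in_quorum is hypothetical; it only assumes that the JSON
printed by "ceph quorum_status -f json" contains a quorum_names array,
and it uses plain text matching rather than a real JSON parser.

```shell
# Hypothetical helper: reads the JSON from `ceph quorum_status -f json`
# on stdin and succeeds if the named mon is listed in quorum_names.
# Text matching only, so treat it as a sketch, not a robust parser.
in_quorum() {
    tr -d ' \n\t' | grep -o '"quorum_names":\[[^]]*]' | grep -q "\"$1\""
}

# Usage (needs a reachable cluster, so it is left commented out):
#   ceph quorum_status -f json | in_quorum mon1 && echo "mon1 is in quorum"
```

Run in a loop with a sleep, this would show when a slowly syncing mon
finally joins.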

        So I upgraded mon1 to Quincy and went on to upgrade another mon,
named mon3. I also had to remove it from the cluster and reinitialize
its data directory (apparently the Quincy mon cannot handle my leveldb
and just crashed, so I mkfs'd a new rocksdb data directory).
Mon3 got registered to the cluster successfully about 10-20 seconds after
the start, but it could not join the quorum. There were the following
lines in the log file, repeated every second or so:

2022-09-09T09:01:52.611+0200 7f8ecc968700  0 log_channel(cluster) log [INF] : mon.mon3 calling monitor election
2022-09-09T09:01:52.611+0200 7f8ecc968700  1 paxos.2).electionLogic(62993) init, last seen epoch 62993, mid-election, bumping
2022-09-09T09:01:52.661+0200 7f8ecc968700  1 mon.mon3@2(electing) e33 collect_metadata md127:  no unique device id for md127: fallback method has no model nor serial'

(/dev/md127 is my root filesystem, RAID-1)
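
To put a number on "repeated every second or so", something like the
following could count election calls per minute in the mon log. The
function name is hypothetical, and it assumes each log line starts with
an ISO timestamp in the format shown above.

```shell
# Hypothetical helper: count "calling monitor election" lines per minute.
# Assumes each line starts with a timestamp like 2022-09-09T09:01:52.611+0200,
# so the first 16 characters identify the minute.
elections_per_minute() {
    grep 'calling monitor election' "$1" | cut -c1-16 | sort | uniq -c
}

# Usage (the log path is an assumption based on the default naming):
#   elections_per_minute /var/log/ceph/ceph-mon.mon3.log
```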

When I stopped ceph-mon@mon3.service, ceph -s reported many slow ops
on the other two mons, mon1 and mon2, for a while. After another mkfs
and several restarts, mon3 entered the quorum successfully. I don't know
what made the difference.
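
For reference, the remove-and-mkfs cycle can be sketched roughly along
the lines of the documented manual add/remove monitor procedure. This is
not necessarily the exact sequence used here: the data directory path is
the assumed default, the scratch directory and the helper names run and
rebuild_mon are hypothetical, and DRY_RUN=1 only prints the commands
instead of executing them.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Hypothetical sketch: remove a mon from the cluster and recreate its store.
rebuild_mon() {
    id="$1"
    dir="/var/lib/ceph/mon/ceph-$id"   # assumed default mon data path
    tmp="/tmp/mon-rebuild-$id"         # hypothetical scratch directory
    run systemctl stop "ceph-mon@$id.service"
    run ceph mon remove "$id"
    run mv "$dir" "$dir.old"           # keep the old store instead of deleting it
    run mkdir -p "$tmp"
    run ceph auth get mon. -o "$tmp/keyring"
    run ceph mon getmap -o "$tmp/monmap"
    run mkdir -p "$dir"
    run ceph-mon -i "$id" --mkfs --monmap "$tmp/monmap" --keyring "$tmp/keyring"
    run chown -R ceph:ceph "$dir"
    run systemctl start "ceph-mon@$id.service"
}

# Usage: DRY_RUN=1 rebuild_mon mon3   (prints the commands only)
```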

I then upgraded mon2 as well, with similar problems, but it too eventually
managed to join the cluster and the quorum.

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
    We all agree on the necessity of compromise. We just can't agree on
    when it's necessary to compromise.                     --Larry Wall
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx