Re: 1/3 mons down! mon does not rejoin

Yes, the empty DB told me that at this point I had no other choice
than to recreate the entire mon service.

* remove the broken mon
  ceph mon remove $(hostname -s)

* prepare a fresh mon data directory
  rm -rf /var/lib/ceph/mon/ceph-$(hostname -s)
  mkdir /var/lib/ceph/mon/ceph-$(hostname -s)
  ceph auth get mon. -o /tmp/mon-keyfile
  ceph mon getmap -o /tmp/mon-monmap
  ceph-mon -i $(hostname -s) --mkfs --monmap /tmp/mon-monmap --keyring /tmp/mon-keyfile
  chown -R ceph: /var/lib/ceph/mon/ceph-$(hostname -s)

I will wait for a low-traffic period on the cluster before enabling
the recreated mon; a rough sketch of that step is below.
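
For reference, enabling and verifying should look roughly like this (a
sketch assuming a systemd-managed, non-containerized deployment; the
unit name may differ on your setup):

  # start the recreated mon
  systemctl start ceph-mon@$(hostname -s)

  # confirm it rejoined quorum
  ceph mon stat
  ceph quorum_status --format json-pretty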

thanks for all the help so far
Ansgar

On Mon, 26 Jul 2021 at 15:39, Dan van der Ster
<dan@xxxxxxxxxxxxxx> wrote:
>
> Your log ends with
>
> > 2021-07-25 06:46:52.078 7fe065f24700  1 mon.osd01@0(leader).osd e749666 do_prune osdmap full prune enabled
>
> So mon.osd01 was still the leader at that time.
> When did it leave the cluster?
>
> > I also found that the rocksdb on osd01 is only 1MB in size and 345MB on the other mons!
>
> It sounds like mon.osd01's db has been re-initialized as empty, e.g.
> maybe the directory was lost somehow between reboots?
>
> -- dan
>
>
> On Mon, Jul 26, 2021 at 1:55 PM Ansgar Jazdzewski
> <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
> >
> > Hi Dan, Hi Folks,
> >
> > This is how things started. I also found that the rocksdb on osd01 is
> > only 1 MB in size, versus 345 MB on the other mons!
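> >
> > (for reference, a check along these lines shows the store size,
> > assuming the default mon store path:
> >   du -sh /var/lib/ceph/mon/ceph-*/store.db
> > run on each mon host)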
> >
> > 2021-07-25 06:46:30.029 7fe061f1c700  0 log_channel(cluster) log [DBG]
> > : monmap e1: 3 mons at
> > {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0],osd03=[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0]}
> > 2021-07-25 06:46:30.029 7fe061f1c700  0 log_channel(cluster) log [DBG]
> > : fsmap cephfs:1 {0=osd01=up:active} 2 up:standby
> > 2021-07-25 06:46:30.029 7fe061f1c700  0 log_channel(cluster) log [DBG]
> > : osdmap e749665: 436 total, 436 up, 436 in
> > 2021-07-25 06:46:30.029 7fe061f1c700  0 log_channel(cluster) log [DBG]
> > : mgrmap e89: osd03(active, since 13h), standbys: osd01, osd02
> > 2021-07-25 06:46:30.029 7fe061f1c700  0 log_channel(cluster) log [INF]
> > : overall HEALTH_OK
> > 2021-07-25 06:46:30.805 7fe065f24700  1 mon.osd01@0(leader).osd
> > e749665 do_prune osdmap full prune enabled
> > 2021-07-25 06:46:30.957 7fe06371f700  0 mon.osd01@0(leader) e1
> > handle_command mon_command({"prefix": "status"} v 0) v1
> > 2021-07-25 06:46:30.957 7fe06371f700  0 log_channel(audit) log [DBG] :
> > from='client.? 10.152.28.171:0/3290370429' entity='client.admin'
> > cmd=[{"prefix": "status"}]: dispatch
> > 2021-07-25 06:46:51.922 7fe065f24700  1 mon.osd01@0(leader).mds e85
> > tick: resetting beacon timeouts due to mon delay (slow election?) of
> > 20.3627s seconds
> > 2021-07-25 06:46:51.922 7fe065f24700 -1 mon.osd01@0(leader) e1
> > get_health_metrics reporting 13 slow ops, oldest is pool_op(delete
> > unmanaged snap pool 3 tid 27666 name  v749664)
> > 2021-07-25 06:46:51.930 7fe06371f700  0 log_channel(cluster) log [INF]
> > : mon.osd01 calling monitor election
> > 2021-07-25 06:46:51.930 7fe06371f700  1
> > mon.osd01@0(electing).elector(173) init, last seen epoch 173,
> > mid-election, bumping
> > 2021-07-25 06:46:51.946 7fe06371f700  1 mon.osd01@0(electing) e1
> > collect_metadata :  no unique device id for : fallback method has no
> > model nor serial'
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.962 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.966 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.966 7fe067727700  1 mon.osd01@0(electing) e1
> > handle_auth_request failed to assign global_id
> > 2021-07-25 06:46:51.970 7fe06371f700  0 log_channel(cluster) log [INF]
> > : mon.osd01 is new leader, mons osd01,osd02,osd03 in quorum (ranks
> > 0,1,2)
> > 2021-07-25 06:46:52.002 7fe06371f700  1 mon.osd01@0(leader).osd
> > e749666 e749666: 436 total, 436 up, 436 in
> > 2021-07-25 06:46:52.026 7fe06371f700  0 log_channel(cluster) log [DBG]
> > : monmap e1: 3 mons at
> > {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0],osd03=[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0]}
> > 2021-07-25 06:46:52.026 7fe06371f700  0 log_channel(cluster) log [DBG]
> > : fsmap cephfs:1 {0=osd01=up:active} 2 up:standby
> > 2021-07-25 06:46:52.026 7fe06371f700  0 log_channel(cluster) log [DBG]
> > : osdmap e749666: 436 total, 436 up, 436 in
> > 2021-07-25 06:46:52.026 7fe06371f700  0 log_channel(cluster) log [DBG]
> > : mgrmap e89: osd03(active, since 13h), standbys: osd01, osd02
> > 2021-07-25 06:46:52.026 7fe06371f700  0 log_channel(cluster) log [INF]
> > : Health check cleared: MON_DOWN (was: 1/3 mons down, quorum
> > osd02,osd03)
> > 2021-07-25 06:46:52.042 7fe061f1c700  0 log_channel(cluster) log [WRN]
> > : Health detail: HEALTH_WARN 7 slow ops, oldest one blocked for 36
> > sec, daemons [mon.osd02,mon.osd03] have slow ops.
> > 2021-07-25 06:46:52.042 7fe061f1c700  0 log_channel(cluster) log [WRN]
> > : SLOW_OPS 7 slow ops, oldest one blocked for 36 sec, daemons
> > [mon.osd02,mon.osd03] have slow ops.
> > 2021-07-25 06:46:52.078 7fe065f24700  1 mon.osd01@0(leader).osd
> > e749666 do_prune osdmap full prune enabled
> >
> > On Mon, 26 Jul 2021 at 09:45, Dan van der Ster
> > <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > Do you have ceph-mon logs from when mon.osd01 first failed before the
> > > on-call team rebooted it? They might give a clue what happened to
> > > start this problem, which maybe is still happening now.
> > >
> > > This looks similar but it was eventually found to be a network issue:
> > > https://tracker.ceph.com/issues/48033
> > >
> > > -- Dan
> > >
> > > On Sun, Jul 25, 2021 at 6:36 PM Ansgar Jazdzewski
> > > <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Sun, 25 Jul 2021 at 18:02, Dan van der Ster
> > > > <dan@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > What do you have for the new global_id settings? Maybe set it to allow insecure global_id auth and see if that allows the mon to join?
> > > >
> > > >  auth_allow_insecure_global_id_reclaim is allowed, as we still have
> > > > some VMs that have not been restarted.
> > > >
> > > > # ceph config get mon.*
> > > > WHO MASK LEVEL    OPTION                                         VALUE RO
> > > > mon      advanced auth_allow_insecure_global_id_reclaim          true
> > > > mon      advanced mon_warn_on_insecure_global_id_reclaim         false
> > > > mon      advanced mon_warn_on_insecure_global_id_reclaim_allowed false
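> > > >
> > > > For reference, those values can be set with commands along these lines:
> > > >   ceph config set mon auth_allow_insecure_global_id_reclaim true
> > > >   ceph config set mon mon_warn_on_insecure_global_id_reclaim false
> > > >   ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false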
> > > >
> > > > > > I can try to move the /var/lib/ceph/mon/ dir and recreate it!?
> > > > >
> > > > > I'm not sure it will help. Running the mon with --debug_ms=1 might give clues why it's stuck probing.
> > > >
> > > > 2021-07-25 16:28:41.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > probing other monitors
> > > > 2021-07-25 16:28:41.418 7fcc613d8700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] send_to--> mon
> > > > [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] -- mon_probe(probe
> > > > a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
> > > > ?+0 0x55c6b35ae780
> > > > 2021-07-25 16:28:41.418 7fcc613d8700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] -->
> > > > [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] -- mon_probe(probe
> > > > a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
> > > > 0x55c6b35ae780 con 0x55c6b2611180
> > > > 2021-07-25 16:28:41.418 7fcc613d8700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] send_to--> mon
> > > > [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] -- mon_probe(probe
> > > > a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
> > > > ?+0 0x55c6b35aea00
> > > > 2021-07-25 16:28:41.418 7fcc613d8700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] -->
> > > > [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] -- mon_probe(probe
> > > > a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
> > > > 0x55c6b35aea00 con 0x55c6b2611600
> > > > 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
> > > > 0x55c6b3323c00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=1
> > > > rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
> > > > 2021-07-25 16:28:41.814 7fcc62bdb700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
> > > > 0x55c6b3323500 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=1
> > > > rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
> > > > 2021-07-25 16:28:41.814 7fcc62bdb700 10 mon.osd01@0(probing) e1
> > > > ms_get_authorizer for mon
> > > > 2021-07-25 16:28:41.814 7fcc5dbd1700 10 mon.osd01@0(probing) e1
> > > > ms_get_authorizer for mon
> > > > 2021-07-25 16:28:41.814 7fcc62bdb700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
> > > > msgr2=0x55c6b3323500 secure :-1 s=STATE_CONNECTION_ESTABLISHED
> > > > l=0).read_bulk peer close file descriptor 27
> > > > 2021-07-25 16:28:41.814 7fcc62bdb700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
> > > > msgr2=0x55c6b3323500 secure :-1 s=STATE_CONNECTION_ESTABLISHED
> > > > l=0).read_until read failed
> > > > 2021-07-25 16:28:41.814 7fcc62bdb700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
> > > > 0x55c6b3323500 secure :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=0 rev1=1
> > > > rx=0x55c6b34bbad0 tx=0x55c6b3528130).handle_read_frame_preamble_main
> > > > read frame preamble failed r=-1 ((1) Operation not permitted)
> > > > 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
> > > > msgr2=0x55c6b3323c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED
> > > > l=0).read_bulk peer close file descriptor 28
> > > > 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
> > > > msgr2=0x55c6b3323c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED
> > > > l=0).read_until read failed
> > > > 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
> > > > 0x55c6b3323c00 secure :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=0 rev1=1
> > > > rx=0x55c6b3553830 tx=0x55c6b34809a0).handle_read_frame_preamble_main
> > > > read frame preamble failed r=-1 ((1) Operation not permitted)
> > > > 2021-07-25 16:28:41.814 7fcc62bdb700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
> > > > 0x55c6b3323500 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=1 rx=0
> > > > tx=0)._fault waiting 15.000000
> > > > 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
> > > > 0x55c6b3323c00 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=1 rx=0
> > > > tx=0)._fault waiting 15.000000
> > > > 2021-07-25 16:28:42.934 7fcc5dbd1700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > conn(0x55c6b35a0d00 0x55c6b3325100 unknown :-1 s=NONE pgs=0 cs=0 l=0
> > > > rev1=0 rx=0 tx=0).accept
> > > > 2021-07-25 16:28:42.934 7fcc633dc700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > conn(0x55c6b35a0d00 0x55c6b3325100 unknown :-1 s=BANNER_ACCEPTING
> > > > pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload
> > > > supported=1 required=0
> > > > 2021-07-25 16:28:42.934 7fcc62bdb700  1 --1-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > conn(0x55c6b355ba80 0x55c6b3514800 :6789 s=ACCEPTING pgs=0 cs=0
> > > > l=0).send_server_banner sd=28 legacy v1:10.152.28.171:6789/0
> > > > socket_addr v1:10.152.28.171:6789/0 target_addr
> > > > v1:10.152.28.172:50976/0
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
> > > > ms_handle_accept con 0x55c6b355ba80 no session
> > > > 2021-07-25 16:28:42.934 7fcc633dc700 10 mon.osd01@0(probing) e1
> > > > handle_auth_request con 0x55c6b35a0d00 (start) method 2 payload 22
> > > > 2021-07-25 16:28:42.934 7fcc633dc700 10 mon.osd01@0(probing) e1
> > > > handle_auth_request haven't formed initial quorum, EBUSY
> > > > 2021-07-25 16:28:42.934 7fcc633dc700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > conn(0x55c6b35a0d00 0x55c6b3325100 secure :-1 s=AUTH_ACCEPTING pgs=0
> > > > cs=0 l=1 rev1=1 rx=0 tx=0).stop
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
> > > > ms_handle_reset 0x55c6b35a0d00 -
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] <== client.?
> > > > v1:10.152.28.172:0/3094543445 1 ==== auth(proto 0 34 bytes epoch 0) v1
> > > > ==== 64+0+0 (unknown 4015746775 0 0) 0x55c6b351d840 con 0x55c6b355ba80
> > > > 2021-07-25 16:28:42.934 7fcc62bdb700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80
> > > > legacy=0x55c6b3514800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED
> > > > l=1).read_bulk peer close file descriptor 28
> > > > 2021-07-25 16:28:42.934 7fcc62bdb700  1 --
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80
> > > > legacy=0x55c6b3514800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED
> > > > l=1).read_until read failed
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
> > > > _ms_dispatch new session 0x55c6b351c880 MonSession(client.?
> > > > v1:10.152.28.172:0/3094543445 is open , features 0x3ffddff8ffecffff
> > > > (luminous)) features 0x3ffddff8ffecffff
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700 20 mon.osd01@0(probing) e1
> > > > entity_name  global_id 0 (none) caps
> > > > 2021-07-25 16:28:42.934 7fcc62bdb700  1 --1-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 0x55c6b3514800 :6789
> > > > s=OPENED pgs=1 cs=1 l=1).handle_message read tag failed
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700  5 mon.osd01@0(probing) e1
> > > > waitlisting message auth(proto 0 34 bytes epoch 0) v1
> > > > 2021-07-25 16:28:42.934 7fcc62bdb700  1 --1-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 0x55c6b3514800 :6789
> > > > s=OPENED pgs=1 cs=1 l=1).fault on lossy channel, failing
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
> > > > ms_handle_reset 0x55c6b355ba80 v1:10.152.28.172:0/3094543445
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
> > > > reset/close on session client.? v1:10.152.28.172:0/3094543445
> > > > 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
> > > > remove_session 0x55c6b351c880 client.? v1:10.152.28.172:0/3094543445
> > > > features 0x3ffddff8ffecffff
> > > > 2021-07-25 16:28:42.938 7fcc5dbd1700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > conn(0x55c6b3559b00 0x55c6b34ffc00 unknown :-1 s=NONE pgs=0 cs=0 l=0
> > > > rev1=0 rx=0 tx=0).accept
> > > > 2021-07-25 16:28:42.938 7fcc633dc700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > conn(0x55c6b3559b00 0x55c6b34ffc00 unknown :-1 s=BANNER_ACCEPTING
> > > > pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload
> > > > supported=1 required=0
> > > > 2021-07-25 16:28:42.938 7fcc633dc700 10 mon.osd01@0(probing) e1
> > > > handle_auth_request con 0x55c6b3559b00 (start) method 2 payload 22
> > > > 2021-07-25 16:28:42.938 7fcc633dc700 10 mon.osd01@0(probing) e1
> > > > handle_auth_request haven't formed initial quorum, EBUSY
> > > > 2021-07-25 16:28:42.938 7fcc633dc700  1 --2-
> > > > [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
> > > > conn(0x55c6b3559b00 0x55c6b34ffc00 secure :-1 s=AUTH_ACCEPTING pgs=0
> > > > cs=0 l=1 rev1=1 rx=0 tx=0).stop
> > > > 2021-07-25 16:28:42.938 7fcc5ebd3700 10 mon.osd01@0(probing) e1
> > > > ms_handle_reset 0x55c6b3559b00 -
> > > > 2021-07-25 16:28:43.418 7fcc613d8700  4 mon.osd01@0(probing) e1
> > > > probe_timeout 0x55c6b34ba0f0
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 bootstrap
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > sync_reset_requester
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > unregister_cluster_logger - not registered
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > cancel_probe_timeout (none scheduled)
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 monmap
> > > > e1: 3 mons at {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0],osd03=[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0]}
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 _reset
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing).auth v0
> > > > _set_mon_num_rank num 0 rank 0
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > cancel_probe_timeout (none scheduled)
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 timecheck_finish
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 15 mon.osd01@0(probing) e1 health_tick_stop
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 15 mon.osd01@0(probing) e1
> > > > health_interval_stop
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > scrub_event_cancel
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 scrub_reset
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > cancel_probe_timeout (none scheduled)
> > > > 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
> > > > reset_probe_timeout 0x55c6b3553260 after 2 seconds
> > > >
> > > >
> > > > It still looks like a connection issue, but I can connect using telnet:
> > > >
> > > > root@osd01:~# telnet 10.152.28.172 6789
> > > > Trying 10.152.28.172...
> > > > Connected to 10.152.28.172.
> > > > Escape character is '^]'.
> > > > ceph v027
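> > > >
> > > > (the "ceph v027" banner confirms the v1 port is answering; the msgr2
> > > > port can be checked the same way, e.g.:
> > > >   telnet 10.152.28.172 3300
> > > > which should return a "ceph v2" banner followed by binary framing)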
> > > >
> > > >
> > > >
> > > > > .. Dan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sun, 25 Jul 2021, 17:53 Ansgar Jazdzewski, <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
> > > > >>
> > > > >> On Sun, 25 Jul 2021 at 17:17, Dan van der Ster
> > > > >> <dan@xxxxxxxxxxxxxx> wrote:
> > > > >> >
> > > > >> > > raise the min version to nautilus
> > > > >> >
> > > > >> > Are you referring to the min osd version or the min client version?
> > > > >>
> > > > >> Yes, sorry, that was not written clearly.
> > > > >>
> > > > >> > I don't think the latter will help.
> > > > >> >
> > > > >> > Are you sure that mon.osd01 can reach those other mons on ports 6789 and 3300?
> > > > >>
> > > > >> Yes, I just tested it one more time: ping (MTU-sized) and telnet to all mon ports.
> > > > >>
> > > > >> > Do you have any notable custom ceph configurations on this cluster?
> > > > >>
> > > > >> No, nothing fancy, I think:
> > > > >>
> > > > >> [global]
> > > > >> cluster network = 10.152.40.0/22
> > > > >> fsid = a6baa789-6be2-4ce0-ab2d-7c78b899d4bd
> > > > >> mon host = 10.152.28.171,10.152.28.172,10.152.28.173
> > > > >> mon initial members = osd01,osd02,osd03
> > > > >> osd pool default crush rule = -1
> > > > >> public network = 10.152.28.0/22
> > > > >>
> > > > >>
> > > > >> I just tried to start the mon with force-sync, but as the mon does
> > > > >> not join, it will not pull any data:
> > > > >> ceph-mon -f --cluster ceph --id osd01 --setuser ceph --setgroup ceph
> > > > >> --debug_mon 10 --yes-i-really-mean-it --force-sync -d
> > > > >>
> > > > >> I can try to move the /var/lib/ceph/mon/ dir and recreate it!?
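> > > > >>
> > > > >> (if I do, it is probably safer to move the directory aside than to
> > > > >> delete it, something like:
> > > > >>   systemctl stop ceph-mon@$(hostname -s)
> > > > >>   mv /var/lib/ceph/mon/ceph-$(hostname -s) /var/lib/ceph/mon/ceph-$(hostname -s).bak
> > > > >> so the old store can still be inspected later)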
> > > > >>
> > > > >>
> > > > >> thanks for all the help so far!
> > > > >> Ansgar