Re: A MON doesn't start after Octopus update

I would re-create the MON; that is usually faster than debugging the root cause.
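
In case it helps, this is roughly the standard remove/re-add procedure from the Ceph docs, as I would apply it on the failed node (a sketch only, not tested on your cluster; the mon ID cluster-r820-1 is taken from your logs, and /tmp/monmap, /tmp/mon.keyring and the .bak directory name are just example names):

# systemctl stop ceph-mon@cluster-r820-1
# ceph mon remove cluster-r820-1
# mv /var/lib/ceph/mon/ceph-cluster-r820-1 /var/lib/ceph/mon/ceph-cluster-r820-1.bak
# ceph mon getmap -o /tmp/monmap
# ceph auth get mon. -o /tmp/mon.keyring
# ceph-mon --mkfs -i cluster-r820-1 --monmap /tmp/monmap --keyring /tmp/mon.keyring
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-cluster-r820-1
# systemctl start ceph-mon@cluster-r820-1

The two healthy MONs keep quorum throughout, and the rebuilt MON should sync its store from them when it starts, provided ceph.conf on that node still carries its mon address (or public network). I would keep the .bak copy of the old store until the MON is back in quorum.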


Quoting Hervé Ballans <herve.ballans@xxxxxxxxxxxxx>:

Hi,

We are encountering a strange issue on one of our Ceph clusters.

The cluster has been running without any problems for many years. Recently, however, after an upgrade from Octopus 15.2.1 to 15.2.4, everything is fine except that one of the 3 MONs refuses to start!

journalctl shows problems with the RocksDBStore, in particular a "Bad table magic number" error.

On the other nodes the MONs start correctly, as do all the other services (MDS, MGR and OSD), except for 2 OSDs that are stopped due to a hardware disk problem. Some OSDs are nearfull, but I don't think that is related to the MON issue?

Could someone please help me with this issue?

Thanks in advance,
Hervé

Here are a few logs:

# ceph versions
{
    "mon": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 131
    },
    "mds": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "overall": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 140
    }
}


# ceph -s
  cluster:
    id:     00ff6861-f95c-4740-a570-da6ad6a261cd
    health: HEALTH_WARN
            1/3 mons down, quorum inf-ceph-mds,cluster-r730-k80-1
            8 backfillfull osd(s)
            18 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 21 pgs backfill_toofull
            Degraded data redundancy: 5570/50534523 objects degraded (0.011%), 2 pgs degraded, 2 pgs undersized
            2 pool(s) backfillfull
            1 pool(s) nearfull
            25 daemons have recently crashed

  services:
    mon: 3 daemons, quorum inf-ceph-mds,cluster-r730-k80-1 (age 44h), out of quorum: cluster-r820-1
    mgr: cluster-r730-k80-1(active, since 44h), standbys: inf-ceph-mds, cluster-r820-1
    mds: cephfs_cluster:2 {0=cluster-r730-k80-1=up:active,1=inf-ceph-mds=up:active} 1 up:standby
    osd: 133 osds: 131 up (since 42h), 131 in (since 42h); 21 remapped pgs

  task status:
    scrub status:
        mds.cluster-r730-k80-1: idle
        mds.inf-ceph-mds: idle

  data:
    pools:   3 pools, 4129 pgs
    objects: 16.84M objects, 47 TiB
    usage:   143 TiB used, 33 TiB / 176 TiB avail
    pgs:     5570/50534523 objects degraded (0.011%)
             75039/50534523 objects misplaced (0.148%)
             4104 active+clean
             19   active+remapped+backfill_toofull
             3    active+clean+scrubbing+deep
             2    active+undersized+degraded+remapped+backfill_toofull
             1    active+clean+scrubbing

  io:
    client:   19 MiB/s rd, 26 MiB/s wr, 5 op/s rd, 11 op/s wr


# systemctl status ceph-mon@cluster-r820-1.service
● ceph-mon@cluster-r820-1.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Fri 2020-07-10 11:44:33 CEST; 24min ago
  Process: 201002 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id cluster-r820-1 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
 Main PID: 201002 (code=killed, signal=ABRT)

Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Service RestartSec=10s expired, scheduling restart.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Scheduled restart job, restart counter is at 5.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: Stopped Ceph cluster monitor daemon.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Start request repeated too quickly.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Failed with result 'signal'.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: Failed to start Ceph cluster monitor daemon.


# journalctl --follow

Jul 10 11:43:38 cluster-r820-1 systemd[1]: Started Ceph cluster monitor daemon.
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: /build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: /build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 5836668660487291005 in /var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: *** Caught signal (Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  in thread 7fe5331a65c0 thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2020-07-10T11:43:39.192+0200 7fe5331a65c0 -1 /build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: /build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 5836668660487291005 in /var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (()+0x12730) [0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (gsignal()+0x10b) [0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (abort()+0x121) [0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  6: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  7: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  8: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2020-07-10T11:43:39.196+0200 7fe5331a65c0 -1 *** Caught signal (Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  in thread 7fe5331a65c0 thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (()+0x12730) [0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (gsignal()+0x10b) [0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (abort()+0x121) [0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  6: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  7: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  8: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:     -1> 2020-07-10T11:43:39.192+0200 7fe5331a65c0 -1 /build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: /build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 5836668660487291005 in /var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:      0> 2020-07-10T11:43:39.196+0200 7fe5331a65c0 -1 *** Caught signal (Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  in thread 7fe5331a65c0 thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (()+0x12730) [0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (gsignal()+0x10b) [0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (abort()+0x121) [0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  6: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  7: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  8: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:     -1> 2020-07-10T11:43:39.192+0200 7fe5331a65c0 -1 /build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: /build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 5836668660487291005 in /var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:      0> 2020-07-10T11:43:39.196+0200 7fe5331a65c0 -1 *** Caught signal (Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  in thread 7fe5331a65c0 thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  1: (()+0x12730) [0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  2: (gsignal()+0x10b) [0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  3: (abort()+0x121) [0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  6: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  7: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  8: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 10 11:43:39 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Main process exited, code=killed, status=6/ABRT
Jul 10 11:43:39 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Failed with result 'signal'.
Jul 10 11:43:49 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Service RestartSec=10s expired, scheduling restart.
...

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





