Re: one of 3 monitors keeps going down

Have you checked for disk failure? dmesg, smartctl etc. ?
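
For example, something along these lines (the device name /dev/nvme0 is a placeholder - substitute the actual boot drive):

# kernel messages that typically accompany a failing drive or controller
dmesg -T | grep -iE 'nvme|i/o error|ata|fail'

# overall SMART health plus the full attribute dump
smartctl -H /dev/nvme0
smartctl -a /dev/nvme0

# NVMe-specific health counters (media errors, spare capacity), if nvme-cli is installed
nvme smart-log /dev/nvme0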


Zitat von "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:

I worked through that workflow, but the one monitor will only run for a while - anywhere from an hour to a day - and then just stop.

This machine is running on AMD hardware (3600X CPU on an X570 chipset), while my other two are running on older Intel hardware.

I did find this in the service logs:

2021-04-30T16:02:40.135+0000 7f5d0a94f700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730 code = 2 Rocksdb transaction:

I am attaching the output of
journalctl -u ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx.service

The error appears to be here:
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -61> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube@-1(???).mgr e702 active server: [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -60> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube@-1(???).mgr e702 mkfs or daemon transitioned to available, loading commands
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -59> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no callback set
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -58> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals client_cache_size = 32768
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -57> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals container_image = docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -56> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals log_to_syslog = true
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -55> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals mon_data_avail_warn = 10
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -54> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals mon_warn_on_insecure_global_id_reclaim_allowed = true
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -53> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no callback set
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -52> 2021-04-30T16:02:38.702+0000 7f5d21332700 2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-cube/keyring
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -51> 2021-04-30T16:02:38.702+0000 7f5d1095b700 3 rocksdb: [db_impl/db_impl_compaction_flush.cc:2808] Compaction error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -50> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000) register_command compact hook 0x56327e028700
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -49> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[5 0 0 0 0 0 2] max score 0.00, MB/sec: 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1, 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0) Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730, records in: 7670, records dropped: 6759 output_compres
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -48> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros": 1619798558703277, "job": 3, "event": "compaction_finished", "compaction_time_micros": 15085, "compaction_time_cpu_micros": 11937, "output_level": 6, "num_output_files": 1, "total_output_size": 12627499, "num_input_records": 7670, "num_output_records": 911, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [5, 0, 0, 0, 0, 0, 2]}
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -47> 2021-04-30T16:02:38.702+0000 7f5d1095b700 2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background compaction error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730, Accumulated background error counts: 1
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -46> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000) register_command smart hook 0x56327e028700


This is running the latest Pacific container, but I was seeing the same issue on Octopus.

The container runs under podman on RHEL 8, and /var/lib/ceph/mon/ceph-cube is mapped to /var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867/mon.cube.service on the NVMe boot drive, which has plenty of space.
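
For reference, checking the free space and the bind mount can be done roughly like this (the container name is a placeholder - whatever podman ps lists for mon.cube):

# free space on the filesystem backing the mon store
df -h /var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867

# confirm what is actually bind-mounted into the mon container
podman inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' <mon.cube-container>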

To recover, I run a script that stops the monitor on another host, copies its store.db directory over, and then starts everything back up, and the monitor syncs right up.
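
Roughly, the script boils down to something like this (hostnames, unit names and the host-side paths are illustrative - rhel1 is assumed as the donor monitor; adjust to whatever cephadm actually uses on each host):

# recovery sketch: copy a healthy mon store over the corrupted one
FSID=fe3a7cb0-69ca-11eb-8d45-c86000d08867
DONOR=rhel1.robeckert.us
MON_DIR=/var/lib/ceph/$FSID   # host-side directory cephadm bind-mounts

# stop the donor monitor and the broken one
ssh $DONOR systemctl stop ceph-$FSID@mon.rhel1.service
systemctl stop ceph-$FSID@mon.cube.service

# replace the corrupted store.db with a copy from the healthy monitor
rm -rf $MON_DIR/mon.cube/store.db
rsync -a $DONOR:$MON_DIR/mon.rhel1/store.db $MON_DIR/mon.cube/

# start both monitors again; mon.cube then syncs up with the quorum
ssh $DONOR systemctl start ceph-$FSID@mon.rhel1.service
systemctl start ceph-$FSID@mon.cube.service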



Thanks,
Rob





-----Original Message-----
From: Sebastian Wagner <sewagner@xxxxxxxxxx>
Sent: Thursday, April 29, 2021 7:44 AM
To: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx
Subject:  Re: one of 3 monitors keeps going down

Right, here are the docs for that workflow:

https://docs.ceph.com/en/latest/cephadm/mon/#mon-service
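
In practice that workflow boils down to something like the following (the daemon name mon.cube and the placement hosts are taken from this thread - verify against ceph orch ps first):

# see how the orchestrator currently views the daemons
ceph orch ps

# remove the broken mon and let cephadm recreate it with a fresh store
ceph orch daemon rm mon.cube --force
ceph orch apply mon --placement="rhel1.robeckert.us story cube"

# or, to simply redeploy the existing daemon in place
ceph orch daemon redeploy mon.cube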

On 29.04.21 at 13:13, Eugen Block wrote:
Hi,

instead of copying the MON data to this one, did you also try to
redeploy the MON container entirely so it gets a fresh start?


Zitat von "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:

Hi,
On a daily basis, one of my monitors goes down:

[root@cube ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum rhel1.robeckert.us,story
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.cube on cube.robeckert.us is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story
    mon.cube (rank 2) addr [v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of quorum)
[root@cube ~]# ceph --version
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)

I have a script that copies the mon data from another server; after
that the monitor restarts and runs well for a while.

It is always the same monitor, and when I look at the logs, the only
thing I really see is the cephadm log showing it down:

2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2
2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-26 17:13:15.54183375 -0400 EDT,
2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled
2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed
2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube
2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout 04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-28 08:54:57.614847512 -0400 EDT,

I don't know if it matters, but this server is an AMD 3600XT, while
my other two servers, which have had no issues, are Intel based.

The root file system was originally on an SSD, and I switched to NVMe,
so I believe I have ruled out controller or drive issues. (I didn't
see anything in dmesg anyway.)

If someone could point me in the right direction on where to
troubleshoot next, I would appreciate it.

Thanks,
Rob Eckert
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



