Re: one of 3 monitors keeps going down

Have you checked for disk failure? dmesg, smartctl etc. ?
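
For example, something along these lines (the device name /dev/nvme0 is a placeholder - substitute the actual boot drive):

# kernel messages that typically accompany a failing drive or controller
dmesg -T | grep -iE 'nvme|i/o error|ata|fail'

# overall SMART health plus the full attribute dump
smartctl -H /dev/nvme0
smartctl -a /dev/nvme0

# NVMe-specific health counters (media errors, spare capacity), if nvme-cli is installed
nvme smart-log /dev/nvme0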


Zitat von "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:

I worked through that workflow, but the one monitor will only run for a while - anywhere from an hour to a day - and then just stop.

This machine is running on AMD hardware (3600X CPU on an X570 chipset), while my other two are running on older Intel hardware.

I did find this in the service logs:

2021-04-30T16:02:40.135+0000 7f5d0a94f700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730 code = 2 Rocksdb transaction:

I am attaching the output of
journalctl -u ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx.service

The error appears to be here:
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -61> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube@-1(???).mgr e702 active server: [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -60> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube@-1(???).mgr e702 mkfs or daemon transitioned to available, loading commands
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -59> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no callback set
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -58> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals client_cache_size = 32768
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -57> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals container_image = docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -56> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals log_to_syslog = true
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -55> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals mon_data_avail_warn = 10
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -54> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals mon_warn_on_insecure_global_id_reclaim_allowed = true
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -53> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no callback set
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -52> 2021-04-30T16:02:38.702+0000 7f5d21332700 2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-cube/keyring
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -51> 2021-04-30T16:02:38.702+0000 7f5d1095b700 3 rocksdb: [db_impl/db_impl_compaction_flush.cc:2808] Compaction error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -50> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000) register_command compact hook 0x56327e028700
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -49> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[5 0 0 0 0 0 2] max score 0.00, MB/sec: 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1, 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0) Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730, records in: 7670, records dropped: 6759 output_compres
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -48> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros": 1619798558703277, "job": 3, "event": "compaction_finished", "compaction_time_micros": 15085, "compaction_time_cpu_micros": 11937, "output_level": 6, "num_output_files": 1, "total_output_size": 12627499, "num_input_records": 7670, "num_output_records": 911, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [5, 0, 0, 0, 0, 0, 2]}
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -47> 2021-04-30T16:02:38.702+0000 7f5d1095b700 2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background compaction error: Corruption: block checksum mismatch: expected 395334538, got 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 84730, Accumulated background error counts: 1
Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -46> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000) register_command smart hook 0x56327e028700


This is running the latest Pacific container, but I was seeing the same issue on Octopus.

The container runs under podman on RHEL 8, and /var/lib/ceph/mon/ceph-cube is mapped to /var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867/mon.cube.service on the NVMe boot drive, which has plenty of space.
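
For reference, checking the free space and the bind mount can be done roughly like this (the container name is a placeholder - whatever podman ps lists for mon.cube):

# free space on the filesystem backing the mon store
df -h /var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867

# confirm what is actually bind-mounted into the mon container
podman inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' <mon.cube-container>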

To recover, I run a script that stops the monitor on another host, copies its store.db directory over, and then starts everything back up, and the monitor syncs right up.
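
Roughly, the script boils down to something like this (hostnames, unit names and the host-side paths are illustrative - rhel1 is assumed as the donor monitor; adjust to whatever cephadm actually uses on each host):

# recovery sketch: copy a healthy mon store over the corrupted one
FSID=fe3a7cb0-69ca-11eb-8d45-c86000d08867
DONOR=rhel1.robeckert.us
MON_DIR=/var/lib/ceph/$FSID   # host-side directory cephadm bind-mounts

# stop the donor monitor and the broken one
ssh $DONOR systemctl stop ceph-$FSID@mon.rhel1.service
systemctl stop ceph-$FSID@mon.cube.service

# replace the corrupted store.db with a copy from the healthy monitor
rm -rf $MON_DIR/mon.cube/store.db
rsync -a $DONOR:$MON_DIR/mon.rhel1/store.db $MON_DIR/mon.cube/

# start both monitors again; mon.cube then syncs up with the quorum
ssh $DONOR systemctl start ceph-$FSID@mon.rhel1.service
systemctl start ceph-$FSID@mon.cube.service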



Thanks,
Rob





-----Original Message-----
From: Sebastian Wagner <sewagner@xxxxxxxxxx>
Sent: Thursday, April 29, 2021 7:44 AM
To: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx
Subject:  Re: one of 3 monitors keeps going down

Right, here are the docs for that workflow:

https://docs.ceph.com/en/latest/cephadm/mon/#mon-service
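
In practice that workflow boils down to something like the following (the daemon name mon.cube and the placement hosts are taken from this thread - verify against ceph orch ps first):

# see how the orchestrator currently views the daemons
ceph orch ps

# remove the broken mon and let cephadm recreate it with a fresh store
ceph orch daemon rm mon.cube --force
ceph orch apply mon --placement="rhel1.robeckert.us story cube"

# or, to simply redeploy the existing daemon in place
ceph orch daemon redeploy mon.cube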

On 29.04.21 at 13:13, Eugen Block wrote:
Hi,

instead of copying the MON data to this one, did you also try to
redeploy the MON container entirely so it gets a fresh start?


Zitat von "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:

Hi,
On a daily basis, one of my monitors goes down:

[root@cube ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum rhel1.robeckert.us,story
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon mon.cube on cube.robeckert.us is in error state
[WRN] MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story
    mon.cube (rank 2) addr [v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of quorum)
[root@cube ~]# ceph --version
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)

I have a script that copies the mon data from another server; after
that the monitor restarts and runs well for a while.

It is always the same monitor, and when I look at the logs, the only
thing I really see is the cephadm log showing it down:

2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2
2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-26 17:13:15.54183375 -0400 EDT,
2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled
2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx
2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed
2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman --version
2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version 2.2.1
2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube
2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout 04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-28 08:54:57.614847512 -0400 EDT,

I don't know if it matters, but this server is an AMD 3600XT, while
my other two servers, which have had no issues, are Intel based.

The root file system was originally on an SSD, and I switched to NVMe,
so I believe I have ruled out controller or drive issues. (I didn't
see anything in dmesg anyway.)

If someone could point me in the right direction on where to
troubleshoot next, I would appreciate it.

Thanks,
Rob Eckert
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



