Nothing is appearing in dmesg, and smartctl shows no issues either. I did find this issue, https://tracker.ceph.com/issues/24968, which describes something that may be memory related, so I will try testing that next.

-----Original Message-----
From: Eugen Block <eblock@xxxxxx>
Sent: Friday, April 30, 2021 1:36 PM
To: Robert W. Eckert <rob@xxxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx; Sebastian Wagner <sewagner@xxxxxxxxxx>
Subject: Re: Re: one of 3 monitors keeps going down

Have you checked for disk failure? dmesg, smartctl etc.?

Quoting "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:

> I worked through that workflow, but the one monitor will only run for
> a while - anywhere from an hour to a day - and then just stop.
>
> This machine is running on AMD hardware (3600X CPU on an X570 chipset)
> while my other two are running on older Intel hardware.
>
> I did find this in the service logs:
>
> 2021-04-30T16:02:40.135+0000 7f5d0a94f700 -1 rocksdb: submit_common
> error: Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730 code = 2 Rocksdb transaction:
>
> I am attaching the output of
> journalctl -u ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx.service
>
> The error appears to be here:
>
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -61>
> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube@-1(???).mgr
> e702 active server:
> [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -60>
> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube@-1(???).mgr
> e702 mkfs or daemon transitioned to available, loading commands
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -59>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no callback
> set
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -58>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> client_cache_size = 32768
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -57>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> container_image =
> docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -56>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> log_to_syslog = true
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -55>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> mon_data_avail_warn = 10
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -54>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> mon_warn_on_insecure_global_id_reclaim_allowed = true
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -53>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no callback
> set
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -52>
> 2021-04-30T16:02:38.702+0000 7f5d21332700 2 auth: KeyRing::load:
> loaded key file /var/lib/ceph/mon/ceph-cube/keyring
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -51>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 3 rocksdb:
> [db_impl/db_impl_compaction_flush.cc:2808] Compaction error:
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -50>
> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000)
> register_command compact hook 0x56327e028700
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -49>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log
> Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760]
> [default] compacted to: base level 6 level multiplier 10.00 max
> bytes base 268435456 files[5 0 0 0 0 0 2] max score 0.00, MB/sec:
> 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1,
> 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0)
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730, records in: 7670, records dropped: 6759
> output_compres
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -48>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log
> Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros":
> 1619798558703277, "job": 3, "event": "compaction_finished",
> "compaction_time_micros": 15085, "compaction_time_cpu_micros":
> 11937, "output_level": 6, "num_output_files": 1,
> "total_output_size": 12627499, "num_input_records": 7670,
> "num_output_records": 911, "num_subcompactions": 1,
> "output_compression": "NoCompression",
> "num_single_delete_mismatches": 0, "num_single_delete_fallthrough":
> 0, "lsm_state": [5, 0, 0, 0, 0, 0, 2]}
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -47>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 2 rocksdb:
> [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background
> compaction error: Corruption: block checksum mismatch: expected
> 395334538, got 4289108204 in
> /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size
> 84730, Accumulated background error counts: 1
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -46>
> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000)
> register_command smart hook 0x56327e028700
>
> This is running the latest Pacific container, but I was seeing the
> same issue in Octopus.
>
> The container runs under podman on RHEL 8, and
> /var/lib/ceph/mon/ceph-cube is mapped to
> /var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867/mon.cube.service
> on the NVMe boot drive, which has plenty of space.
>
> To recover, I run a script that stops the monitor, copies the
> store.db directory from another host, and starts the monitor back
> up; it syncs right up. (A rough sketch of that recovery is below.)
>
> Thanks,
> Rob
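Corruption like the above can be double-checked independently of the daemon by walking the store with ceph-kvstore-tool while the mon is stopped. A minimal sketch, assuming the systemd unit is ceph-<fsid>@mon.cube.service (the daemon name is munged to @xxxxxxxx by the list archive) and using the store path from the log:

    # stop the flapping monitor; unit name assumed from the fsid in this thread
    systemctl stop ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@mon.cube.service

    # open a shell in the mon's container with its data directory mounted
    cephadm shell --name mon.cube

    # read every key/value pair and print CRCs; a bad .sst block should
    # surface the same "block checksum mismatch" error here
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-cube/store.db list-crc > /dev/null

If this reliably fails on a store that was freshly copied from a healthy mon, the files are being damaged after they land on disk, which points at hardware (RAM, controller, or drive) rather than at Ceph.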
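The copy-based recovery Rob describes might look roughly like this. This is a sketch only: the donor host name, fsid, and host-side data paths are assumptions based on the thread, and both mons are stopped during the copy so the donor's store.db is consistent:

    #!/bin/bash
    # Sketch of the copy-based mon recovery described above.
    # Assumptions: cephadm-managed hosts, fsid from this thread, a healthy
    # donor mon on host "story", data dirs under /var/lib/ceph/<fsid>,
    # and the ceph container user being uid/gid 167.
    FSID=fe3a7cb0-69ca-11eb-8d45-c86000d08867
    DONOR=story

    # stop both mons so the donor's store.db is quiescent during the copy
    ssh "$DONOR" systemctl stop "ceph-$FSID@mon.$DONOR.service"
    systemctl stop "ceph-$FSID@mon.cube.service"

    # swap in the donor's store.db
    rm -rf "/var/lib/ceph/$FSID/mon.cube/store.db"
    scp -r "$DONOR:/var/lib/ceph/$FSID/mon.$DONOR/store.db" \
        "/var/lib/ceph/$FSID/mon.cube/"
    chown -R 167:167 "/var/lib/ceph/$FSID/mon.cube/store.db"

    # bring both back; the restored mon catches up from the quorum
    ssh "$DONOR" systemctl start "ceph-$FSID@mon.$DONOR.service"
    systemctl start "ceph-$FSID@mon.cube.service"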
> -----Original Message-----
> From: Sebastian Wagner <sewagner@xxxxxxxxxx>
> Sent: Thursday, April 29, 2021 7:44 AM
> To: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx
> Subject: Re: one of 3 monitors keeps going down
>
> Right, here are the docs for that workflow:
>
> https://docs.ceph.com/en/latest/cephadm/mon/#mon-service
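Condensed, the redeploy workflow from those docs amounts to something like the following sketch (daemon name and IP taken from the health output further down; this assumes an Octopus/Pacific-era orchestrator):

    # remove the broken mon daemon and its local data from the host
    ceph orch daemon rm mon.cube --force

    # if host "cube" is still covered by the mon placement spec, cephadm
    # recreates the daemon on its own; otherwise add it back explicitly
    ceph orch daemon add mon cube:192.168.2.142

A freshly deployed mon starts with an empty store and does a full sync from the quorum, which avoids hand-copying store.db around.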
Eckert" <rob@xxxxxxxxxxxxxxx>: >> >>> Hi, >>> On a daily basis, one of my monitors goes down >>> >>> [root@cube ~]# ceph health detail >>> HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum >>> rhel1.robeckert.us,story [WRN] CEPHADM_FAILED_DAEMON: 1 failed >>> cephadm daemon(s) >>> daemon mon.cube on cube.robeckert.us is in error state [WRN] >>> MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story >>> mon.cube (rank 2) addr >>> [v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of >>> quorum) [root@cube ~]# ceph --version ceph version 15.2.11 >>> (e3523634d9c2227df9af89a4eac33d16738c49cb) >>> octopus (stable) >>> >>> I have a script that will copy the mon data from another server and >>> it restarts and runs well for a while. >>> >>> It is always the same monitor, and when I look at the logs the only >>> thing I really see is the cephadm log showing it down >>> >>> 2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman >>> --version >>> 2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version >>> 2.2.1 >>> 2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman >>> inspect --format >>> {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index >>> .Config.Labels "io.ceph.version"}} >>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2 >>> 2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout >>> fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,doc >>> k >>> er.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79d >>> a >>> 452188daf2af72e,2021-04-26 >>> 17:13:15.54183375 -0400 EDT, >>> 2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled >>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx<mailto:ceph-fe3a7 >>> c b0-69ca-11eb-8d45-c86000d08867@xxxxxxxx> >>> >>> 2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled >>> 2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active >>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx<mailto:ceph-fe3a7 >>> c b0-69ca-11eb-8d45-c86000d08867@xxxxxxxx> >>> >>> 2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed >>> 2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman >>> --version >>> 2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version >>> 2.2.1 >>> 2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman >>> inspect --format >>> {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index >>> .Config.Labels "io.ceph.version"}} >>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube >>> 2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout >>> 04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,doc >>> k >>> er.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79d >>> a >>> 452188daf2af72e,2021-04-28 >>> 08:54:57.614847512 -0400 EDT, >>> >>> I don't know if it matters, but this server is an AMD 3600XT while >>> my other two servers which have had no issues are intel based. >>> >>> The root file system was originally on a SSD, and I switched to >>> NVME, so I eliminated controller or drive issues. (I didn't see >>> anything in dmesg anyway) >>> >>> If someone could point me in the right direction on where to >>> troubleshoot next, I would appreciate it. 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx