Re: one of 3 monitors keeps going down

"Robert W. Eckert" <rob@xxxxxxxxxxxxxxx> · Fri, 30 Apr 2021 18:25:00 +0000

Nothing is appearing in dmesg.  Smartctl shows no issues either.  

I did find this issue https://tracker.ceph.com/issues/24968 which showed something that may be memory related, so I will try testing that next.

-----Original Message-----
From: Eugen Block <eblock@xxxxxx> 
Sent: Friday, April 30, 2021 1:36 PM
To: Robert W. Eckert <rob@xxxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx; Sebastian Wagner <sewagner@xxxxxxxxxx>
Subject: Re:  Re: one of 3 monitors keeps going down

Have you checked for disk failure? dmesg, smartctl etc. ?

Zitat von "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:

> I worked through that workflow- but it seems like the one monitor will 
> run for a while - anywhere from an hour to a day, then just stop.
>
> This machine is running on AMD hardware (3600X CPU on X570 chipset) 
> while my other two are running on old intel.
>
> I did find this in the service logs
>
> 2021-04-30T16:02:40.135+0000 7f5d0a94f700 -1 rocksdb: submit_common
> error: Corruption: block checksum mismatch: expected 395334538, got
> 4289108204  in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730 code = 2 Rocksdb transaction:
>
> I am attaching the output of
> journalctl -u 
> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx.service
>
> The error appears to be here:
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -61>  
> 2021-04-30T16:02:38.700+0000 7f5d21332700  4 mon.cube@-1(???).mgr
> e702 active server:  
> [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -60>  
> 2021-04-30T16:02:38.700+0000 7f5d21332700  4 mon.cube@-1(???).mgr
> e702 mkfs or daemon transitioned to available, loading commands
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -59>  
> 2021-04-30T16:02:38.701+0000 7f5d21332700  4 set_mon_vals no callback 
> set
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -58>  
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals 
> client_cache_size = 32768
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -57>  
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals 
> container_image =
> docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -56>  
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals 
> log_to_syslog = true
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -55>  
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals 
> mon_data_avail_warn = 10
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -54>  
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals 
> mon_warn_on_insecure_global_id_reclaim_allowed = true
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -53>  
> 2021-04-30T16:02:38.701+0000 7f5d21332700  4 set_mon_vals no callback 
> set
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -52>  
> 2021-04-30T16:02:38.702+0000 7f5d21332700  2 auth: KeyRing::load:  
> loaded key file /var/lib/ceph/mon/ceph-cube/keyring
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -51>  
> 2021-04-30T16:02:38.702+0000 7f5d1095b700  3 rocksdb:  
> [db_impl/db_impl_compaction_flush.cc:2808] Compaction error:  
> Corruption: block checksum mismatch: expected 395334538, got  
> 4289108204  in /var/lib/ceph/mon/ceph-	cube/store.db/073501.sst  
> offset 36769734 size 84730
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -50>  
> 2021-04-30T16:02:38.702+0000 7f5d21332700  5 asok(0x56327d226000) 
> register_command compact hook 0x56327e028700
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -49>  
> 2021-04-30T16:02:38.702+0000 7f5d1095b700  4 rocksdb: (Original Log 
> Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760]
> [default] compacted to: base level 6 level multiplier 10.00 max  
> bytes base 268435456 files[5 0 	0 0 0 0 2] max score 0.00, MB/sec:  
> 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1,
> 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0)
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204  in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst  
> offset 36769734 size 	84730, records in: 7670, records dropped: 6759  
> output_compres
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -48>  
> 2021-04-30T16:02:38.702+0000 7f5d1095b700  4 rocksdb: (Original Log 
> Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros":
> 1619798558703277, "job": 3, "event": "compaction_finished",  
> "compaction_time_micros": 15085, 	"compaction_time_cpu_micros":  
> 11937, "output_level": 6, "num_output_files": 1,
> "total_output_size": 12627499, "num_input_records": 7670,
> "num_output_records": 911, "num_subcompactions": 1,
> "output_compression": "NoCompression",  
> "num_single_delete_mismatches": 0, 	"num_single_delete_fallthrough":  
> 0, "lsm_state": [5, 0, 0, 0, 0, 0, 2]}
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -47>  
> 2021-04-30T16:02:38.702+0000 7f5d1095b700  2 rocksdb:  
> [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background 
> compaction error: Corruption: block checksum mismatch: expected 
> 395334538, got 4289108204  in
> 	/var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734 size 
> 84730, Accumulated background error counts: 1
> 	Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug    -46>  
> 2021-04-30T16:02:38.702+0000 7f5d21332700  5 asok(0x56327d226000) 
> register_command smart hook 0x56327e028700
>
>
> This is running the latest pacific container, but I was seeing the 
> same issue in octopus.
>
> The container runs under podman on rhel 8, and the  
> /var/lib/ceph/mon/ceph-cube is mapped to   
> /var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867/mon.cube.service
> on the nvme  boot drive, which has plenty of space.
>
> To recover I run a script that will stop the monitor on another host, 
> copy the store.db directory then start up, and it syncs right up.
>
>
>
> Thanks,
> Rob
>
>
>
>
>
> -----Original Message-----
> From: Sebastian Wagner <sewagner@xxxxxxxxxx>
> Sent: Thursday, April 29, 2021 7:44 AM
> To: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx
> Subject:  Re: one of 3 monitors keeps going down
>
> Right, here are the docs for that workflow:
>
> https://docs.ceph.com/en/latest/cephadm/mon/#mon-service
>
> Am 29.04.21 um 13:13 schrieb Eugen Block:
>> Hi,
>>
>> instead of copying MON data to this one did you also try to redeploy 
>> the MON container entirely so it gets a fresh start?
>>
>>
>> Zitat von "Robert W. Eckert" <rob@xxxxxxxxxxxxxxx>:
>>
>>> Hi,
>>> On a daily basis, one of my monitors goes down
>>>
>>> [root@cube ~]# ceph health detail
>>> HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum 
>>> rhel1.robeckert.us,story [WRN] CEPHADM_FAILED_DAEMON: 1 failed 
>>> cephadm daemon(s)
>>>     daemon mon.cube on cube.robeckert.us is in error state [WRN]
>>> MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story
>>>     mon.cube (rank 2) addr
>>> [v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of
>>> quorum) [root@cube ~]# ceph --version ceph version 15.2.11
>>> (e3523634d9c2227df9af89a4eac33d16738c49cb)
>>> octopus (stable)
>>>
>>> I have a script that will copy the mon data from another server and 
>>> it restarts and runs well for a while.
>>>
>>> It is always the same monitor, and when I look at the logs the only 
>>> thing I really see is the cephadm log showing it down
>>>
>>> 2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman 
>>> --version
>>> 2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version
>>> 2.2.1
>>> 2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman 
>>> inspect --format 
>>> {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index
>>> .Config.Labels "io.ceph.version"}}
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2
>>> 2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout 
>>> fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,doc
>>> k 
>>> er.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79d
>>> a
>>> 452188daf2af72e,2021-04-26
>>> 17:13:15.54183375 -0400 EDT,
>>> 2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled 
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx<mailto:ceph-fe3a7
>>> c b0-69ca-11eb-8d45-c86000d08867@xxxxxxxx>
>>>
>>> 2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled
>>> 2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active 
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867@xxxxxxxx<mailto:ceph-fe3a7
>>> c b0-69ca-11eb-8d45-c86000d08867@xxxxxxxx>
>>>
>>> 2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed
>>> 2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman 
>>> --version
>>> 2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version
>>> 2.2.1
>>> 2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman 
>>> inspect --format 
>>> {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index
>>> .Config.Labels "io.ceph.version"}}
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube
>>> 2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout 
>>> 04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,doc
>>> k 
>>> er.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79d
>>> a
>>> 452188daf2af72e,2021-04-28
>>> 08:54:57.614847512 -0400 EDT,
>>>
>>> I don't know if it matters, but this  server is an AMD 3600XT while 
>>> my other two servers which have had no issues are intel based.
>>>
>>> The root file system was originally on a SSD, and I switched to 
>>> NVME, so I eliminated controller or drive issues.  (I didn't see 
>>> anything in dmesg anyway)
>>>
>>> If someone could point me in the right direction on where to 
>>> troubleshoot next, I would appreciate it.
>>>
>>> Thanks,
>>> Rob Eckert
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an 
>>> email to ceph-users-leave@xxxxxxx
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an 
>> email to ceph-users-leave@xxxxxxx
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an 
> email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx