On 23.07.2018 11:39, Nicolas Huillard wrote:
> On Monday, 23 July 2018 at 10:28 +0200, Caspar Smit wrote:
>> Do you have any hardware watchdog running in the system? A watchdog
>> could trigger a powerdown if it hits some threshold. Any event logs
>> from the chassis itself?
>
> Nice suggestions ;-)
>
> I see several [watchdog/N] kernel threads and one [watchdogd] kernel
> thread, along with a "kernel: [    0.116002] NMI watchdog: enabled on
> all CPUs, permanently consumes one hw-PMU counter." line in the
> kernel log, but no user-land watchdog daemon: I'm not sure whether
> the watchdog is actually active.
>
> There ARE chassis/BMC/IPMI-level events, one of which is "CPU CATERR
> Fault", with a timestamp matching the timestamps below, and no
> further information.

If this kind of failure (or a less severe one) also happens at
runtime, mcelog should catch it. For CATERR errors, we also found that
the BMC's web interface sometimes shows more information for an event
log entry than querying the event log via ipmitool does - you may want
to check this.

> If I understand correctly, this is a signal emitted by the CPU to the
> BMC upon a "catastrophic error" (more severe than "fatal"), which the
> BMC must respond to in whatever way it chooses; Intel's suggestions
> include resetting the chassis.
>
> https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/platform-level-error-strategies-paper.pdf
>
> Does that mean that the hardware is failing, or did a neutrino just
> cross some CPU register?
> The CPU is a Xeon D-1521 with ECC memory.
>
>> Kind regards,
>
> Many thanks!
>
>> Caspar
>>
>> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard <nhuillard@xxxxxxxxxxx>:
>>
>>> Hi all,
>>>
>>> One of my servers silently shut down last night, with no
>>> explanation whatsoever in any logs.
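For reference, the checks suggested above can be run roughly as follows
(a sketch: the invocations are standard procfs/ipmitool/mcelog ones,
but whether each tool is installed and what it reports depends on the
distro and the BMC):

```shell
# Is the kernel NMI watchdog enabled? (1 = enabled, 0 = disabled)
cat /proc/sys/kernel/nmi_watchdog 2>/dev/null \
  || echo "nmi_watchdog sysctl not present"

# Dump the BMC System Event Log with human-readable timestamps;
# a "CPU CATERR Fault" entry here should carry a matching timestamp.
command -v ipmitool >/dev/null && ipmitool sel elist \
  || echo "ipmitool not installed or no BMC access"

# Machine-check events caught at runtime by the mcelog daemon, if any.
command -v mcelog >/dev/null && mcelog --client \
  || echo "mcelog not installed or daemon not running"
```

Each line is guarded so the checks degrade gracefully on hosts where
the tool or interface is missing.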
>>> According to the existing logs, the shutdown (without reboot)
>>> happened between 03:58:20.061452 (the last timestamp in
>>> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (a new MON
>>> election was called, which oxygene didn't answer).
>>>
>>> Is there any way in which Ceph could silently shut down a server?
>>> Can a SMART self-test influence scrubbing or compaction?
>>>
>>> The only thing I have is that smartd started a long self-test on
>>> both OSD spinning drives on that host:
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting scheduled Long Self-Test.
>>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting scheduled Long Self-Test.
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-test in progress, 90% remaining
>>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous self-test completed without error
>>>
>>> ...and smartctl now says that the self-tests didn't finish (on both
>>> drives):
>>> # 1  Extended offline    Interrupted (host reset)    00%     10636   -
>>>
>>> MON logs on oxygene talk about RocksDB compaction a few minutes
>>> before the shutdown, and a deep-scrub finished earlier:
>>> /var/log/ceph/ceph-osd.6.log
>>> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub starts
>>> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub ok
>>> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362 >> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
>>>
>>> /var/log/ceph/ceph-mgr.oxygene.log
>>> 2018-07-21 03:58:16.060137 7fbcd3777700  1 mgr send_beacon standby
>>> 2018-07-21 03:58:18.060733 7fbcd3777700  1 mgr send_beacon standby
>>> 2018-07-21 03:58:20.061452 7fbcd3777700  1 mgr send_beacon standby
>>>
>>> /var/log/ceph/ceph-mon.oxygene.log
>>> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log Time 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual compaction from level-0 to level-1 from 'mgrstat .. '
>>> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
>>> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction start summary: Base version 1745 Base level 0, inputs: [149507(602KB)], [149505(13MB)]
>>> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947702334, "job": 1746, "event": "compaction_started", "files_L0": [149507], "files_L1": [149505], "score": -1, "input_data_size": 14916379}
>>> 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746] Generated table #149508: 4904 keys, 14808953 bytes
>>> 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746, "event": "table_file_creation", "file_number": 149508, "file_size": 14808953, "table_properties": {"data
>>> 2018-07-21 03:52:27.785627 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1746] Compacted 1@0 + 1@1 files to L1 => 14808953 bytes
>>> 2018-07-21 03:52:27.785656 7f25b5406700  3 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
>>> 2018-07-21 03:52:27.791640 7f25b5406700  4 rocksdb: (Original Log Time 2018/07/21-03:52:27.791526) [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
>>> 2018-07-21 03:52:27.791657 7f25b5406700  4 rocksdb: (Original Log Time 2018/07/21-03:52:27.791563) EVENT_LOG_v1 {"time_micros": 1532137947791548, "job": 1746, "event": "compaction_finished", "compaction_time_micros": 83261, "output_level"
>>> 2018-07-21 03:52:27.792024 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947792019, "job": 1746, "event": "table_file_deletion", "file_number": 149507}
>>> 2018-07-21 03:52:27.796596 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947796592, "job": 1746, "event": "table_file_deletion", "file_number": 149505}
>>> 2018-07-21 03:52:27.796690 7f25b6408700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:839] [default] Manual compaction starting
>>> ...
>>> 2018-07-21 03:53:33.404428 7f25b5406700  4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1748] Compacted 1@0 + 1@1 files to L1 => 14274825 bytes
>>> 2018-07-21 03:53:33.404460 7f25b5406700  3 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
>>> 2018-07-21 03:53:33.408360 7f25b5406700  4 rocksdb: (Original Log Time 2018/07/21-03:53:33.408228) [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
>>> 2018-07-21 03:53:33.408381 7f25b5406700  4 rocksdb: (Original Log Time 2018/07/21-03:53:33.408275) EVENT_LOG_v1 {"time_micros": 1532138013408255, "job": 1748, "event": "compaction_finished", "compaction_time_micros": 84964, "output_level"
>>> 2018-07-21 03:53:33.408647 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532138013408641, "job": 1748, "event": "table_file_deletion", "file_number": 149510}
>>> 2018-07-21 03:53:33.413854 7f25b5406700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532138013413849, "job": 1748, "event": "table_file_deletion", "file_number": 149508}
>>> 2018-07-21 03:54:27.634782 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
>>> 2018-07-21 03:55:27.635318 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
>>> 2018-07-21 03:56:27.635923 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
>>> 2018-07-21 03:57:27.636464 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
>>>
>>> I can see no evidence of intrusion or anything (network or
>>> physical). I'm not even sure it was a shutdown rather than a hard
>>> reset, but there's no evidence of any fsck replaying a journal
>>> during reboot either. The server restarted without problem and the
>>> cluster is now HEALTH_OK.
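As a side note, the interrupted self-tests mentioned earlier can be
re-checked per drive with smartctl (a sketch; the device names are the
ones from the smartd log and will differ on other hosts):

```shell
# Print each drive's SMART self-test log; an "Interrupted (host reset)"
# status means the test was cut short by a reset or power cycle, which
# helps bracket the time of the event.
for dev in /dev/sda /dev/sdb /dev/sdc; do
  if command -v smartctl >/dev/null; then
    smartctl -l selftest "$dev" || echo "could not query $dev"
  else
    echo "smartctl not installed (skipping $dev)"
  fi
done
```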
>>>
>>> Hardware:
>>> * ASRock Rack mobos (the BMC/IPMI may have reset the server for no
>>>   reason)
>>> * Seagate ST4000VN008 OSD drives
>>>
>>> --
>>> Nicolas Huillard
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
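To help tell a clean shutdown from a hard reset after the fact, two
generic checks (a sketch; availability of `last -x` and `journalctl`
depends on the distro):

```shell
# wtmp records: a clean shutdown writes a "shutdown" entry before the
# next "reboot"; after a hard reset only the "reboot" entry appears.
command -v last >/dev/null && last -x shutdown reboot | head -n 10 \
  || echo "last not available"

# With systemd, the tail of the previous boot's journal either ends
# with an orderly shutdown sequence or just stops mid-stream.
command -v journalctl >/dev/null \
  && journalctl -b -1 -n 20 --no-pager 2>/dev/null \
  || echo "journalctl not available or no previous boot recorded"
```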