Hi all,

The latest hint I received (thanks!) was to replace failing hardware. Before that, I updated the BIOS, which included a CPU microcode fix for Meltdown/Spectre and probably other things. Last time I had checked, the vendor didn't have that fix yet. Since this update, no CATERR has happened... This Intel microcode + vendor BIOS update may have mitigated the problem, and postpones the hardware replacement...

On Tuesday, 24 July 2018 at 12:18 +0200, Nicolas Huillard wrote:
> Hi all,
> 
> The same server did it again with the same CATERR, exactly 3 days after rebooting (+/- 30 seconds).
> If it weren't for the exact +3 days, I would think it's a random event. But exactly 3 days after reboot does not seem random.
> 
> Nothing I added got me more information (mcelog, pstore, BMC video record, etc.)...
> 
> Thanks in advance for any hint ;-)
> 
> On Saturday, 21 July 2018 at 10:31 +0200, Nicolas Huillard wrote:
> > Hi all,
> > 
> > One of my servers silently shut down last night, with no explanation whatsoever in any logs. According to the existing logs, the shutdown (without reboot) happened between 03:58:20.061452 (last timestamp from /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON election called, for which oxygene didn't answer).
> > 
> > Is there any way in which Ceph could silently shut down a server?
> > Can a SMART self-test influence scrubbing or compaction?
> > 
> > The only thing I have is that smartd started a long self-test on both OSD spinning drives on that host:
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting scheduled Long Self-Test.
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-test in progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous self-test completed without error
> > 
> > ...and smartctl now says that the self-tests didn't finish (on both drives):
> > # 1  Extended offline    Interrupted (host reset)    00%    10636    -
> > 
> > MON logs on oxygene talk about RocksDB compaction a few minutes before the shutdown, and a deep-scrub finished earlier:
> > /var/log/ceph/ceph-osd.6.log
> > 2018-07-21 03:32:54.086021 7fd15d82c700 0 log_channel(cluster) log [DBG] : 6.1d deep-scrub starts
> > 2018-07-21 03:34:31.185549 7fd15d82c700 0 log_channel(cluster) log [DBG] : 6.1d deep-scrub ok
> > 2018-07-21 03:43:36.720707 7fd178082700 0 -- 172.22.0.16:6801/478362 >> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
> > 
> > /var/log/ceph/ceph-mgr.oxygene.log
> > 2018-07-21 03:58:16.060137 7fbcd3777700 1 mgr send_beacon standby
> > 2018-07-21 03:58:18.060733 7fbcd3777700 1 mgr send_beacon standby
> > 2018-07-21 03:58:20.061452 7fbcd3777700 1 mgr send_beacon standby
> > 
> > /var/log/ceph/ceph-mon.oxygene.log
> > 2018-07-21 03:52:27.702314 7f25b5406700 4 rocksdb: (Original Log Time 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual compaction from level-0 to level-1 from 'mgrstat .. '
> > 2018-07-21 03:52:27.702321 7f25b5406700 4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
> > 2018-07-21 03:52:27.702329 7f25b5406700 4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction start summary: Base version 1745 Base level 0, inputs: [149507(602KB)], [149505(13MB)]
> > 2018-07-21 03:52:27.702348 7f25b5406700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947702334, "job": 1746, "event": "compaction_started", "files_L0": [149507], "files_L1": [149505], "score": -1, "input_data_size": 14916379}
> > 2018-07-21 03:52:27.785532 7f25b5406700 4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746] Generated table #149508: 4904 keys, 14808953 bytes
> > 2018-07-21 03:52:27.785587 7f25b5406700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746, "event": "table_file_creation", "file_number": 149508, "file_size": 14808953, "table_properties": {"data
> > 2018-07-21 03:52:27.785627 7f25b5406700 4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1746] Compacted 1@0 + 1@1 files to L1 => 14808953 bytes
> > 2018-07-21 03:52:27.785656 7f25b5406700 3 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
> > 2018-07-21 03:52:27.791640 7f25b5406700 4 rocksdb: (Original Log Time 2018/07/21-03:52:27.791526) [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
> > 2018-07-21 03:52:27.791657 7f25b5406700 4 rocksdb: (Original Log Time 2018/07/21-03:52:27.791563) EVENT_LOG_v1 {"time_micros": 1532137947791548, "job": 1746, "event": "compaction_finished", "compaction_time_micros": 83261, "output_level"
> > 2018-07-21 03:52:27.792024 7f25b5406700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947792019, "job": 1746, "event": "table_file_deletion", "file_number": 149507}
> > 2018-07-21 03:52:27.796596 7f25b5406700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532137947796592, "job": 1746, "event": "table_file_deletion", "file_number": 149505}
> > 2018-07-21 03:52:27.796690 7f25b6408700 4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:839] [default] Manual compaction starting
> > ...
> > 2018-07-21 03:53:33.404428 7f25b5406700 4 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1748] Compacted 1@0 + 1@1 files to L1 => 14274825 bytes
> > 2018-07-21 03:53:33.404460 7f25b5406700 3 rocksdb: [/build/ceph-12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
> > 2018-07-21 03:53:33.408360 7f25b5406700 4 rocksdb: (Original Log Time 2018/07/21-03:53:33.408228) [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
> > 2018-07-21 03:53:33.408381 7f25b5406700 4 rocksdb: (Original Log Time 2018/07/21-03:53:33.408275) EVENT_LOG_v1 {"time_micros": 1532138013408255, "job": 1748, "event": "compaction_finished", "compaction_time_micros": 84964, "output_level"
> > 2018-07-21 03:53:33.408647 7f25b5406700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532138013408641, "job": 1748, "event": "table_file_deletion", "file_number": 149510}
> > 2018-07-21 03:53:33.413854 7f25b5406700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1532138013413849, "job": 1748, "event": "table_file_deletion", "file_number": 149508}
> > 2018-07-21 03:54:27.634782 7f25bdc17700 0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> > 2018-07-21 03:55:27.635318 7f25bdc17700 0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> > 2018-07-21 03:56:27.635923 7f25bdc17700 0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> > 2018-07-21 03:57:27.636464 7f25bdc17700 0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> > 
> > I can see no evidence of intrusion or anything else (network or physical).
> > I'm not even sure it was a shutdown rather than a hard reset, but there is no evidence of any fsck replaying a journal during reboot either.
> > The server restarted without problem and the cluster is now HEALTH_OK.
> > 
> > Hardware:
> > * ASRock Rack mobos (the BMC/IPMI may have reset the server for no reason)
> > * Western Digital ST4000VN008 OSD drives

--
Nicolas Huillard
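
For anyone following up on a similar silent reset, here is a minimal sketch of the checks implied above. It assumes smartmontools and ipmitool are installed and uses /dev/sda only as an example device; adapt to your own hosts and drives:

    # Confirm which microcode revision the kernel actually loaded after the BIOS update
    dmesg | grep -i microcode
    grep -m1 microcode /proc/cpuinfo

    # Check whether the scheduled long SMART self-tests completed or were interrupted
    smartctl -l selftest /dev/sda

    # Look for CATERR / machine-check entries in the BMC system event log
    ipmitool sel list

    # Distinguish a clean shutdown from a hard reset in the wtmp history
    last -x shutdown reboot | head

None of these commands are specific to Ceph; they only help decide whether the box was reset by the hardware/BMC or shut down cleanly by software.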