On 7-8-2018 16:04, Sage Weil wrote:
> On Tue, 7 Aug 2018, Willem Jan Withagen wrote:
>> Hi,
>>
>> On my test-cluster I had some problems, probably due to heartbeat
>> timeout problems that threw OSDs out.
>>
>> But now I have this other problem; the first crash was probably also a
>> suicide timeout, and the OSD does not want to restart:
>>
>> First lots of journal_replay, and then an assert.
>> What is the smart thing to do: try to fix this (if possible), or trash
>> the OSD (or even the cluster) and get on with life?
>>
>> The first option would perhaps be rather educational.
>>
>> --WjW
>>
>>     -8> 2018-08-07 12:20:15.919393 a1e6480  3 journal journal_replay: r
>> = 0, op_seq now 772232
>>     -7> 2018-08-07 12:20:15.919417 a1e6480  2 journal read_entry
>> 5191901184 : seq 772233 1182 bytes
>>     -6> 2018-08-07 12:20:15.919423 a1e6480  3 journal journal_replay:
>> applying op seq 772233
>>     -5> 2018-08-07 12:20:15.919463 a1e6480  3 journal journal_replay: r
>> = 0, op_seq now 772233
>>     -4> 2018-08-07 12:20:15.919486 a1e6480  2 journal read_entry
>> 5191905280 : seq 772234 1131 bytes
>>     -3> 2018-08-07 12:20:15.919491 a1e6480  3 journal journal_replay:
>> applying op seq 772234
>>     -2> 2018-08-07 12:20:15.919526 a1e6480  3 journal journal_replay: r
>> = 0, op_seq now 772234
>>     -1> 2018-08-07 12:20:19.807879 a1e6480 -1 journal
>> FileJournal::wrap_read_bl: safe_read_exact 5191913458~4196316 returned -5
>
> I think if the journal read gets EIO that's a real EIO.  Anything in your
> kernel log?
>
> I would blow away this OSD and move on...

Good suggestion.

# zpool status osd2
  pool: osd2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        osd2                      DEGRADED    88     0     0
          gpt/osd2                ONLINE      88     0     0
        logs
          8012054376424389601     REMOVED      0     0     0  was /dev/gpt/log2
        cache
          16016892930842715982    REMOVED      0     0     0  was /dev/gpt/cacheosd2

errors: 22 data errors, use '-v' for a list

Obviously the SSD under the ZFS volume has died, and since I was also
using a ZFS log (ZIL) on that SSD, that killed (part of) the volume.
But the disk controller also has trouble with the actual spindle that
osd.2 is on, so there is definitely something wrong with this ZFS volume.

Moving on,
Thanx,
--WjW

> sage
>
>>      0> 2018-08-07 12:20:19.808465 a1e6480 -1 *** Caught signal (Abort
>> trap) **
>>  in thread a1e6480 thread_name:
>>
>>  ceph version 12.2.4 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous
>> (stable)
>>  1: <install_standard_sighandlers(void)+0x417> at /usr/local/bin/ceph-osd
>>  2: <pthread_sigmask()+0x536> at /lib/libthr.so.3
>>  3: <pthread_getspecific()+0xe12> at /lib/libthr.so.3
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
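
For completeness, the "blow away this OSD and move on" step on Luminous
boils down to roughly the sequence below. This is only a sketch: the id
osd.2 is taken from the thread above, and the final re-provisioning
command is an example that depends on how the OSD was originally deployed
(ceph-volume shown; on a FreeBSD/ZFS box the provisioning step will differ).

  # mark the dead OSD out so the cluster starts recovering its PGs
  ceph osd out osd.2

  # once `ceph -s` reports all PGs active+clean again, remove the OSD
  # from the CRUSH map, auth database and OSD map in one step (Luminous+)
  ceph osd purge 2 --yes-i-really-mean-it

  # re-provision a replacement OSD on a healthy device, for example:
  # ceph-volume lvm create --data /dev/sdX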