Hi there,
I replaced the SSD on the problematic node with a new one and reconfigured the OSDs and the MON service on it, but the same error showed up again:

"rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2"
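For reference, the OSDs on this node were recreated with ceph-volume, roughly like this (a sketch only; the VG/LV names follow the layout described further down in the thread, and the exact invocation may have differed):

# OSD.1: data LV on the first HDD, DB/WAL LVs on the SSD volume group
ceph-volume lvm create --bluestore --data data-a/data-a --block.db vg0/db-a --block.wal vg0/wal-a
# OSD.2: same pattern on the second HDD
ceph-volume lvm create --bluestore --data data-b/data-b --block.db vg0/db-b --block.wal vg0/wal-b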
On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <behnam.loghmani@xxxxxxxxx> wrote:
Hi Caspar,

I checked the filesystem and there aren't any errors on it. The disk is an SSD, it doesn't expose any wear-level attribute in smartctl, and the filesystem is mounted with the default options, so no discard.
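This is roughly how I checked the drive (a sketch only; /dev/sdX stands in for the actual SSD device):

# overall SMART health status plus the device error log
smartctl -H -l error /dev/sdX
# full attribute table; nothing wear-level related shows up for this SSD
smartctl -A /dev/sdX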
My Ceph structure on this node is like this:

- services on the node: OSD, MON, RGW
- 1 SSD for the OS and WAL/DB
- 2 HDDs

The OSDs are created by ceph-volume lvm. The whole SSD is in one VG:

- the OS is on the root LV
- OSD.1 DB is on db-a
- OSD.1 WAL is on wal-a
- OSD.2 DB is on db-b
- OSD.2 WAL is on wal-b

Output of lvs:
  LV     VG     Attr
  data-a data-a -wi-a-----
  data-b data-b -wi-a-----
  db-a   vg0    -wi-a-----
  db-b   vg0    -wi-a-----
  root   vg0    -wi-ao----
  wal-a  vg0    -wi-a-----
  wal-b  vg0    -wi-a-----

After making heavy writes on the radosgw, OSD.1 and OSD.2 stopped with the "block checksum mismatch" error. Now the MON and OSD services on this node have stopped working with this error. I think my issue is related to this bug: http://tracker.ceph.com/issues/22102

I ran:
# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1

but it returns the same error:
*** Caught signal (Aborted) **
in thread 7fbf6c923d00 thread_name:ceph-bluestore-
2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block checksum mismatch
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (()+0x3eb0b1) [0x55f779e6e0b1]
2: (()+0xf5e0) [0x7fbf61ae15e0]
3: (gsignal()+0x37) [0x7fbf604d31f7]
4: (abort()+0x148) [0x7fbf604d48e8]
5: (RocksDBStore::get(std::string const&, char const*, unsigned long, ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) [0x55f779cd8f75]
7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
8: (main()+0xde0) [0x55f779baab90]
9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
10: (()+0x1bc59f) [0x55f779c3f59f]
2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal (Aborted) **
in thread 7fbf6c923d00 thread_name:ceph-bluestore-
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (()+0x3eb0b1) [0x55f779e6e0b1]
2: (()+0xf5e0) [0x7fbf61ae15e0]
3: (gsignal()+0x37) [0x7fbf604d31f7]
4: (abort()+0x148) [0x7fbf604d48e8]
5: (RocksDBStore::get(std::string const&, char const*, unsigned long, ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) [0x55f779cd8f75]
7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
8: (main()+0xde0) [0x55f779baab90]
9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
10: (()+0x1bc59f) [0x55f779c3f59f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block checksum mismatch
0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal (Aborted) **
in thread 7fbf6c923d00 thread_name:ceph-bluestore-
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (()+0x3eb0b1) [0x55f779e6e0b1]
2: (()+0xf5e0) [0x7fbf61ae15e0]
3: (gsignal()+0x37) [0x7fbf604d31f7]
4: (abort()+0x148) [0x7fbf604d48e8]
5: (RocksDBStore::get(std::string const&, char const*, unsigned long, ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) [0x55f779cd8f75]
7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
8: (main()+0xde0) [0x55f779baab90]
9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
10: (()+0x1bc59f) [0x55f779c3f59f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Could you please help me recover this node, or find a way to prove that the SSD disk has a problem?

Best regards,
Behnam Loghmani

On Mon, Feb 19, 2018 at 1:35 PM, Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:

Hi Behnam,

I would firstly recommend running a filesystem check on the monitor disk to see if there are any inconsistencies. Is the disk the monitor is running on a spinning disk or an SSD? If it is an SSD, you should check the wear-level stats through smartctl. Maybe trim (discard) is enabled on the filesystem mount? (discard could cause problems/corruption in combination with certain SSD firmwares)

Caspar

2018-02-16 23:03 GMT+01:00 Behnam Loghmani <behnam.loghmani@xxxxxxxxx>:

I checked the disk the monitor is on with smartctl; it didn't return any errors and it doesn't have any Current_Pending_Sector. Do you recommend any disk checks to make sure that this disk really has a problem, so that I can send the report to the provider and have the disk replaced?

On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

The disk that the monitor is on...there isn't anything for you to configure about a monitor WAL though, so I'm not sure how that enters into it?

On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <behnam.loghmani@xxxxxxxxx> wrote:

Thanks for your reply. Do you mean that the problem is with the disk I use for WAL and DB?

On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <behnam.loghmani@xxxxxxxxx> wrote:

It is a testing cluster and I set it up 2 weeks ago.

Hi there,

I have a Ceph cluster version 12.2.2 on CentOS 7. After some days, I saw that one of the three mons had stopped (out of quorum) and I can't start it anymore. I checked the mon service log and the output shows this error:
"""
mon.XXXXXX@-1(probing) e4 preinit clean up potentially inconsistent store state
rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch
"""

This bit is the important one. Your disk is bad and it's feeding back corrupted data.

"""
code = 2 Rocksdb transaction:
0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void MonitorDBStore::clear(std::set<std::basic_string<char> >&)' thread 7f45a1e52e40 time 2018-02-16 17:37:07.040846
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h: 581: FAILED assert(r >= 0)
"""the only solution I found is to remove this mon from quorum and remove all mon data and re-add this mon to quorum again.and ceph goes to the healthy status again.
But now, after some days, this mon has stopped again and I am facing the same problem.

My cluster setup is:

- 4 OSD hosts
- 8 OSDs in total
- 3 mons
- 1 rgw

This cluster was set up with ceph-volume lvm and WAL/DB separation on logical volumes.

Best regards,
Behnam Loghmani
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com