Hi Behnam,
I would first recommend running a filesystem check on the monitor's disk to see if there are any inconsistencies.
Is the disk the monitor runs on a spinning disk or an SSD?
If it is an SSD, you should check the wear-level stats through smartctl.
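As a rough illustration of that wear check, the sketch below pulls wear-related rows out of `smartctl -A` output. The attribute names in WEAR_ATTRS and the sample text are assumptions for illustration only; the names differ per vendor, so adjust them to whatever your drive actually reports.

```python
# Wear attribute names are vendor-specific; these two are common examples,
# not a complete list (assumption: your drive may use a different name).
WEAR_ATTRS = {"Wear_Leveling_Count", "Media_Wearout_Indicator"}


def wear_attributes(smartctl_output):
    """Return {attribute_name: (normalized_value, threshold)} for wear rows."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have the columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WEAR_ATTRS:
            found[fields[1]] = (int(fields[3]), int(fields[5]))
    return found


# Hypothetical sample in the table layout printed by `smartctl -A <device>`.
SMART_SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0013   097   097   000    Pre-fail  Always       -       58
"""

print(wear_attributes(SMART_SAMPLE))
```

A normalized value that is approaching the threshold would be a reason to suspect the SSD.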
Is trim (discard) perhaps enabled on the filesystem mount? (discard can cause problems or corruption in combination with certain SSD firmwares)
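One way to check for the discard option is to look at the mount options listed in /proc/mounts. A minimal sketch, assuming a Linux host; the sample devices and mount points are made up for illustration:

```python
def mounts_with_discard(mounts_text):
    """Return the mount points whose option list contains 'discard'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and "discard" in fields[3].split(","):
            hits.append(fields[1])
    return hits


# Hypothetical sample in /proc/mounts format.
MOUNTS_SAMPLE = """\
/dev/sda1 / xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/sdb1 /var/lib/ceph/mon xfs rw,relatime,discard,attr2 0 0
"""

print(mounts_with_discard(MOUNTS_SAMPLE))

# On a live host you could pass the real file contents instead:
#   mounts_with_discard(open("/proc/mounts").read())
```

If the monitor's mount point shows up in the result, remounting without discard is an easy thing to try.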
Caspar
2018-02-16 23:03 GMT+01:00 Behnam Loghmani <behnam.loghmani@xxxxxxxxx>:
I checked the disk that the monitor is on with smartctl. It didn't return any errors and it doesn't have any Current_Pending_Sector.
Do you recommend any disk checks to confirm that this disk has a problem, so that I can send the report to the provider to get the disk replaced?

On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
The disk that the monitor is on... there isn't anything for you to configure about a monitor WAL though, so I'm not sure how that enters into it?

On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <behnam.loghmani@xxxxxxxxx> wrote:
Thanks for your reply.
Do you mean that the problem is with the disk I use for WAL and DB?

On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <behnam.loghmani@xxxxxxxxx> wrote:
It is a testing cluster and I set it up 2 weeks ago.

Hi there,
I have a Ceph cluster, version 12.2.2, on CentOS 7.
After some days, I saw that one of the three mons had stopped (out of quorum) and I can't start it anymore.
I checked the mon service log and the output shows this error:
"""
mon.XXXXXX@-1(probing) e4 preinit clean up potentially inconsistent store state
rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch
This bit is the important one. Your disk is bad and it’s feeding back corrupted data.
code = 2 Rocksdb transaction:
0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void MonitorDBStore::clear(std::set<std::basic_string<char> >&)' thread 7f45a1e52e40 time 2018-02-16 17:37:07.040846
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h: 581: FAILED assert(r >= 0)
"""
The only solution I found is to remove this mon from the quorum, remove all of its mon data, and re-add it to the quorum. After that, Ceph goes back to a healthy status.
But now, after some days, this mon has stopped and I face the same problem again.

My cluster setup is:
4 osd hosts
total 8 osds
3 mons
1 rgw

This cluster was set up with ceph-volume lvm, with wal/db separation on logical volumes.

Best regards,
Behnam Loghmani
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com