Help needed understanding GlusterFS logs and debugging Elasticsearch failures

Hi,

I have been trying to use GlusterFS as the backend filesystem for storing
Elasticsearch indices on a GlusterFS mount. As far as I can tell, the workload
looks like this: the Lucene engine performs a lot of renames on the index
files, and multiple threads read from the same file concurrently. While
writing an index, Elasticsearch/Lucene complains of index corruption, the
health of the cluster goes red, and all operations on the index fail
thereafter.

===================
[2015-12-10 02:43:45,614][WARN ][index.engine             ] [client-2] [logstash-2015.12.09][3] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=0 actual=6d811d06 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/gluster2/rhs/nodes/0/indices/logstash-2015.12.09/3/index/_a7.cfs") [slice=_a7_Lucene50_0.doc]))
	at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$1.doRun(InternalEngine.java:1233)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=0 actual=6d811d06 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/gluster2/rhs/nodes/0/indices/logstash-2015.12.09/3/index/_a7.cfs") [slice=_a7_Lucene50_0.doc]))
=====================

The server logs do not contain anything relevant.
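For what it's worth, the rename-plus-concurrent-read pattern can be exercised
directly with a small script like the one below. This is only a rough sketch of
what I think Lucene is doing, not the actual workload; MNT is a hypothetical
variable, and the /tmp/renametest default is a local stand-in — pointing it at
a directory on the Gluster mount is the interesting case:

```shell
#!/bin/sh
# Rough rename + concurrent-read stress sketch.
# MNT is a hypothetical path: point it at a directory on the Gluster
# mount to test there; it falls back to a local directory otherwise.
MNT=${MNT:-/tmp/renametest}
mkdir -p "$MNT"
fail=0
i=1
while [ "$i" -le 50 ]; do
    # write a small "segment" file, roughly as Lucene would
    dd if=/dev/urandom of="$MNT/seg.$i.tmp" bs=4096 count=4 2>/dev/null
    before=$(cksum < "$MNT/seg.$i.tmp")
    # read the file in the background while it is renamed into place
    cat "$MNT/seg.$i.tmp" > /dev/null 2>&1 &
    mv "$MNT/seg.$i.tmp" "$MNT/seg.$i"
    wait
    after=$(cksum < "$MNT/seg.$i")
    [ "$before" = "$after" ] || { echo "checksum mismatch on seg.$i"; fail=1; }
    i=$((i + 1))
done
[ "$fail" -eq 0 ] && echo "no mismatches"
```

On a plain local filesystem this should run clean; a mismatch on the Gluster
mount would at least narrow things down to the rename path.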
The client logs are full of messages like:

[2015-12-03 18:44:17.882032] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-61881676454442626.tlog (hash=esearch-replicate-0/cache=esearch-replicate-0) => /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-311.ckp (hash=esearch-replicate-1/cache=<nul>)
[2015-12-03 18:45:31.276316] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-2384654015514619399.tlog (hash=esearch-replicate-0/cache=esearch-replicate-0) => /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-312.ckp (hash=esearch-replicate-0/cache=<nul>)
[2015-12-03 18:45:31.587660] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-4957943728738197940.tlog (hash=esearch-replicate-0/cache=esearch-replicate-0) => /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-312.ckp (hash=esearch-replicate-0/cache=<nul>)
[2015-12-03 18:46:48.424605] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-1731620600607498012.tlog (hash=esearch-replicate-1/cache=esearch-replicate-1) => /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-313.ckp (hash=esearch-replicate-1/cache=<nul>)
[2015-12-03 18:46:48.466558] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-5214949393126318982.tlog (hash=esearch-replicate-1/cache=esearch-replicate-1) => /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-313.ckp (hash=esearch-replicate-1/cache=<nul>)
[2015-12-03 18:48:06.314138] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-9110755229226773921.tlog (hash=esearch-replicate-0/cache=esearch-replicate-0) => /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-314.ckp (hash=esearch-replicate-1/cache=<nul>)
[2015-12-03 18:48:06.332919] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-5193443717817038271.tlog (hash=esearch-replicate-1/cache=esearch-replicate-1) => /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-314.ckp (hash=esearch-replicate-1/cache=<nul>)
[2015-12-03 18:49:24.694263] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] 0-esearch-dht: renaming /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-2750483795035758522.tlog (hash=esearch-replicate-1/cache=esearch-replicate-1) => /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-315.ckp (hash=esearch-replicate-0/cache=<nul>)

==============================================================

The same setup works well on any of the on-disk filesystems. This is a
2 x 2 distributed-replicate setup:

# gluster vol info

Volume Name: esearch
Type: Distributed-Replicate
Volume ID: 4e4b205e-28ed-4f9e-9fa4-0d020428dede
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp,rdma
Bricks:
Brick1: 10.70.47.171:/gluster/brick1
Brick2: 10.70.47.187:/gluster/brick1
Brick3: 10.70.47.121:/gluster/brick1
Brick4: 10.70.47.172:/gluster/brick1
Options Reconfigured:
performance.read-ahead: off
performance.write-behind: off

I need a little help understanding these failures. Let me know if you need
further information on the setup, or access to the system to debug further.
I've attached the debug logs for investigation.

-sac
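P.S. To isolate things further on my side, I plan to switch off the remaining
client-side performance translators and raise the client log level before the
next run. A sketch of the commands I have in mind — the particular set of
translators to disable is just my guess at the likely suspects:

```shell
# Disable the remaining client-side caching/perf translators on the volume
gluster volume set esearch performance.quick-read off
gluster volume set esearch performance.io-cache off
gluster volume set esearch performance.stat-prefetch off
gluster volume set esearch performance.open-behind off
# More verbose client-side logging for the next reproduction
gluster volume set esearch diagnostics.client-log-level DEBUG
```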

Attachment: mnt-gluster.log.bz2
Description: BZip2 compressed data

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
