Issue with proactive self-healing for erasure coding

Hi Glusterfs Experts,

We are testing the glusterfs 3.7.0 tarball on our 10-node glusterfs cluster. Each node has 36 drives; please find the volume info below:

Volume Name: vaulttest5
Type: Distributed-Disperse
Volume ID: 68e082a6-9819-4885-856c-1510cd201bd9
Status: Started
Number of Bricks: 36 x (8 + 2) = 360
Transport-type: tcp
Bricks:
Brick1: 10.1.2.1:/media/disk1
Brick2: 10.1.2.2:/media/disk1
Brick3: 10.1.2.3:/media/disk1
Brick4: 10.1.2.4:/media/disk1
Brick5: 10.1.2.5:/media/disk1
Brick6: 10.1.2.6:/media/disk1
Brick7: 10.1.2.7:/media/disk1
Brick8: 10.1.2.8:/media/disk1
Brick9: 10.1.2.9:/media/disk1
Brick10: 10.1.2.10:/media/disk1
Brick11: 10.1.2.1:/media/disk2
Brick12: 10.1.2.2:/media/disk2
Brick13: 10.1.2.3:/media/disk2
Brick14: 10.1.2.4:/media/disk2
Brick15: 10.1.2.5:/media/disk2
Brick16: 10.1.2.6:/media/disk2
Brick17: 10.1.2.7:/media/disk2
Brick18: 10.1.2.8:/media/disk2
Brick19: 10.1.2.9:/media/disk2
Brick20: 10.1.2.10:/media/disk2
...
....
Brick351: 10.1.2.1:/media/disk36
Brick352: 10.1.2.2:/media/disk36
Brick353: 10.1.2.3:/media/disk36
Brick354: 10.1.2.4:/media/disk36
Brick355: 10.1.2.5:/media/disk36
Brick356: 10.1.2.6:/media/disk36
Brick357: 10.1.2.7:/media/disk36
Brick358: 10.1.2.8:/media/disk36
Brick359: 10.1.2.9:/media/disk36
Brick360: 10.1.2.10:/media/disk36
Options Reconfigured:
performance.readdir-ahead: on
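
For completeness, a volume with this layout is created with the standard distributed-disperse syntax, roughly as follows (a sketch; bricks abbreviated as in the listing above):

# gluster volume create vaulttest5 disperse 10 redundancy 2 \
      10.1.2.1:/media/disk1 10.1.2.2:/media/disk1 ... 10.1.2.10:/media/disk1 \
      10.1.2.1:/media/disk2 ... 10.1.2.10:/media/disk36

Listing the bricks in groups of ten, one disk per node, means each 8+2 disperse set spans all 10 nodes, so losing any two nodes should still leave every set readable.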

We did some performance testing and simulated proactive self-healing for erasure coding. The disperse volume was created across nodes, so each 8+2 set spans all 10 nodes.

Description of problem

I disconnected the network on two nodes and wrote some video files; glusterfs wrote the files to the remaining 8 nodes perfectly. I downloaded the uploaded file and it came back intact. Then I re-enabled the network on the two nodes, and the proactive self-healing mechanism appeared to work, rebuilding the missing chunks of data on the re-enabled nodes from the other 8 nodes. But when I then tried to download the same file, I got an Input/output error and couldn't download it. I think there is an issue in proactive self-healing.

We also tried the simulation with a single-node network failure and hit the same I/O error while downloading the file. Heal status was checked with the usual command, as shown below.
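
(Heal-status check, a sketch using the volume name above:)

# gluster volume status vaulttest5
# gluster volume heal vaulttest5 info
(the second command lists entries still pending heal, per brick)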


Error while downloading the file:

root@master02:/home/admin# rsync -r --progress /mnt/gluster/file13_AN ./1/file13_AN-2

sending incremental file list

file13_AN

  3,342,355,597 100%    4.87MB/s    0:10:54 (xfr#1, to-chk=0/1)

rsync: read errors mapping "/mnt/gluster/file13_AN": Input/output error (5)

WARNING: file13_AN failed verification -- update discarded (will try again).

 

 root@master02:/home/admin# cp /mnt/gluster/file13_AN ./1/file13_AN-3

cp: error reading ‘/mnt/gluster/file13_AN’: Input/output error

cp: failed to extend ‘./1/file13_AN-3’: Input/output error
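
If it helps with debugging, we can also try to narrow down where the read fails and what the healed fragments look like on the bricks. A sketch, assuming the file sits at the volume root so its brick path is /media/diskN/file13_AN on each node:

# dd if=/mnt/gluster/file13_AN of=/dev/null bs=1M
(dd should stop and report how much it copied before the Input/output error, giving the failing offset)

# getfattr -d -m . -e hex /media/disk1/file13_AN
(run on each node to compare the trusted.ec.* xattrs of the fragments, e.g. version and size)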


We can't tell whether the issue is with glusterfs 3.7.0 itself or with our glusterfs configuration.

Any help would be greatly appreciated.

--
Cheers
Backer

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
