Hello community of Gluster,

Sorry for the long post. TL;DR: stock Gluster 3.3.0 on 5 nodes results in massive data corruption on brick "failure" or peer disconnection.

We are having problems with data corruption on VM volumes, with the VMs running on top of Gluster 3.3.0, when we introduce brick failures and/or node disconnects.

In our setup we have 5 storage nodes, each with 16 AMD Opteron(tm) 6128 cores, 32 GB of RAM and 34 x 2 TB SATA disks. To utilize the storage nodes we have 20 compute nodes, each with 24 AMD Opteron(tm) 6238 cores and 128 GB of RAM.

To test and verify this setup we installed Gluster 3.3.0 on the storage nodes and the GlusterFS 3.3.0 client on the compute nodes. We created one brick per hard drive and built a Distributed-Replicate volume from those bricks with tcp,rdma transport. The volume was mounted with glusterfs over tcp transport over InfiniBand on all the compute nodes.

We created 500 virtual machines on the compute nodes and had them do heavy IO benchmarking on the volume, and Gluster performed as expected. Then we wrote a sanity-test script that creates files, copies them over and over again, md5sums all written data, and md5-checks all of the operating system files. We ran this test on all the VMs successfully. We then ran it again while stopping one storage node for a few minutes and starting it again, and Gluster recovered from that successfully.

Next we ran the test again, but this time we did kill -9 on all Gluster processes on one node and left them down for more than an hour, keeping the tests running to emulate load, and then started the Gluster daemon on that storage node again. Around 10% of all the VMs lost their connection to Gluster and fell to a read-only file system, and more instances ended up with data corruption: missing or broken files. Very bad!

We wiped the VMs and created new ones, then started the same test again, but this time we terminated 4 bricks on one node and ran load tests to exercise shrinking and re-balancing. Before we even got the chance to remove/move the bricks we started getting a bunch of corrupted VMs and general data corruption, and after re-balancing we got a load of kernel panics on the VMs. Very bad indeed!

Is anyone else having the same problem? Is there anything we are doing wrong? Is this a lost cause?

Thanks for any input.

-Bob
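
P.S. In case it helps anyone reproduce this, here is a simplified sketch of what the sanity-check script does: create files, copy them repeatedly, md5-verify every copy, and re-check the OS files against a baseline. It is illustrative only; the directory paths, file sizes, iteration counts and OS paths below are placeholders, not the exact values or script we actually used.

#!/usr/bin/env python
"""Simplified sketch of the sanity-check logic (placeholder values only)."""
import hashlib
import os
import shutil

WORK_DIR = "/mnt/gluster-test/sanity"   # placeholder: test dir on the Gluster mount
OS_DIRS = ["/bin", "/usr/bin"]          # placeholder: OS paths to baseline-checksum
FILE_SIZE = 4 * 1024 * 1024             # placeholder: 4 MiB per test file
NUM_FILES = 8                           # placeholder: files per round
NUM_ROUNDS = 100                        # placeholder: copy/verify rounds


def md5_of(path):
    """Return the md5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def baseline_os_checksums():
    """Checksum the (read-only) OS files once, before the test starts."""
    sums = {}
    for top in OS_DIRS:
        for root, _dirs, files in os.walk(top):
            for name in files:
                path = os.path.join(root, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    sums[path] = md5_of(path)
    return sums


def run():
    if not os.path.isdir(WORK_DIR):
        os.makedirs(WORK_DIR)
    os_sums = baseline_os_checksums()

    # Create the initial set of test files and remember their checksums.
    expected = {}
    for i in range(NUM_FILES):
        path = os.path.join(WORK_DIR, "file-%d" % i)
        with open(path, "wb") as f:
            f.write(os.urandom(FILE_SIZE))
        expected[path] = md5_of(path)

    for round_no in range(NUM_ROUNDS):
        # Copy every file again and again and verify each copy's checksum.
        for src, want in expected.items():
            dst = "%s.copy-%d" % (src, round_no)
            shutil.copyfile(src, dst)
            got = md5_of(dst)
            if got != want:
                print("CORRUPTION: %s (want %s, got %s)" % (dst, want, got))

        # Re-verify the operating system files against the baseline.
        for path, want in os_sums.items():
            try:
                got = md5_of(path)
            except (IOError, OSError) as e:
                print("UNREADABLE: %s (%s)" % (path, e))
                continue
            if got != want:
                print("OS CORRUPTION: %s (want %s, got %s)" % (path, want, got))


if __name__ == "__main__":
    run()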