3.3.1 Replicate only replicating one way

marcus at synchromedia.co.uk (Marcus Bointon) · Fri, 1 Mar 2013 01:37:42 +0100

I've given up on trying to upgrade a 3.2.5 installation to 3.3.1 directly, so I'm scrapping it and starting again. I'm on Ubuntu Lucid, using the stock packages from the semiosis PPA.

My config is very simple - 2 nodes running replicate on a single volume with 4G of small files, created like this:

gluster volume create shared replica 2 transport tcp 192.168.0.8:/var/shared 192.168.0.34:/var/shared

I copied off all files from the gluster volume, removed all signs of gluster 3.2.5, installed 3.3.1, reconfigured using the same commands as for 3.2.5. Install, peer probe, volume creation and mount (via NFS) all reported working correctly. The problem I'm now seeing is that I can touch a file on one side and it appears on the other, but not the other way around.

If I ask for heal info on the volume, both nodes report zero differences, but ls shows that the directory contents differ! If I request a full heal, the missing files appear correctly and the fixed files show up in the healed list. Something is clearly not talking...
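For anyone wanting to reproduce the ls comparison, here is a minimal sketch in Python. It assumes a file listing has been captured on each node (for example with `find . -type f` run inside /var/shared) and copied to one machine. The function name `compare_listings` and the sample file names are hypothetical, and skipping the `.glusterfs` directory in the brick root is an assumption about what should count as a real difference.

```python
def _normalise(lines, ignore_top):
    """Turn raw listing lines into a set of relative paths."""
    paths = set()
    for line in lines:
        path = line.rstrip("\n")
        # `find . -type f` prefixes every entry with "./"
        if path.startswith("./"):
            path = path[2:]
        if not path or path == ".":
            continue
        # Skip internal bookkeeping directories (assumption: .glusterfs)
        if path.split("/", 1)[0] in ignore_top:
            continue
        paths.add(path)
    return paths


def compare_listings(listing_a, listing_b, ignore_top=(".glusterfs",)):
    """Return (only_in_a, only_in_b) as sorted lists of relative paths."""
    a = _normalise(listing_a, ignore_top)
    b = _normalise(listing_b, ignore_top)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    # Hypothetical listings, one per node
    node1 = ["./a.txt", "./b.txt", "./.glusterfs/00/x"]
    node2 = ["./a.txt"]
    only_1, only_2 = compare_listings(node1, node2)
    print("only on node 1:", only_1)
    print("only on node 2:", only_2)
```

Any path that shows up in only one of the two lists is a file that has not been replicated in that direction, regardless of what heal info reports.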

I doubt it's a firewall issue since this was previously a working setup and the firewall hasn't been touched.

I'm finding it hard to track down since gluster's logs are spread across so many places - just this simple config has 20+ logs - and I've not found anything to explain this behaviour.

Node 1:

# gluster peer status
Number of Peers: 1

Hostname: 192.168.0.8
Uuid: 8f30902f-f125-47bc-87dd-fa48e583efd3
State: Peer in Cluster (Connected)

# gluster volume status
Status of volume: shared
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.0.8:/var/shared                           24010   Y       22440
Brick 192.168.0.34:/var/shared                          24009   Y       16957
NFS Server on localhost                                 38467   Y       16963
Self-heal Daemon on localhost                           N/A     Y       16969
NFS Server on 192.168.0.8                               38467   Y       22446
Self-heal Daemon on 192.168.0.8                         N/A     Y       22452

Node 2:

# gluster peer status
Number of Peers: 1

Hostname: 192.168.0.34
Uuid: cf6d4c23-a5a2-4c35-859c-52410b6429e1
State: Peer in Cluster (Connected)

# gluster volume status
Status of volume: shared
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.0.8:/var/shared                           24010   Y       22440
Brick 192.168.0.34:/var/shared                          24009   Y       16957
NFS Server on localhost                                 38467   Y       22446
Self-heal Daemon on localhost                           N/A     Y       22452
NFS Server on 192.168.0.34                              38467   Y       16963
Self-heal Daemon on 192.168.0.34                        N/A     Y       16969

Having said all that, I've just noticed that files *are* appearing on the other node in the direction I thought they were not, but it's *really* slow. I copied about 10,000 files onto one node and they are all visible there, but after 30 minutes only 10% of them are present on the other node, and those are all listed in the 'info healed' output. This sounds to me as if replication in that direction is happening only via self-heal and not through the normal replication route; it's certainly not synchronous. Any idea what could be amiss?
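To put a number on "really slow", here is a rough back-of-the-envelope sketch in Python. It assumes the heal rate stays constant and takes the figures above at face value (about 10,000 files, roughly 10% healed after 30 minutes). The function name `estimate_remaining_minutes` is hypothetical.

```python
def estimate_remaining_minutes(total_files, healed_files, elapsed_minutes):
    """Estimate minutes left, assuming the observed heal rate stays constant."""
    if healed_files <= 0:
        raise ValueError("need at least one healed file to estimate a rate")
    remaining = total_files - healed_files
    # remaining / (healed_files / elapsed_minutes), rearranged
    return remaining * elapsed_minutes / healed_files


if __name__ == "__main__":
    # About 10,000 files, about 1,000 (10%) healed after 30 minutes
    print(estimate_remaining_minutes(10_000, 1_000, 30))  # 270.0 minutes, about 4.5 hours
```

At that rate the other node would need roughly another four and a half hours to catch up, which is far slower than synchronous replication should be.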

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info at hand CRM solutions
marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/