This is my crash test scenario:
1. hosts
server1 - member of gluster volume
server2 - member of gluster volume
client1 - gluster storage activity - reads: ~10/s, writes: ~10/s
client2 - gluster storage activity - reads: ~10/s, writes: ~10/s
2. AFR gluster storage (tested 3.2.5 and 3.3beta2)
# gluster volume info
Volume Name: data
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: server1:/fs/data
Brick2: server2:/fs/data
3. storage /fs/data:
~ 300 000 files (size < 10KB)
~ 3 GB
4. crash test scenario
- server1 goes down
- clients get a few "Input/output error" messages (for reads and writes) and
continue working - fine
- server1 recovers (after ~3 minutes)
- clients get a few "Input/output error" messages (for reads and writes) - fine
- access to gluster storage from clients is blocked (self-healing process,
a few minutes with my hardware configuration)
- during this self-healing process server2 goes down
- the self-healing process is interrupted and clients regain access to gluster data
- server2 recovers and the real problems start
- clients: data inaccessible: permanent "Input/output error" for files
and directories
client1:
# ls -la a
ls: cannot access a: Input/output error
# ls -la
?????????? ? ? ? ? ? a
server1:
# getfattr -d -m . a
# file: a
trusted.afr.data2-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.data2-client-1=0sAAAAAAAAAAAAAAAq
trusted.gfid=0sfdlzd6TeRxelnMeCG9ut/w==
server2:
# getfattr -d -m . a
# file: a
trusted.afr.data2-client-0=0sAAAAAAAAAAAAAAA1
trusted.afr.data2-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sfdlzd6TeRxelnMeCG9ut/w==
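
The trusted.afr.* values above are base64 (getfattr marks them with the "0s" prefix), so the pending counters are not readable at a glance. Below is a minimal sketch for decoding them, assuming the usual AFR changelog layout of 12 bytes holding three big-endian 32-bit counters (data, metadata, entry). The function name decode_afr_changelog is made up for this illustration and is not part of gluster.

```python
import base64
import struct


def decode_afr_changelog(value):
    """Decode a getfattr-style '0s<base64>' AFR changelog value.

    Returns a (data, metadata, entry) tuple of pending counters.
    Assumption: the value is 12 bytes, three big-endian unsigned
    32-bit integers in that order.
    """
    if not value.startswith("0s"):
        raise ValueError("expected a base64 value with the '0s' prefix")
    raw = base64.b64decode(value[2:])
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    return struct.unpack(">III", raw)


# Values copied from the getfattr output above.
server1_against_client1 = decode_afr_changelog("0sAAAAAAAAAAAAAAAq")
server2_against_client0 = decode_afr_changelog("0sAAAAAAAAAAAAAAA1")

print(server1_against_client1)  # (0, 0, 42)
print(server2_against_client0)  # (0, 0, 53)
```

Under that assumption, server1 holds 42 pending entry operations against client-1 and server2 holds 53 against client-0. If client-0 and client-1 correspond to Brick1 and Brick2 (the usual numbering, but not confirmed here), each brick is accusing the other for directory "a".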
clients (/var/log/glusterfs/data.log):
[2012-02-08 13:24:16.837976] I [afr-self-heal-common.c:705:afr_mark_sources] 0-data2-replicate-0: split-brain possible, no source detected
[2012-02-08 13:24:16.838079] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 565416: LOOKUP() /a => -1 (Input/output error)
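
The "no source detected" message is consistent with both bricks holding non-zero counters against each other. The sketch below is a simplified model of that decision, not the actual afr_mark_sources code: it assumes a brick can act as heal source only if no other brick holds a pending counter against it. The function name find_sources and the matrix layout are made up for this illustration; the counters 42 and 53 are the entry counters obtained by base64-decoding the trusted.afr.* values in this report.

```python
def find_sources(pending):
    """Return indexes of bricks that no other brick accuses.

    pending[i][j] is the counter brick i holds against brick j.
    Simplified model: a brick is a usable heal source only if every
    other brick's counter against it is zero.
    """
    n = len(pending)
    return [
        j for j in range(n)
        if not any(pending[i][j] for i in range(n) if i != j)
    ]


# Entry counters from this report (row = accusing brick):
# server1 holds 42 against client-1, server2 holds 53 against client-0.
entry_pending = [
    [0, 42],
    [53, 0],
]

sources = find_sources(entry_pending)
print(sources)  # [] -> no brick is clean, so nothing can be healed from
```

With only one side accusing, for example [[0, 3], [0, 0]], the same function returns [0], which is the normal self-heal case where brick 0 is copied over brick 1.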
Issues like this make gluster unusable in a production system.
--
Robert