Hi,

we're trying to set up a web caching application on top of GlusterFS, which is supposed to provide shared storage for a bunch of servers (two right now). We have a similar setup running on NFS that works okay; GlusterFS, however, is giving us some headaches.

We use two servers ($GLUFS001 and $GLUFS002) in AWS, each with its brick file system mounted at /mnt/brick01/ and a vol01/ directory inside it used as the brick. We have tried both XFS and EXT4 as the underlying file system. The volume is created with:

sudo gluster volume create tst_vol_aws01 replica 2 transport tcp "$GLUFS001:/mnt/brick01/vol01/" "$GLUFS002:/mnt/brick01/vol01/"

The resulting volume info is:

Volume Name: tst_vol_aws01
Type: Replicate
Volume ID: f4b060fa-d6a8-45d2-89c5-1d99637dcc8a
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: $GLUFS001:/mnt/brick01/vol01
Brick2: $GLUFS002:/mnt/brick01/vol01
Options Reconfigured:
performance.cache-size: 256MB

The clients run on the very same machines and mount the volume with the GlusterFS native (FUSE) client:

sudo mount -t glusterfs localhost:/tst_vol_aws01 /mnt/tst_vol_aws
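For completeness, the provisioning around the create and mount commands above boils down to roughly the following; the EBS device name (/dev/xvdf) is just a placeholder and the EC2 instance setup itself is omitted:

# on both servers: prepare the brick file system (XFS here; we also tried EXT4)
sudo mkfs.xfs /dev/xvdf
sudo mkdir -p /mnt/brick01
sudo mount /dev/xvdf /mnt/brick01
sudo mkdir -p /mnt/brick01/vol01

# on $GLUFS001: form the trusted pool, create the volume as shown above, then start it
sudo gluster peer probe "$GLUFS002"
sudo gluster volume start tst_vol_aws01
sudo gluster volume set tst_vol_aws01 performance.cache-size 256MB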
This mount is used as storage by the web cache application. On a cache miss the application fetches the object from the origin and places it into the shared storage, so it can be served from there next time and is also available to the other web cache. It does this by writing the file to a temporary, per-client location (e.g. /mnt/tst_vol_aws/tmp/$IPADDR/) and then moving it into its final location (below /mnt/tst_vol_aws/sat_p_cache/).

As long as I send requests to only one of the instances, everything is fine. The trouble starts when I send requests to both of them and thereby cause concurrent writes to the storage. Though unlikely, it can happen that both clients try to put the very same file into the cache at the same time. They fetch identical content from the origin, so one overwriting the other's version would not be a problem.

The error our application receives is:

pread() "/mnt/tst_vol_aws/sat_p_cache/9_158_309/18/169/80930_158240" failed (22: Invalid argument)

Around that time I find various messages in /var/log/glusterfs/mnt-tst_vol_aws.log:

[2013-07-25 07:08:36.859467] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-tst_vol_aws01-client-1: remote operation failed: Stale file handle. Path: /sat_p_cache/9_158_309/19/682/161796_316418 (c19f0726-2c01-420f-b10f-926f8f55de43)
[2013-07-25 07:08:36.860412] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-tst_vol_aws01-client-0: remote operation failed: Stale file handle. Path: /sat_p_cache/9_158_309/19/682/161796_316418 (c19f0726-2c01-420f-b10f-926f8f55de43)
[2013-07-25 07:09:44.035227] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-tst_vol_aws01-client-1: remote operation failed: Stale file handle. Path: /sat_p_cache/9_158_309 (3628f18c-48c8-4b07-91cf-ff1922ab45f8)
[2013-07-25 07:09:44.035669] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-tst_vol_aws01-client-0: remote operation failed: Stale file handle. Path: /sat_p_cache/9_158_309 (3628f18c-48c8-4b07-91cf-ff1922ab45f8)
[2013-07-25 07:09:44.885802] I [afr-self-heal-common.c:1213:afr_sh_missing_entry_call_impunge_recreate] 0-tst_vol_aws01-replicate-0: no missing files - /sat_p_cache/9_158_309/18/169/80930_158240. proceeding to metadata check
[2013-07-25 07:09:44.892560] W [client-rpc-fops.c:259:client3_3_mknod_cbk] 0-tst_vol_aws01-client-1: remote operation failed: File exists. Path: /sat_p_cache/9_158_309/18/169/80930_158240 (00000000-0000-0000-0000-000000000000)
[2013-07-25 07:09:44.892994] W [client-rpc-fops.c:259:client3_3_mknod_cbk] 0-tst_vol_aws01-client-0: remote operation failed: File exists. Path: /sat_p_cache/9_158_309/18/169/80930_158240 (00000000-0000-0000-0000-000000000000)
[2013-07-25 07:09:44.893073] E [dht-helper.c:1052:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f541cd61025] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f541cd4801e] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7f541cd49364]))) 0-tst_vol_aws01-dht: invalid argument: inode
[2013-07-25 07:09:44.893106] E [dht-helper.c:1071:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f541cd61025] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f541cd4801e] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7f541cd49382]))) 0-tst_vol_aws01-dht: invalid argument: inode
[2013-07-25 07:09:44.893476] W [page.c:991:__ioc_page_error] 0-tst_vol_aws01-io-cache: page error for page = 0x7f54100556c0 & waitq = 0x7f5410048b50
[2013-07-25 07:09:44.893527] W [fuse-bridge.c:2049:fuse_readv_cbk] 0-glusterfs-fuse: 5663: READ => -1 (Invalid argument)
[2013-07-25 07:09:44.896395] W [fuse-bridge.c:875:fuse_fd_cbk] 0-glusterfs-fuse: 5665: OPEN() /sat_p_cache/9_158_309/18/169/80930_158240 => -1 (Invalid argument)
[2013-07-25 07:09:44.896524] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0/xlator/performance/open-behind.so(open_and_resume+0xa8) [0x7f541c30c288] (-->/usr/lib64/libglusterfs.so.0(call_stub_destroy+0x142) [0x7f54231cbbb2] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x144) [0x7f54231d8604]))) 0-fuse: xlator does not implement release_cbk
[2013-07-25 07:09:45.381747] I [afr-self-heal-common.c:1213:afr_sh_missing_entry_call_impunge_recreate] 0-tst_vol_aws01-replicate-0: no missing files - /sat_p_cache/9_158_309/19/674/161796_316486. proceeding to metadata check

I think the message that ends up as the error on our side is "invalid argument: inode", but the other messages are not exactly encouraging either. I can easily reproduce the problem without our service with this simple shell script:

#!/bin/sh
ip="`/sbin/ip a s eth0 |sed -n '/inet /{s%^.*inet %%;s%/.*$%%;p}'`"
tmpdir="/mnt/tst_vol_aws/tmp/$ip/foo/"
dstdir="/mnt/tst_vol_aws/sat_p_cache/foo/"
while true
do
    date >"$tmpdir/foo.$$"
    mv -v "$tmpdir/foo.$$" "$dstdir/foo.`date +%s`"
done

Started on both servers at the same time, the clients alternately get errors like:

mv: accessing `/mnt/tst_vol_aws/sat_p_cache/foo//foo.1374736742': Structure needs cleaning
mv: accessing `/mnt/tst_vol_aws/sat_p_cache/foo//foo.1374736742': Structure needs cleaning
mv: accessing `/mnt/tst_vol_aws/sat_p_cache/foo//foo.1374736742': Structure needs cleaning
mv: accessing `/mnt/tst_vol_aws/sat_p_cache/foo//foo.1374736742': Structure needs cleaning
mv: accessing `/mnt/tst_vol_aws/sat_p_cache/foo//foo.1374736742': Structure needs cleaning
mv: accessing `/mnt/tst_vol_aws/sat_p_cache/foo//foo.1374736742': Structure needs cleaning
mv: accessing `/mnt/tst_vol_aws/sat_p_cache/foo//foo.1374736742': Structure needs cleaning

and I get the same garbage in the gluster log files:

[2013-07-25 07:19:00.018670] E [afr-self-heal-common.c:2212:afr_self_heal_completion_cbk] 0-tst_vol_aws01-replicate-0: background meta-data data entry missing-entry gfid self-heal failed on /sat_p_cache/foo/foo.1374736740
[2013-07-25 07:19:00.019713] W [client-rpc-fops.c:259:client3_3_mknod_cbk] 0-tst_vol_aws01-client-0: remote operation failed: File exists. Path: /sat_p_cache/foo/foo.1374736740 (00000000-0000-0000-0000-000000000000)
[2013-07-25 07:19:00.020364] W [client-rpc-fops.c:259:client3_3_mknod_cbk] 0-tst_vol_aws01-client-1: remote operation failed: File exists. Path: /sat_p_cache/foo/foo.1374736740 (00000000-0000-0000-0000-000000000000)
[2013-07-25 07:19:00.020440] E [dht-helper.c:1052:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f73242c0025] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f73242a701e] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7f73242a8364]))) 0-tst_vol_aws01-dht: invalid argument: inode
[2013-07-25 07:19:00.020472] E [dht-helper.c:1071:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f73242c0025] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f73242a701e] (-->/usr/lib64/glusterfs/3.4.0/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7f73242a8382]))) 0-tst_vol_aws01-dht: invalid argument: inode
[2013-07-25 07:19:00.172323] I [afr-self-heal-entry.c:2253:afr_sh_entry_fix] 0-tst_vol_aws01-replicate-0: /sat_p_cache/foo: Performing conservative merge
[2013-07-25 07:19:01.068257] W [client-rpc-fops.c:526:client3_3_stat_cbk] 0-tst_vol_aws01-client-0: remote operation failed: No such file or directory
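If it helps, I can also dump the extended attributes (gfid etc.) of one of the colliding entries directly from the brick directories on both servers, along the lines of:

sudo getfattr -d -m . -e hex /mnt/brick01/vol01/sat_p_cache/foo/foo.1374736740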
Any advice on this? What am I doing wrong? If required, I can provide a simple shell script that sets all of this up in AWS and reproduces the problem.

Thanks a lot in advance.

Kind regards,
Frank.