Re: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Tue, 13 Sep 2016 10:09:50 +0200

Hi Sanoj,

On 13/09/16 09:41, Sanoj Unnikrishnan wrote:
Hi Xavi,

That explains a lot,
I see a couple of other scenarios which can lead to a similar inconsistency.
1) simultaneous node/brick crash of 3 bricks.

Although this is a real problem, the 3 bricks would have to crash at 
exactly the same moment, just after having successfully locked the inode 
being modified and queried some information, but before the write fop or 
any down notification has been sent. The probability of hitting this 
problem is really small.

2) if the underlying filesystem on which the brick is hosted runs out of disk space on 3 bricks.

Yes. This is the same cause that makes quota fail.

I don't think we can address all the scenarios unless we have a log/journal mechanism like raid-5.

I completely agree. I don't see any solution valid for all cases. BTW 
RAID-5 *is not* a solution. It doesn't have any log/journal. Maybe 
something based on fdl xlator would work.

Should we look at a quota specific fix or let it get fixed whenever we introduce a log?

Not sure how to fix this in a way that doesn't seem too hacky.

One possibility is to request permission to write some data before 
actually writing it (specifying offset and size), and then be sure that 
the write will succeed if all bricks (or at least the minimum number of 
data bricks) have acknowledged the previous write permission request.
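
A minimal sketch of this "ask before writing" idea, as a toy simulation
rather than anything that exists in Gluster. The names Brick, reserve,
release and guarded_write are hypothetical, and `size` is assumed to be
the number of bytes each brick has to store:

```python
class Brick:
    """Toy brick with a byte budget; stands in for quota or disk limits."""

    def __init__(self, free_bytes):
        self.free = free_bytes
        self.reserved = {}          # request id -> reserved bytes

    def reserve(self, req_id, offset, size):
        # offset is carried for completeness; this toy model only checks size
        if size > self.free:
            return False
        self.free -= size
        self.reserved[req_id] = size
        return True

    def release(self, req_id):
        self.free += self.reserved.pop(req_id, 0)

    def write(self, req_id):
        # space was already set aside, so the write cannot run out of room
        self.reserved.pop(req_id)


def guarded_write(bricks, minimum, req_id, offset, size):
    """Write only if at least `minimum` bricks granted the reservation."""
    granted = [b for b in bricks if b.reserve(req_id, offset, size)]
    if len(granted) < minimum:
        for b in granted:
            b.release(req_id)       # nothing was written, so undo is trivial
        return False
    for b in granted:
        b.write(req_id)
    return True


# 6 bricks whose free space differs, as with small quota accounting drift
bricks = [Brick(f) for f in (100, 100, 100, 40, 40, 40)]
print(guarded_write(bricks, 4, "w1", 0, 50))   # only 3 bricks can grant
print([b.free for b in bricks])                # budgets are unchanged
```

The point of the extra round trip is that a refusal happens before any
brick has been modified, so there is nothing to roll back.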

Another approach would be to queue writes in a server side xlator until 
a commit message is received, but sending back an answer saying if 
there's enough space to do the write (this is, in some way, a very 
primitive log/journal approach).
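
A rough sketch of the queue-until-commit variant, again as a toy model
and not as real xlator code. StagingBrick, stage, commit, abort and
queued_write are hypothetical names chosen for illustration:

```python
class StagingBrick:
    """Toy server-side queue: writes are held until commit or abort."""

    def __init__(self, free_bytes):
        self.free = free_bytes
        self.pending = {}           # txn id -> (offset, data)
        self.stored = {}            # offset -> data, i.e. committed writes

    def stage(self, txn, offset, data):
        pending_bytes = sum(len(d) for _, d in self.pending.values())
        if pending_bytes + len(data) > self.free:
            return False            # answer "not enough space" up front
        self.pending[txn] = (offset, data)
        return True

    def commit(self, txn):
        offset, data = self.pending.pop(txn)
        self.stored[offset] = data
        self.free -= len(data)

    def abort(self, txn):
        self.pending.pop(txn, None)


def queued_write(bricks, minimum, txn, offset, data):
    """Stage everywhere, then commit only if enough bricks accepted."""
    staged = [b for b in bricks if b.stage(txn, offset, data)]
    ok = len(staged) >= minimum
    for b in staged:
        if ok:
            b.commit(txn)
        else:
            b.abort(txn)
    return ok


bricks = [StagingBrick(f) for f in (10, 10, 10, 2, 2, 2)]
print(queued_write(bricks, 4, "t1", 0, b"abcd"))  # 3 bricks refuse: aborted
print(queued_write(bricks, 4, "t2", 0, b"ab"))    # all 6 accept: committed
```

Compared with a plain permission request, the data travels with the
first message, and the commit message only has to confirm it.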

However both approaches will have a big performance impact if they 
cannot be executed in background.

Maybe it would be worth investing in fdl instead of trying to find a 
custom solution to this.

Xavi

Thanks and Regards,
Sanoj

----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "Sanoj Unnikrishnan" <sunnikri@xxxxxxxxxx>
Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Ashish Pandey" <aspandey@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Tuesday, September 13, 2016 11:50:27 AM
Subject: Re: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hi Sanoj,

I'm unable to see bug 1224180. Access is restricted.

Not sure what the problem is exactly, but I see that quota is involved.
Currently disperse doesn't play well with quota when the limit is near.

The reason is that not all bricks fail at the same time with EDQUOT due
to small differences in computed space. This causes a valid write to
succeed on some bricks and fail on others. If it fails simultaneously on
more than redundancy bricks but less than the number of data bricks,
there's no way to roll back the changes on the bricks that have
succeeded, so the operation is inconsistent and an I/O error is returned.

For example, on a 6:2 configuration (4 data bricks and 2 redundancy), if
3 bricks succeed and 3 fail, there are not enough bricks with the
updated version, but there aren't enough bricks with the old version either.

If you force 2 bricks to be down, the problem can appear more frequently
as only a single failure causes this problem.
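
The counting argument above can be sketched as follows. This is a
simplified model, not disperse code: it assumes every brick that is up
either applies the write or rejects it (e.g. with EDQUOT), and the
function name write_outcome is made up for illustration:

```python
def write_outcome(data, redundancy, failed, down=0):
    """Classify a write on a disperse set of data + redundancy bricks."""
    total = data + redundancy
    succeeded = total - down - failed   # bricks now holding the new version
    if succeeded >= data:
        return "ok"             # the new version can be reconstructed
    if failed >= data:
        return "clean failure"  # the old version is intact on enough bricks
    return "inconsistent"       # neither version has `data` matching bricks


# 6:2 configuration, all bricks up, 0 to 6 bricks rejecting the write
for failed in range(7):
    print(failed, write_outcome(4, 2, failed))

# with 2 bricks already down, a single failure is enough
print("2 down, 1 failed:", write_outcome(4, 2, failed=1, down=2))
```

With all bricks up only the 3/3 split is inconsistent, while with 2
bricks down every partial failure from 1 to 3 bricks is.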

Xavi

On 13/09/16 06:09, Raghavendra Gowdappa wrote:
+gluster-devel

----- Original Message -----
From: "Sanoj Unnikrishnan" <sunnikri@xxxxxxxxxx>
To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Ashish Pandey" <aspandey@xxxxxxxxxx>, xhernandez@xxxxxxxxxx,
"Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Sent: Monday, September 12, 2016 7:06:59 PM
Subject: Need help with https://bugzilla.redhat.com/show_bug.cgi?id=1224180

Hello Xavi/Pranith,

I have been able to reproduce the BZ with the following steps:

gluster volume create v_disp disperse 6 redundancy 2 $tm1:/export/sdb/br1 \
  $tm2:/export/sdb/b2 $tm3:/export/sdb/br3  $tm1:/export/sdb/b4 \
  $tm2:/export/sdb/b5 $tm3:/export/sdb/b6 force
#(Used only 3 nodes, should not matter here)
gluster volume start v_disp
mount -t glusterfs $tm1:v_disp /gluster_vols/v_disp
mkdir /gluster_vols/v_disp/dir1
dd if=/dev/zero of=/gluster_vols/v_disp/dir1/x bs=10k count=90000 &
gluster v quota v_disp enable
gluster v quota v_disp limit-usage /dir1 200MB
gluster v quota v_disp soft-timeout 0
gluster v quota v_disp hard-timeout 0
#optional remove 2 bricks (reproduces more often with this)
#pgrep glusterfsd | xargs kill -9

An I/O error is reported on stdout when the quota is exceeded, followed by "Disk quota exceeded".

Also note that the issue is seen when a flush happens simultaneously with
the quota limit being hit; hence it is seen only on some runs.

The following are the errors in the logs.
[2016-09-12 10:40:02.431568] E [MSGID: 122034]
[ec-common.c:488:ec_child_select] 0-v_disp-disperse-0: Insufficient
available childs for this request (have 0, need 4)
[2016-09-12 10:40:02.431627] E [MSGID: 122037]
[ec-common.c:1830:ec_update_size_version_done] 0-Disperse: sku-debug:
pre-version=0/0, size=0post-version=1865/1865, size=209571840
[2016-09-12 10:40:02.431637] E [MSGID: 122037]
[ec-common.c:1835:ec_update_size_version_done] 0-v_disp-disperse-0: Failed
to update version and size [Input/output error]
[2016-09-12 10:40:02.431664] E [MSGID: 122034]
[ec-common.c:417:ec_child_select] 0-v_disp-disperse-0: sku-debug: mask: 36,
ec->xl_up 36, ec->node_mask 3f, parent->mask:36, fop->parent->healing:0,
id:29

[2016-09-12 10:40:02.431673] E [MSGID: 122034]
[ec-common.c:480:ec_child_select] 0-v_disp-disperse-0: sku-debug: mask: 36,
remaining: 36, healing: 0, ec->xl_up 36, ec->node_mask 3f, parent->mask:36,
num:4, minimum: 1, id:29

...
[2016-09-12 10:40:02.487302] W [fuse-bridge.c:2311:fuse_writev_cbk]
0-glusterfs-fuse: 41159: WRITE => -1
gfid=ee0b4aa1-1f44-486a-883c-acddc13ee318 fd=0x7f1d9c003edc (Input/output
error)
[2016-09-12 10:40:02.500151] W [MSGID: 122006]
[ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode: 100644-100644)
[2016-09-12 10:40:02.500188] N [MSGID: 122029]
[ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0: Mismatching iatt in
answers of 'WRITE'
[2016-09-12 10:40:02.504551] W [MSGID: 122006]
[ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode: 100644-100644)
....
....

[2016-09-12 10:40:02.571272] N [MSGID: 122029]
[ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0: Mismatching iatt in
answers of 'WRITE'
[2016-09-12 10:40:02.571510] W [MSGID: 122006]
[ec-combine.c:206:ec_iatt_combine] 0-v_disp-disperse-0: Failed to combine
iatt (inode: 9816911356190712600-9816911356190712600, links: 1-1, uid: 0-0,
gid: 0-0, rdev: 0-0, size: 52423680-52413440, mode: 100644-100644)
[2016-09-12 10:40:02.571544] N [MSGID: 122029]
[ec-combine.c:93:ec_combine_write] 0-v_disp-disperse-0: Mismatching iatt in
answers of 'WRITE'
[2016-09-12 10:40:02.571772] W [fuse-bridge.c:1290:fuse_err_cbk]
0-glusterfs-fuse: 41160: FLUSH() ERR => -1 (Input/output error)

Also, for some fops before the write I noticed that the fop->mask field is
0. It's not clear why this happens.

[2016-09-12 10:40:02.431561] E [MSGID: 122034]
[ec-common.c:480:ec_child_select] 0-v_disp-disperse-0: sku-debug: mask: 0,
remaining: 0, healing: 0, ec->xl_up 36, ec->node_mask 3f, parent->mask:36,
num:0, minimum: 4, fop->id:34
[2016-09-12 10:40:02.431568] E [MSGID: 122034]
[ec-common.c:488:ec_child_select] 0-v_disp-disperse-0: Insufficient
available childs for this request (have 0, need 4)
[2016-09-12 10:40:02.431637] E [MSGID: 122037]
[ec-common.c:1835:ec_update_size_version_done] 0-v_disp-disperse-0: Failed
to update version and size [Input/output error]
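
For reading the sku-debug lines, the mask fields can be decoded as
bitmaps. This sketch assumes the values are printed in hexadecimal
(which "3f" suggests) and that bit i stands for brick i; both are
assumptions, and bricks_in_mask is a helper name made up here:

```python
def bricks_in_mask(mask_hex):
    """Return the brick indexes whose bit is set in a hex mask string."""
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]


node_mask = bricks_in_mask("3f")   # ec->node_mask from the log
xl_up = bricks_in_mask("36")       # ec->xl_up from the log
print("configured:", node_mask)
print("up:", xl_up)
print("down:", sorted(set(node_mask) - set(xl_up)))
print("mask 0:", bricks_in_mask("0"))
```

Under those assumptions xl_up 36 has 4 bits set, which matches "num:4"
with 2 bricks killed, and a mask of 0 selects no brick at all, which
matches "have 0, need 4".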

Is the zero value of fop->mask related to the mismatch in iatt?
Could there be a race between the write and flush fops?
Please suggest how to proceed.

Thanks and Regards,
Sanoj
