Re: Need some help on Mismatching xdata / Failed combine iatt / Too many fd

Some time ago I saw an issue with Gluster-NFS combined with disperse under high write load. I thought it had already been solved, but this issue is very similar.

The problem seemed to be related to multithreaded epoll and throttling. For some reason NFS was sending a massive number of requests, ignoring the throttling threshold. This caused the NFS connection to become unresponsive. Combined with a lock that was held at the time of the hang, the lock was never released, blocking other clients.

Maybe it's not related to this problem, but I thought it could be important to consider.
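
If someone wants to rule that out, the multithreaded-epoll thread counts and the RPC throttling limit can be inspected and tuned per volume. A minimal sketch, assuming a volume named mainvol and the option names from the 3.7 series (check "gluster volume set help" on your release; the values below are only illustrative):

------
# Show current values (on releases that have "gluster volume get";
# older ones only show explicitly set options via "gluster volume info")
gluster volume get mainvol server.event-threads
gluster volume get mainvol client.event-threads
gluster volume get mainvol server.outstanding-rpc-limit

# Example adjustments: more epoll threads, smaller per-client RPC backlog
gluster volume set mainvol server.event-threads 4
gluster volume set mainvol server.outstanding-rpc-limit 32
------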

Xavi

On 22/04/16 08:19, Ashish Pandey wrote:

Hi Chen,

I thought I replied to your previous mail.
This issue has been faced by other users as well; Serkan is one of them, if you
follow his mail on gluster-users.

I still have to dig further into it.  Soon we will try to reproduce it
and debug it.
My observation is that we face this issue while IO is going on and one
of the servers gets disconnected and then reconnects.
This can happen because of an update or a network issue.
But in any case we should not end up in this situation.
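
Disconnect/reconnect events of that kind can usually be confirmed from the standard status output and the brick/NFS logs. A minimal sketch, assuming a volume named mainvol and the default log locations:

------
# Check that all bricks and the NFS server are currently online
gluster volume status mainvol

# Look for disconnect/reconnect messages around the time of the hang
grep -i disconnect /var/log/glusterfs/bricks/*.log
grep -i disconnect /var/log/glusterfs/nfs.log
------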

I am adding Pranith and Xavi, who can address any unanswered queries and
provide further explanation.

-----
Ashish

------------------------------------------------------------------------
*From: *"Chen Chen" <chenchen@xxxxxxxxxxxxxxxx>
*To: *"Joe Julian" <joe@xxxxxxxxxxxxxxxx>, "Ashish Pandey"
<aspandey@xxxxxxxxxx>
*Cc: *"Gluster Users" <gluster-users@xxxxxxxxxxx>
*Sent: *Friday, April 22, 2016 8:28:48 AM
*Subject: *Re:  Need some help on Mismatching xdata /
Failed combine iatt / Too many fd

Hi Ashish,

Are you still watching this thread? I got no response after I sent the
info you requested. Also, could anybody explain what lock-heal is doing?

I got another inode lock yesterday. Only one lock occurred across the
whole 12 bricks, yet it stopped the cluster from working again. None of
my peers' OSes were frozen, and this time "start force" worked.

------
[xlator.features.locks.mainvol-locks.inode]
path=<gfid:2092ae08-81de-4717-a7d5-6ad955e18b58>/NTD/variants_calling/primary_gvcf/A2612/13.g.vcf
mandatory=0
inodelk-count=2
lock-dump.domain.domain=mainvol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=dc3dbfac887f0000, client=0x7f649835adb0, connection-id=hw10-6664-2016/04/17-14:47:58:6629-mainvol-client-0-0, granted at 2016-04-21 11:45:30
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=d433bfac887f0000, client=0x7f649835adb0, connection-id=hw10-6664-2016/04/17-14:47:58:6629-mainvol-client-0-0, blocked at 2016-04-21 11:45:33
------
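
A lock dump like the one above can be captured on demand, and in some cases a single stuck lock can be cleared without restarting the whole volume. A minimal sketch, assuming a volume named mainvol, the default statedump directory, and a placeholder path for the locked file:

------
# Ask every brick (and the NFS server) to write a statedump
gluster volume statedump mainvol

# By default the dumps land in /var/run/gluster
ls /var/run/gluster/*.dump.*

# Clear blocked inode locks on one file (path is relative to the volume root)
gluster volume clear-locks mainvol /path/to/locked/file kind blocked inode
------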

I've also filed a bug report on bugzilla.
https://bugzilla.redhat.com/show_bug.cgi?id=1329466

Best regards,
Chen

On 4/13/2016 10:31 PM, Joe Julian wrote:
 >
 >
 > On 04/13/2016 03:29 AM, Ashish Pandey wrote:
 >> Hi Chen,
 >>
 >> What do you mean by "instantly got an inode lock and tore down
 >> the whole cluster"? Do you mean that the whole disperse volume became
 >> unresponsive?
 >>
 >> I don't know much about features.lock-heal, so I can't comment on how
 >> it can help you.
 >
 > So who should get added to this email that would have an idea? Let's get
 > that person looped in.
 >
 >>
 >> Could you please explain the second part of your mail? What exactly are
 >> you trying to do, and what is the setup?
 >> Also, the volume info, logs, and statedumps might help.
 >>
 >> -----
 >> Ashish
 >>
 >>
 >> ------------------------------------------------------------------------
 >> *From: *"Chen Chen" <chenchen@xxxxxxxxxxxxxxxx>
 >> *To: *"Ashish Pandey" <aspandey@xxxxxxxxxx>
 >> *Cc: *gluster-users@xxxxxxxxxxx
 >> *Sent: *Wednesday, April 13, 2016 3:26:53 PM
 >> *Subject: *Re:  Need some help on Mismatching xdata /
 >> Failed combine iatt / Too many fd
 >>
 >> Hi Ashish and other Gluster Users,
 >>
 >> When I put a heavy IO load onto my cluster (an rsync operation,
 >> ~600MB/s), one of the nodes instantly got an inode lock and tore down
 >> the whole cluster. I've already turned on "features.lock-heal" but it
 >> didn't help.
 >>
 >> My clients are using a round-robin scheme to mount the servers, hoping
 >> to spread the load. Could it be caused by a race between the NFS servers
 >> on different nodes? Should I instead create a dedicated NFS server with
 >> lots of memory, no bricks, and multiple Ethernet cables?
 >>
 >> I really appreciate any help from you guys.
 >>
 >> Best wishes,
 >> Chen
 >>
 >> PS. Don't know why the native FUSE client is five times slower than
 >> good old NFSv3.
 >>
 >> On 4/4/2016 6:11 PM, Ashish Pandey wrote:
 >> > Hi Chen,
 >> >
 >> > As I suspected, there are many blocked calls for inodelk in
 >> > sm11/mnt-disk1-mainvol.31115.dump.1459760675.
 >> >
 >> > =============================================
 >> > [xlator.features.locks.mainvol-locks.inode]
 >> > path=/home/analyzer/softs/bin/GenomeAnalysisTK.jar
 >> > mandatory=0
 >> > inodelk-count=4
 >> > lock-dump.domain.domain=mainvol-disperse-0:self-heal
 >> > lock-dump.domain.domain=mainvol-disperse-0
 >> > inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=dc2d3dfcc57f0000, client=0x7ff03435d5f0, connection-id=sm12-8063-2016/04/01-07:51:46:892384-mainvol-client-0-0-0, blocked at 2016-04-01 16:52:58, granted at 2016-04-01 16:52:58
 >> > inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=1414371e1a7f0000, client=0x7ff034204490, connection-id=hw10-17315-2016/04/01-07:51:44:421807-mainvol-client-0-0-0, blocked at 2016-04-01 16:58:51
 >> > inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=a8eb14cd9b7f0000, client=0x7ff01400dbd0, connection-id=sm14-879-2016/04/01-07:51:56:133106-mainvol-client-0-0-0, blocked at 2016-04-01 17:03:41
 >> > inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=b41a0482867f0000, client=0x7ff01800e670, connection-id=sm15-30906-2016/04/01-07:51:45:711474-mainvol-client-0-0-0, blocked at 2016-04-01 17:05:09
 >> > =============================================
 >> >
 >> > This could be the cause of the hang.
 >> > Possible workaround:
 >> > If there is no IO going on for this volume, we can restart the
 >> > volume using "gluster v start <volume-name> force". This will also
 >> > restart the NFS process, which will release the locks, and
 >> > we could come out of this issue.
 >> >
 >> > Ashish
 >>
 >> --
 >> Chen Chen
 >> Shanghai SmartQuerier Biotechnology Co., Ltd.
 >> Add: 3F, 1278 Keyuan Road, Shanghai 201203, P. R. China
 >> Mob: +86 15221885893
 >> Email: chenchen@xxxxxxxxxxxxxxxx
 >> Web: www.smartquerier.com
 >>
 >>
 >> _______________________________________________
 >> Gluster-users mailing list
 >> Gluster-users@xxxxxxxxxxx
 >> http://www.gluster.org/mailman/listinfo/gluster-users
 >>
 >

--
Chen Chen
Shanghai SmartQuerier Biotechnology Co., Ltd.
Add: 3F, 1278 Keyuan Road, Shanghai 201203, P. R. China
Mob: +86 15221885893
Email: chenchen@xxxxxxxxxxxxxxxx
Web: www.smartquerier.com


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users



