KP, Atin and I did some debugging and found that there was a deadlock in glusterd.
When creating a volume snapshot, the back-end operations (taking an LVM snapshot and starting the brick) for each brick are executed in parallel using the synctask framework. brick_start releases the big_lock around brick_connect and then acquires it again. This caused a deadlock in a race condition where the main thread was waiting for one of the synctask threads to finish while the synctask thread was waiting for the big_lock.
We are working on fixing this issue.
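To illustrate the shape of the race, here is a minimal pthreads sketch (this is not the actual glusterd code; big_lock here is just a mutex standing in for glusterd's big_lock, and the thread stands in for the synctask running brick_start):

#include <pthread.h>
#include <stdio.h>

/* Stand-in for glusterd's big_lock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the synctask that runs brick_start: in the buggy
 * path it ends up waiting to (re)acquire big_lock. */
static void *
brick_start_task (void *arg)
{
        (void) arg;
        pthread_mutex_lock (&big_lock);   /* blocks: main still holds it */
        /* the brick_connect () work would happen around here */
        pthread_mutex_unlock (&big_lock);
        return NULL;
}

int
main (void)
{
        pthread_t task;

        pthread_mutex_lock (&big_lock);   /* main thread holds big_lock */
        pthread_create (&task, NULL, brick_start_task, NULL);

        /* Main waits for the task to finish, but the task is waiting
         * for big_lock, which main still holds: deadlock. */
        pthread_join (task, NULL);

        pthread_mutex_unlock (&big_lock); /* never reached */
        printf ("done\n");
        return 0;
}

Compiled with gcc -pthread, this hangs forever in pthread_join, which matches the shape of the hang we saw in glusterd.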
Thanks,
Vijay
On Wednesday 21 May 2014 12:23 PM, Vijaikumar M wrote:
From the log http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgz it looks like glusterd was hung:
Glusterd log:
5305 [2014-05-20 20:08:55.040665] E [glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 0-management: Unable to fetch snap device (vol1.brick_snapdevice0). Leaving empty
5306 [2014-05-20 20:08:55.649146] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
5307 [2014-05-20 20:08:55.663181] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
5308 [2014-05-20 20:16:55.541197] W [glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), shutting down
Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t.
CLI log:
72649 [2014-05-20 20:12:51.960765] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] (-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (-->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected
72651 [2014-05-20 20:12:52.960943] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1
72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) [0x7ff8ad9ec7d0]))) 0-dict: data is NULL
72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) [0x7ff8ad9ec7df]))) 0-dict: data is NULL
72655 [2014-05-20 20:12:52.961079] E [name.c:140:client_fill_address_family] 0-glusterfs: transport.address-family not specified. Could not guess default value from (remote-host:(null) or transport.unix.connect-path:(null)) options
72656 [2014-05-20 20:12:54.961273] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] (-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (-->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected
72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 110
72659 [2014-05-20 20:12:55.120723] D [cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110
Now we need to find out why glusterd hung.
Thanks,
Vijay
On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote:
Hey,
It seems that even after this fix was merged, the regression tests are failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz
Relevant logs:
[2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS
[2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS
[2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS
[2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier.
[2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS
[2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier.
Pranith
----- Original Message -----
From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Cc: "Joseph Fernandes" <josferna@xxxxxxxxxx>, "Vijaikumar M" <vmallika@xxxxxxxxxx>
Sent: Tuesday, May 20, 2014 3:41:11 PM
Subject: Re: Spurious failures because of nfs and snapshots
hi,
Please resubmit the patches on top of http://review.gluster.com/#/c/7753
to prevent frequent regression failures.
Pranith
----- Original Message -----
From: "Vijaikumar M" <vmallika@xxxxxxxxxx>
To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
Cc: "Joseph Fernandes" <josferna@xxxxxxxxxx>, "Gluster Devel"
<gluster-devel@xxxxxxxxxxx>
Sent: Monday, May 19, 2014 2:40:47 PM
Subject: Re: Spurious failures because of nfs and snapshots
Brick disconnected with ping-timeout. Here is the log message:
[2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/bricks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs
3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.
Patch http://review.gluster.org/#/c/7753/ will fix the problem: the ping timer will be disabled by default for all RPC connections except glusterd-to-glusterd (set to 30 sec) and client-to-glusterd (set to 42 sec).
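As a rough illustration of those intended defaults (a sketch only; the enum and function names below are made up for this example and are not the actual rpc-clnt API):

#include <stdio.h>

/* Hypothetical connection types; the real patch distinguishes
 * connections inside rpc-clnt based on who is connecting to whom. */
typedef enum {
        CONN_GLUSTERD_TO_GLUSTERD,
        CONN_CLIENT_TO_GLUSTERD,
        CONN_OTHER,
} conn_type_t;

/* Returns the default ping-timeout in seconds; 0 disables the timer. */
static int
default_ping_timeout (conn_type_t type)
{
        switch (type) {
        case CONN_GLUSTERD_TO_GLUSTERD:
                return 30;  /* glusterd <-> glusterd keeps a 30 sec timeout */
        case CONN_CLIENT_TO_GLUSTERD:
                return 42;  /* client <-> glusterd keeps a 42 sec timeout */
        default:
                return 0;   /* all other RPC connections: timer disabled */
        }
}

int
main (void)
{
        printf ("glusterd-glusterd: %d\n", default_ping_timeout (CONN_GLUSTERD_TO_GLUSTERD));
        printf ("client-glusterd:   %d\n", default_ping_timeout (CONN_CLIENT_TO_GLUSTERD));
        printf ("other:             %d\n", default_ping_timeout (CONN_OTHER));
        return 0;
}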
Thanks,
Vijay
On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote:
The latest build failure also has the same issue:
Download it from here:
http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz
Pranith
----- Original Message -----
From: "Vijaikumar M" <vmallika@xxxxxxxxxx>
To: "Joseph Fernandes" <josferna@xxxxxxxxxx>
Cc: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>, "Gluster Devel"
<gluster-devel@xxxxxxxxxxx>
Sent: Monday, 19 May, 2014 11:41:28 AM
Subject: Re: Spurious failures because of nfs and snapshots
Hi Joseph,
In the log mentioned below, it says ping-timeout is set to the default value of 30 sec. I think the issue is different.
Can you please point me to the logs where you were able to re-create the problem?
Thanks,
Vijay
On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:
hi Vijai, Joseph,
In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console and http://build.gluster.org/job/regression/4478/console, this test (tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you feel so. You can re-submit it along with the fix to the bug mentioned by Joseph.
Pranith.
----- Original Message -----
From: "Joseph Fernandes" <josferna@xxxxxxxxxx>
To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, 16 May, 2014 5:13:57 PM
Subject: Re: Spurious failures because of nfs and snapshots
Hi All,
tests/bugs/bug-1090042.t:
I was able to reproduce the issue, i.e. when this test is run in a loop:
for i in {1..135}; do ./bugs/bug-1090042.t; done
When I checked the logs:
[2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs
[2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs
The issue is with ping-timeout and is tracked under the bug
https://bugzilla.redhat.com/show_bug.cgi?id=1096729
The workaround is mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8
Regards,
Joe
----- Original Message -----
From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Cc: "Joseph Fernandes" <josferna@xxxxxxxxxx>
Sent: Friday, May 16, 2014 6:19:54 AM
Subject: Spurious failures because of nfs and snapshots
hi,
In the latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console), the run failed because of a spurious failure. The script doesn't wait for the nfs export to be available. I fixed that, but interestingly I found quite a few scripts with the same problem. Some of the scripts rely on 'sleep 5', which could also lead to spurious failures if the export is not available within 5 seconds. We found that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily delay the build execution. So if you guys are going to write any scripts which have to do nfs mounts, please do it the following way:
EXPECT_WITHIN 20 "1" is_nfs_export_available;
TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
Please review http://review.gluster.com/7773 :-)
I saw one more spurious failure in a snapshot-related script, tests/bugs/bug-1090042.t, on the next build fired by Niels. Joseph (CCed) is debugging it. He agreed to reply with what he finds and share it with us so that we won't introduce similar bugs in the future.
I encourage you guys to share what you fix to prevent spurious failures in the future.
Thanks
Pranith