Re: NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

Rajesh Joseph <rjoseph@xxxxxxxxxx> · Mon, 15 Jun 2015 18:28:26 +0530

On Monday 15 June 2015 05:21 PM, Kaushal M wrote:
The hang we observe is not something specific to Gluster. I've
observed this kind of hang when a filesystem which is in use goes
offline.
For example, I've accidentally shut down machines which were being used
for NFS mounts, which led to the client systems hanging completely
and requiring a hard reboot.

If there are ways to avoid these kinds of hangs when they eventually
occur, I'm all ears.
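
One general way to keep a caller from blocking on a dead mount is to run the filesystem access in a child process and give up waiting after a deadline. The following is only a minimal sketch of that idea, not something taken from the Gluster test framework; the function name probe_mount and its return codes are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if stat(path) succeeded within
 * timeout_sec, 1 if it completed with an error, 2 if it did not
 * complete in time, and -1 if the probe could not be run. */
int probe_mount(const char *path, int timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this call is the one that may block forever. */
        struct stat st;
        _exit(stat(path, &st) == 0 ? 0 : 1);
    }

    /* Parent: poll for the child every 100 ms until the deadline. */
    struct timespec tick = { 0, 100 * 1000 * 1000 };
    for (int waited_ms = 0; waited_ms < timeout_sec * 1000; waited_ms += 100) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Best effort only: a child stuck inside the kernel may ignore
     * this, so the parent does not wait for it again. */
    kill(pid, SIGKILL);
    return 2;
}
```

This only keeps the calling process responsive. It does not release the stuck mount, and the child may stay blocked until the server returns or the machine is rebooted.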

For these test cases, can't we use the NFS soft mount option to prevent
the hang?

On Mon, Jun 15, 2015 at 4:38 PM, Pranith Kumar Karampuri
<pkarampu@xxxxxxxxxx> wrote:
Emmanuel,
        I am not sure of the feasibility, but just wanted to ask you. Do you
think it is possible to error out operations on the mount when the mount
crashes, instead of hanging? That would prevent a lot of manual intervention
in the future.

Pranith.

On 06/15/2015 01:35 PM, Niels de Vos wrote:
Hi,

sometimes the NetBSD regression tests hang with messages like this:

      [12:29:07] ./tests/basic/mgmt_v3-locks.t
      ........................................... ok    79867 ms
      No volumes present
      mount_nfs: can't access /patchy: Permission denied
      mount_nfs: can't access /patchy: Permission denied
      mount_nfs: can't access /patchy: Permission denied

Most (if not all) of these hangs are caused by a crashing Gluster/NFS
process. Once the Gluster/NFS server is not reachable anymore,
unmounting fails.

The only way to recover is to reboot the VM and retrigger the test. For
rebooting, the http://build.gluster.org/job/reboot-vm job can be used,
and retriggering works by clicking the "retrigger" link in the left menu
once the test has been marked as failed/aborted.

After logging in on the NetBSD system that hangs, you can verify the
cause with these steps:

1. check if there is a /glusterfsd.core file
2. run gdb on the core:

      # cd /build/install
      # gdb --core=/glusterfsd.core sbin/glusterfs
      ...
      Program terminated with signal SIGSEGV, Segmentation fault.
      #0  0xb9b94f0b in auth_cache_lookup (cache=0xb9aa2310, fh=0xb9044bf8,
      host_addr=0xb900e400 "104.130.205.187", timestamp=0xbf7fd900,
      can_write=0xbf7fd8fc)
          at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/nfs/server/src/auth-cache.c:164
      164             *can_write = lookup_res->item->opts->rw;

3. verify the lookup_res structure:

      (gdb) p *lookup_res
      $1 = {timestamp = 1434284981, item = 0xb901e3b0}
      (gdb) p *lookup_res->item
      $2 = {name = 0xffffff00 <error: Cannot access memory at address
      0xffffff00>, opts = 0xffffffff}

A fix for this has been sent; it is currently waiting for an update to
the proposed reference counting:

    - http://review.gluster.org/11022
      core: add "gf_ref_t" for common refcounting structures
    - http://review.gluster.org/11023
      nfs: refcount each auth_cache_entry and related data_t
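
To illustrate why reference counting addresses this kind of crash: a lookup takes its own reference on the cache entry, so the entry cannot be freed underneath it while it is still reading fields such as the rw flag. The sketch below uses hypothetical names (cache_entry, entry_ref, entry_unref) and C11 atomics; it is not the "gf_ref_t" API from the patches above.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached auth entry. */
struct cache_entry {
    atomic_uint refcount; /* number of holders still using the entry */
    int         rw;       /* stand-in for item->opts->rw */
    int        *freed;    /* set to 1 on release, for demonstration only */
};

/* Create an entry; the caller owns the initial reference. */
struct cache_entry *entry_new(int rw, int *freed)
{
    struct cache_entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return NULL;
    atomic_init(&e->refcount, 1);
    e->rw = rw;
    e->freed = freed;
    return e;
}

/* Take an additional reference before using the entry. */
struct cache_entry *entry_ref(struct cache_entry *e)
{
    atomic_fetch_add(&e->refcount, 1);
    return e;
}

/* Drop a reference; whoever drops the last one frees the entry. */
void entry_unref(struct cache_entry *e)
{
    /* atomic_fetch_sub returns the value held before the decrement. */
    if (atomic_fetch_sub(&e->refcount, 1) == 1) {
        *e->freed = 1;
        free(e);
    }
}
```

With this pattern, a lookup that called entry_ref() can still read the entry safely after the cache has dropped its own reference; the memory is released only when the lookup calls entry_unref().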

Thanks,
Niels
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
