Re: NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

Kaushal M <kshlmster@xxxxxxxxx> · Mon, 15 Jun 2015 17:21:08 +0530

The hang we observe is not specific to Gluster. I've seen this kind of
hang whenever a filesystem that is in use goes offline.
For example, I've accidentally shut down machines that were serving
NFS mounts, which led to the client systems hanging completely and
requiring a hard reboot.

If there are ways to avoid these kinds of hangs when they eventually
occur, I'm all ears.

On Mon, Jun 15, 2015 at 4:38 PM, Pranith Kumar Karampuri
<pkarampu@xxxxxxxxxx> wrote:
> Emmanuel,
>        I am not sure about the feasibility, but I wanted to ask: do you
> think it is possible to make operations on the mount fail with an error
> when the mount crashes, instead of hanging? That would save a lot of
> manual intervention in the future.
>
> Pranith.
>
> On 06/15/2015 01:35 PM, Niels de Vos wrote:
>>
>> Hi,
>>
>> sometimes the NetBSD regression tests hang with messages like this:
>>
>>      [12:29:07] ./tests/basic/mgmt_v3-locks.t
>>      ........................................... ok    79867 ms
>>      No volumes present
>>      mount_nfs: can't access /patchy: Permission denied
>>      mount_nfs: can't access /patchy: Permission denied
>>      mount_nfs: can't access /patchy: Permission denied
>>
>> Most (if not all) of these hangs are caused by a crashing Gluster/NFS
>> process. Once the Gluster/NFS server is not reachable anymore,
>> unmounting fails.
>>
>> The only way to recover is to reboot the VM and retrigger the test. For
>> rebooting, the http://build.gluster.org/job/reboot-vm job can be used,
>> and retriggering works by clicking the "retrigger" link in the left menu
>> once the test has been marked as failed/aborted.
>>
>> After logging in on the NetBSD system that hangs, you can verify this
>> with the following steps:
>>
>> 1. check if there is a /glusterfsd.core file
>> 2. run gdb on the core:
>>
>>      # cd /build/install
>>      # gdb --core=/glusterfsd.core sbin/glusterfs
>>      ...
>>      Program terminated with signal SIGSEGV, Segmentation fault.
>>      #0  0xb9b94f0b in auth_cache_lookup (cache=0xb9aa2310, fh=0xb9044bf8,
>>      host_addr=0xb900e400 "104.130.205.187", timestamp=0xbf7fd900,
>>      can_write=0xbf7fd8fc)
>>          at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/nfs/server/src/auth-cache.c:164
>>      164             *can_write = lookup_res->item->opts->rw;
>>
>> 3. verify the lookup_res structure:
>>
>>      (gdb) p *lookup_res
>>      $1 = {timestamp = 1434284981, item = 0xb901e3b0}
>>      (gdb) p *lookup_res->item
>>      $2 = {name = 0xffffff00 <error: Cannot access memory at address
>>      0xffffff00>, opts = 0xffffffff}
>>
>>
>> A fix for this has been sent; it is currently waiting for an update to
>> the proposed reference counting:
>>
>>    - http://review.gluster.org/11022
>>      core: add "gf_ref_t" for common refcounting structures
>>    - http://review.gluster.org/11023
>>      nfs: refcount each auth_cache_entry and related data_t
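
For readers who have not looked at the patches: the backtrace above is
consistent with lookup_res->item pointing at memory that was already
freed. Reference counting avoids that by freeing an object only when its
last holder lets go. The following is a minimal sketch of the idea only.
It is NOT the gf_ref_t API from the reviews above; struct item,
item_new(), item_ref(), item_unref() and items_freed are hypothetical
names made up for illustration, and the counter is not thread-safe (real
code would need atomics or a lock).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached export item; not the Gluster struct. */
struct item {
    int refs; /* number of holders; the item is freed when this hits 0 */
    int rw;   /* read-write flag, comparable to opts->rw in the backtrace */
};

/* Demonstration-only counter so the effect of the last unref is visible. */
static int items_freed = 0;

/* Allocate an item; the caller holds the first reference. */
struct item *item_new(int rw)
{
    struct item *it = calloc(1, sizeof(*it));
    if (it == NULL)
        return NULL;
    it->refs = 1;
    it->rw = rw;
    return it;
}

/* Take an additional reference, e.g. when a cache entry stores the pointer. */
struct item *item_ref(struct item *it)
{
    assert(it != NULL && it->refs > 0);
    it->refs++;
    return it;
}

/* Drop one reference; free the item only when nobody holds it anymore. */
void item_unref(struct item *it)
{
    assert(it != NULL && it->refs > 0);
    if (--it->refs == 0) {
        free(it);
        items_freed++;
    }
}
```

With this pattern a cache entry would call item_ref() when it stores the
pointer and item_unref() when the entry is dropped, so the original
owner releasing its reference does not leave the cache with a dangling
pointer.
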
>>
>> Thanks,
>> Niels
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxxx
>> http://www.gluster.org/mailman/listinfo/gluster-devel