Re: NFS timeouts?

Yannick Perret <yannick.perret@xxxxxxxxxxxxx> · Thu, 1 Dec 2016 13:34:43 +0100



    On 01/12/2016 at 13:12, Yannick Perret wrote:

    
    Hello,
      

      I have a client machine that mounts a replicate x2 volume over NFS.
      In practice, this is configured with automount as follows:
      

      DIR-NAME -rw,soft,intr server1,server2:/VOLUME
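The map entry above follows the autofs indirect-map layout. It has a key, mount options prefixed with "-", and a location. A comma-separated host list before the colon marks replicated servers, and autofs mounts from only one of them. Below is a minimal parsing sketch that shows how the line breaks down. The helper `parse_autofs_entry` is hypothetical and deliberately ignores multi-mount entries, host weights and other autofs syntax:

```python
def parse_autofs_entry(line: str) -> dict:
    """Split a simple indirect-map entry: 'key -opt1,opt2 host1,host2:/path'.

    Sketch only: multi-mount entries, weighted hosts such as host(5) and
    other autofs map features are not handled.
    """
    key, opts, location = line.split()  # exactly three whitespace-separated fields
    if not opts.startswith("-"):
        raise ValueError("expected mount options starting with '-'")
    hosts, _, path = location.partition(":")  # hosts before ':', export path after
    return {
        "key": key,
        "options": opts[1:].split(","),
        "hosts": hosts.split(","),  # replicated servers; one is chosen at mount time
        "path": path,
    }


entry = parse_autofs_entry("DIR-NAME -rw,soft,intr server1,server2:/VOLUME")
print(entry)
# {'key': 'DIR-NAME', 'options': ['rw', 'soft', 'intr'], 'hosts': ['server1', 'server2'], 'path': '/VOLUME'}
```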
      

      The Gluster servers run version 3.6.7.
      

      Sometimes the NFS mount hangs on the client with
      

      server server2 not responding, timed out (here it was connected
      to server2)
      

      but network communication between the two machines is fine (they
      are connected to the same switch, I can ssh into each, they ping
      each other…).
      

      I can also see a few "xs_tcp_setup_socket: connect returned
      unhandled error -107" messages on the client.
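Linux kernel functions report errors as negated errno values. On Linux, 107 is ENOTCONN ("Transport endpoint is not connected"). The sketch below decodes such values with Python's standard `errno` and `os` modules. The helper name `decode_kernel_errno` is hypothetical. Errno numbering differs between platforms, so the result only matches the kernel message when run on Linux, the platform that logged `xs_tcp_setup_socket`:

```python
import errno
import os


def decode_kernel_errno(code: int) -> tuple[str, str]:
    """Map a kernel return value such as -107 to (symbolic name, message).

    errno numbers are platform-specific, so this is only meaningful when run
    on the same OS as the kernel that logged the value (Linux here).
    """
    n = abs(code)  # kernel functions return negated errno values
    name = errno.errorcode.get(n, f"unknown errno {n}")
    return name, os.strerror(n)


print(decode_kernel_errno(-107))
# on Linux: ('ENOTCONN', 'Transport endpoint is not connected')
```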
      

      On 'server2' side I can see in the gluster nfs logs:
      

      [2016-12-01 10:50:15.887927] W [rpcsvc.c:261:rpcsvc_program_actor]
      0-rpc-service: RPC program version not available (req 100003 2)
      

      [2016-12-01 10:50:15.887965] E
      [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
      failed to complete successfully
      

      [2016-12-01 10:50:15.901880] W [rpcsvc.c:261:rpcsvc_program_actor]
      0-rpc-service: RPC program version not available (req 100003 4)
      

      [2016-12-01 10:50:15.901900] E
      [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
      failed to complete successfully
      

      [2016-12-01 10:51:03.777145] W [rpcsvc.c:261:rpcsvc_program_actor]
      0-rpc-service: RPC program version not available (req 100003 2)
      

      [2016-12-01 10:51:03.777191] E
      [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
      failed to complete successfully
      

      [2016-12-01 10:51:03.790561] W [rpcsvc.c:261:rpcsvc_program_actor]
      0-rpc-service: RPC program version not available (req 100003 4)
      

      [2016-12-01 10:51:03.790580] E
      [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor
      failed to complete successfully
      

    These seem to correspond to the NFS reconnection (the client
    trying NFSv2 and NFSv4, I think).
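The warnings appear to have the form "req <RPC program> <requested version>". ONC RPC program 100003 is NFS, so the server refused NFS versions 2 and 4. As far as I know, Gluster's built-in NFS server implements only NFSv3, which would explain why those probes are rejected. The sketch below extracts these pairs from a log. The helper `rejected_requests` is hypothetical, and the program-number table is only a small subset of well-known values:

```python
import re

# Gluster's rpcsvc warning looks like:
#   "RPC program version not available (req 100003 2)"
# where the two numbers appear to be the RPC program and the requested version.
# \s+ tolerates log lines that were wrapped across lines.
REQ_RE = re.compile(r"RPC program version not available\s+\(req\s+(\d+)\s+(\d+)\)")

# A few well-known ONC RPC program numbers (not an exhaustive list).
RPC_PROGRAMS = {100003: "NFS", 100005: "MOUNT", 100021: "NLM"}


def rejected_requests(log_text: str) -> list[tuple[str, int]]:
    """Return (program name, requested version) for each refused RPC call."""
    found = []
    for m in REQ_RE.finditer(log_text):
        prog, vers = int(m.group(1)), int(m.group(2))
        found.append((RPC_PROGRAMS.get(prog, str(prog)), vers))
    return found


sample = """
[2016-12-01 10:50:15.887927] W [rpcsvc.c:261:rpcsvc_program_actor] 0-rpc-service: RPC program version not available (req 100003 2)
[2016-12-01 10:50:15.901880] W [rpcsvc.c:261:rpcsvc_program_actor] 0-rpc-service: RPC program version not available (req 100003 4)
"""
print(rejected_requests(sample))  # [('NFS', 2), ('NFS', 4)]
```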

    
    Here are the log entries just before that:

    l_layout_new_directory] 0-HOME-LIRIS-dht: assigning range size
    0xffe76e40 to HOME-LIRIS-replicate-0

    [2016-12-01 10:48:36.990028] W
    [client-rpc-fops.c:2145:client3_3_setattr_cbk]
    0-HOME-LIRIS-client-1: remote operation failed: Opération non permise
    (Operation not permitted)

    [2016-12-01 10:48:36.990303] W
    [client-rpc-fops.c:2145:client3_3_setattr_cbk]
    0-HOME-LIRIS-client-0: remote operation failed: Opération non permise
    (Operation not permitted)

    The message "I [MSGID: 109036]
    [dht-common.c:6296:dht_log_new_layout_for_dir_selfheal]
    0-HOME-LIRIS-dht: Setting layout of
    <gfid:6f8bb427-eea5-4dd5-b004-9db8582bdda2>/_indexer.lock with
    [Subvol_name: HOME-LIRIS-replicate-0, Err: -1 , Start: 0 , Stop:
    4294967295 ], " repeated 2 times between [2016-12-01
    10:48:36.404738] and [2016-12-01 10:48:36.949907]

    [2016-12-01 10:48:36.990728] I [MSGID: 109036]
    [dht-common.c:6296:dht_log_new_layout_for_dir_selfheal]
    0-HOME-LIRIS-dht: Setting layout of
    <gfid:6f8bb427-eea5-4dd5-b004-9db8582bdda2>/39132555496bb098708af2d5e7b56d67
    with [Subvol_name: HOME-LIRIS-replicate-0, Err: -1 , Start: 0 ,
    Stop: 4294967295 ], 

    [2016-12-01 10:50:10.360020] I [dht-rename.c:1344:dht_rename]
    0-HOME-LIRIS-dht: renaming
    <gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/tmp_km1NUe
    (hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0) =>
    <gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/general.php
    (hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0)

    [2016-12-01 10:50:10.423561] I [dht-rename.c:1344:dht_rename]
    0-HOME-LIRIS-dht: renaming
    <gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/tmp_2pOZ5T
    (hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0) =>
    <gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/1.php
    (hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0)

    [2016-12-01 10:50:10.485882] I [dht-rename.c:1344:dht_rename]
    0-HOME-LIRIS-dht: renaming
    <gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/tmp_86Lmpz
    (hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0) =>
    <gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/general.php
    (hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0)
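To check how closely these entries precede the reconnection, the log can be filtered by timestamp. The sketch below assumes Gluster's "[YYYY-MM-DD HH:MM:SS.ffffff]" line prefix. The helper names `parse_ts` and `entries_before` and the 10-second window are illustrative choices, and the sample lines are abridged from the logs above:

```python
import re
from datetime import datetime, timedelta
from typing import List, Optional

# Gluster log lines start with a timestamp like "[2016-12-01 10:50:10.360020]".
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\]")


def parse_ts(line: str) -> Optional[datetime]:
    """Return the leading timestamp of a log line, or None if there is none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")


def entries_before(lines: List[str], event: datetime, window_s: float) -> List[str]:
    """Return lines stamped within the window_s seconds strictly before event."""
    start = event - timedelta(seconds=window_s)
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and start <= ts < event:
            hits.append(line)
    return hits


# Abridged lines from the server2 logs quoted above.
log = [
    "[2016-12-01 10:48:36.990028] W [client-rpc-fops.c:2145:client3_3_setattr_cbk] ...",
    "[2016-12-01 10:50:10.360020] I [dht-rename.c:1344:dht_rename] ...",
    "[2016-12-01 10:50:10.485882] I [dht-rename.c:1344:dht_rename] ...",
    "[2016-12-01 10:50:15.887927] W [rpcsvc.c:261:rpcsvc_program_actor] ...",
]
probe = parse_ts(log[-1])  # first rejected NFS version probe
for line in entries_before(log, probe, 10):
    print(line)  # prints the two dht_rename lines
```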

    
    I also tried setting "nfs.mount-rmtab /dev/shm/glusterfs.rmtab", as
    suggested in an old thread. I will check whether it changes anything.

    
    Regards,

    --

    Y.

    
    at a time that corresponds to the NFS timeouts.
      

      This problem occurs "often" (at least once every day or two),
      and neither the client nor the servers are under heavy load (memory
      and CPU are far from full).
      

      Any idea what the cause could be and how to prevent it from
      happening?
      

      I reduced the autofs timeout to limit the impact, but it is not a
      very nice solution… Note: I can't use the native GlusterFS client
      instead of NFS because of the memory leaks that still exist in it.
      

      Thanks.
      

      Regards,
      

      --
      

      Y.
      

      _______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
    
    