Can you try setting "transport.address-family: inet" in /etc/glusterfs/glusterd.vol on all nodes?

About the rpms: if they are not yet built, the only other option is to build them from source.

I assume that the second try is on a fresh set of systems, without any remnants of an old Gluster install.

Best Regards,
Strahil Nikolov

On Friday, July 23, 2021, 07:55:01 GMT+3, Artem Russakovskii <archon810@xxxxxxxxx> wrote:

Hi Strahil,

I am using the repo builds from https://download.opensuse.org/repositories/filesystems/openSUSE_Leap_15.2/x86_64/ (currently glusterfs-9.1-lp152.88.2.x86_64.rpm) and don't build them myself. Perhaps the builds at https://download.opensuse.org/repositories/home:/glusterfs:/Leap15.2-9/openSUSE_Leap_15.2/x86_64/ are better (currently glusterfs-9.1-lp152.112.1.x86_64.rpm); does anyone know? Neither repo currently has 9.3.

Regardless, I don't need gluster to use IPv6 if IPv4 works fine. Is there a way to make it stop trying IPv6 and use IPv4 only?

Sincerely,
Artem

--
Founder, Android Police, APK Mirror, Illogical Robot LLC
beerpla.net | @ArtemR

On Thu, Jul 22, 2021 at 9:09 PM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
> Did you try with the latest 9.x? Based on the release notes, that should be 9.3.
>
> Best Regards,
> Strahil Nikolov
>
>> On Fri, Jul 23, 2021 at 3:06, Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> I just filed this ticket, https://github.com/gluster/glusterfs/issues/2648, and wanted to bring it to your attention. Any feedback would be appreciated.
>>
>> Description of problem:
>> We have a 4-node replicate cluster running gluster 7.9. I'm currently setting up a new cluster on a new set of machines and went straight for gluster 9.1.
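A note for reading the glusterd log errors that follow: the failing getaddrinfo calls report {family=10}, which on Linux is AF_INET6, so glusterd is attempting IPv6-only name resolution. An illustrative check in Python (the hostname nexus2 is taken from the logs below and only resolves inside that cluster):

```python
import socket

# {family=10} in the glusterd logs is AF_INET6 on Linux, so the
# failing lookups are IPv6 (AAAA) lookups.
assert socket.AF_INET6 == 10  # Linux-specific value; differs on other OSes
assert socket.AF_INET == 2

# A lookup pinned to AF_INET6 raises gaierror ("Name or service not
# known") when the name cannot be resolved to an IPv6 address; this is
# the same error string glusterd logs for its peers.
try:
    socket.getaddrinfo("nexus2", 24007, socket.AF_INET6)
except socket.gaierror as exc:
    print("resolution failed:", exc)
```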
>> However, I was unable to probe any servers due to this error:
>>
>> [2021-07-17 00:31:05.228609 +0000] I [MSGID: 106487] [glusterd-handler.c:1160:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req nexus2 24007
>> [2021-07-17 00:31:05.229727 +0000] E [MSGID: 101075] [common-utils.c:3657:gf_is_local_addr] 0-management: error in getaddrinfo [{ret=Name or service not known}]
>> [2021-07-17 00:31:05.230785 +0000] E [MSGID: 106408] [glusterd-peer-utils.c:217:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: Name or service not known [Unknown error -2]
>> [2021-07-17 00:31:05.353971 +0000] I [MSGID: 106128] [glusterd-handler.c:3719:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: nexus2 (24007)
>> [2021-07-17 00:31:05.375871 +0000] W [MSGID: 106061] [glusterd-handler.c:3488:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
>> [2021-07-17 00:31:05.375903 +0000] I [rpc-clnt.c:1010:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>> [2021-07-17 00:31:05.377021 +0000] E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]
>> [2021-07-17 00:31:05.377043 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
>> [2021-07-17 00:31:05.377147 +0000] I [MSGID: 106498] [glusterd-handler.c:3648:glusterd_friend_add] 0-management: connect returned 0
>> [2021-07-17 00:31:05.377201 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <nexus2> (<00000000-0000-0000-0000-000000000000>), in state <Establishing Connection>, has disconnected from glusterd.
>> [2021-07-17 00:31:05.377453 +0000] E [MSGID: 101032] [store.c:464:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
>>
>> I then wiped the /var/lib/glusterd dir to start clean, downgraded to 7.9, and attempted to peer probe again. This time it worked fine, proving that 7.9 works, just as it does in prod.
>> At that point, I created a volume, started it, and played around with testing to my satisfaction. Then I decided to see what would happen if I upgraded this working volume from 7.9 to 9.1.
>> The end result is:
>> * gluster volume status shows only the local gluster node and none of the remote nodes
>> * data does seem to replicate, so the connection between the servers is actually established
>> * the logs are now filled with constantly repeating messages like these:
>>
>> [2021-07-22 23:29:31.039004 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
>> [2021-07-22 23:29:31.039212 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host citadel
>> [2021-07-22 23:29:31.039304 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host hive
>> The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]" repeated 119 times between [2021-07-22 23:27:34.025983 +0000] and [2021-07-22 23:29:31.039302 +0000]
>> [2021-07-22 23:29:34.039369 +0000] E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]
>> [2021-07-22 23:29:34.039441 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
>> [2021-07-22 23:29:34.039558 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host citadel
>> [2021-07-22 23:29:34.039659 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host hive
>> [2021-07-22 23:29:37.039741 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
>> [2021-07-22 23:29:37.039921 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host citadel
>> [2021-07-22 23:29:37.040015 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host hive
>>
>> When I issue a command in the CLI:
>>
>> ==> cli.log <==
>> [2021-07-22 23:38:11.802596 +0000] I [cli.c:840:main] 0-cli: Started running gluster with version 9.1
>> **[2021-07-22 23:38:11.804007 +0000] W [socket.c:3434:socket_connect] 0-glusterfs: Error disabling sockopt IPV6_V6ONLY: "Operation not supported"**
>> [2021-07-22 23:38:11.906865 +0000] I [MSGID: 101190] [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
>>
>> **Mandatory info:**
>> **- The output of the `gluster volume info` command**:
>>
>> Volume Name: ap
>> Type: Replicate
>> Volume ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 4 = 4
>> Transport-type: tcp
>> Bricks:
>> Brick1: nexus2:/mnt/nexus2_block1/ap
>> Brick2: forge:/mnt/forge_block1/ap
>> Brick3: hive:/mnt/hive_block1/ap
>> Brick4: citadel:/mnt/citadel_block1/ap
>> Options Reconfigured:
>> performance.client-io-threads: on
>> nfs.disable: on
>> storage.fips-mode-rchecksum: on
>> transport.address-family: inet
>> cluster.self-heal-daemon: enable
>> client.event-threads: 4
>> cluster.data-self-heal-algorithm: full
>> cluster.lookup-optimize: on
>> cluster.quorum-count: 1
>> cluster.quorum-type: fixed
>> cluster.readdir-optimize: on
>> cluster.heal-timeout: 1800
>> disperse.eager-lock: on
>> features.cache-invalidation: on
>> features.cache-invalidation-timeout: 600
>> network.inode-lru-limit: 500000
>> network.ping-timeout: 7
>> network.remote-dio: enable
>> performance.cache-invalidation: on
>> performance.cache-size: 1GB
>> performance.io-thread-count: 4
>> performance.md-cache-timeout: 600
>> performance.rda-cache-limit: 256MB
>> performance.read-ahead: off
>> performance.readdir-ahead: on
>> performance.stat-prefetch: on
>> performance.write-behind-window-size: 32MB
>> server.event-threads: 4
>> cluster.background-self-heal-count: 1
>> performance.cache-refresh-timeout: 10
>> features.ctime: off
>> cluster.granular-entry-heal: enable
>>
>> - The output of the gluster volume status command:
>>
>> gluster volume status
>> Status of volume: ap
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick forge:/mnt/forge_block1/ap            49152     0          Y       2622
>> Self-heal Daemon on localhost               N/A       N/A        N       N/A
>>
>> Task Status of Volume ap
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>> - The output of the gluster volume heal command:
>>
>> gluster volume heal ap enable
>> Enable heal on volume ap has been successful
>>
>> gluster volume heal ap
>> Launching heal operation to perform index self heal on volume ap has been unsuccessful:
>> Self-heal daemon is not running. Check self-heal daemon log file.
>>
>> - The operating system / glusterfs version:
>> openSUSE 15.2, glusterfs 9.1.
>>
>> Sincerely,
>> Artem
>>
>> --
>> Founder, Android Police, APK Mirror, Illogical Robot LLC
>> beerpla.net | @ArtemR
>>
>> ________
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://meet.google.com/cpu-eiue-hvk
>> Gluster-users mailing list
>> Gluster-users@xxxxxxxxxxx
>> https://lists.gluster.org/mailman/listinfo/gluster-users
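For reference, the setting suggested at the top of the thread goes into /etc/glusterfs/glusterd.vol as an `option` line (the .vol file uses `option <name> <value>` syntax rather than the `name: value` syntax of volume options). A sketch of the management stanza; the surrounding options here are typical defaults and may differ between distros and versions:

```
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    # Force IPv4 for management connections; this is the line to add
    # on every node. The other options above are assumed defaults.
    option transport.address-family inet
end-volume
```

After editing the file on all nodes, restart glusterd (e.g. `systemctl restart glusterd`) for the change to take effect.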