Update on georep failure

Hi,

So we finally found the cause of the georep failure, after several days
of work by Deepshika and me.

Short story:
============

side effect of adding libtirpc-devel on EL 7:
https://github.com/gluster/project-infrastructure/issues/115

Long story:
===========

We were first puzzled about why it was failing only on some builders and
not others, especially since it was working fine on softserve VMs.

We went through the usual suspects: we rebooted, reinstalled, and searched
for anything weird (too many ssh keys, not enough inodes, some
hardware issue), but found nothing obvious.

After trying to find my way through the log files and following a few
weird leads (like: why was gsyncd running gcc? Answer: ctypes), I was
left with a rather cryptic message:

[2021-02-02 15:19:00.040817 +0000] I [socket.c:929:__socket_server_bind] 0-socket.gfchangelog: closing (AF_UNIX) reuse check socket 18
[2021-02-02 15:19:02.041641 +0000] W [xdr-rpcclnt.c:68:rpc_request_to_xdr] 0-rpc: failed to encode call msg
[2021-02-02 15:19:02.041673 +0000] E [rpc-clnt.c:1507:rpc_clnt_record_build_record] 0-gfchangelog: Failed to build record header
[2021-02-02 15:19:02.041683 +0000] W [rpc-clnt.c:1664:rpc_clnt_submit] 0-gfchangelog: cannot build rpc-record
[2021-02-02 15:19:02.041692 +0000] E [MSGID: 132023] [gf-changelog.c:285:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
[2021-02-02 15:19:02.041809 +0000] E [MSGID: 132022] [gf-changelog.c:583:gf_changelog_register_generic] 0-gfchangelog: Error registering with changelog xlator

Given that everything in gluster is built around RPC, it seemed unlikely
that RPC itself was broken, but those were the only messages we had.


We also found that the only builder that was working was builder 210.
Upon looking, we found that 210 failed to be updated with ansible, due
to some debugging we forgot to revert, which made this task fail: 
https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/gluster_qa_scripts/tasks/main.yml#L7

But it wasn't clear how that would change anything, since the only diff
was a "set -e" that wasn't removed.

Then Deepshika started to test more than georep, and she noticed that a
lot of other tests were failing, with the exact same message about
RPC.

And she started to wonder if anything was recently changed. And indeed:

# rpm -qa --last | head -n 15
yum-plugin-auto-update-debug-info-1.1.31-54.el7_8.noarch  mar. 02 févr. 2021 14:04:59 UTC
python3-debuginfo-3.6.8-18.el7.x86_64                     mar. 02 févr. 2021 14:04:58 UTC
glibc-debuginfo-2.17-317.el7.x86_64                       mar. 02 févr. 2021 14:04:57 UTC
glibc-debuginfo-common-2.17-317.el7.x86_64                mar. 02 févr. 2021 14:04:53 UTC
gpg-pubkey-b6792c39-53c4fbdd                              mar. 02 févr. 2021 14:04:34 UTC
tzdata-java-2021a-1.el7.noarch                            mer. 27 janv. 2021 09:09:27 UTC
tzdata-2021a-1.el7.noarch                                 mer. 27 janv. 2021 09:09:26 UTC
sudo-1.8.23-10.el7_9.1.x86_64                             mer. 27 janv. 2021 09:09:26 UTC
libtirpc-devel-0.2.4-0.16.el7.x86_64                      mar. 26 janv. 2021 12:53:45 UTC
java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64           mar. 26 janv. 2021 05:06:44 UTC

We added libtirpc-devel on the 26/01.
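
For anyone wanting to repeat that check without reading localized dates,
rpm can print install times as epoch seconds. This is only a minimal
sketch: it assumes the INSTALLTIME and NVRA query tags of rpm on EL 7, and
the function name installed_since is made up for illustration.

```shell
# installed_since EPOCH
# Reads "epoch package" lines on stdin and prints those whose install time
# is at or after EPOCH, newest first.
installed_since() {
    awk -v since="$1" '($1 + 0) >= (since + 0)' | sort -rn
}

# Hypothetical usage on a builder (needs rpm and GNU date):
#   rpm -qa --qf '%{INSTALLTIME} %{NVRA}\n' \
#       | installed_since "$(date -d '2021-01-25' +%s)"
```

Filtering on the epoch value avoids depending on the builder's locale,
which is what made the output above come out in French.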

libtirpc-devel would, as the name implies, change something around the
RPC subsystem.

That was around last week, which is when we started to notice the problem.

It was not applied to 210, because the ansible run on 210 failed before
it got to that point (ansible stops as soon as the git update fails, and
the jenkins builder role comes after the gluster-qa-script update).

It was not applied to the softserve-provided VMs either, so tests were
working fine there.
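
In hindsight, diffing the package lists of a working and a failing
builder would have pointed at the extra package directly. A rough sketch,
assuming each list was saved to a file beforehand; the function name and
the file names in the usage comment are hypothetical.

```shell
# only_in_second A B
# Prints the lines that appear in file B but not in file A, e.g. packages
# installed on a failing builder but absent from a working one.
only_in_second() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    # comm needs both inputs sorted with the same collation.
    LC_ALL=C sort -u "$1" > "$a_sorted"
    LC_ALL=C sort -u "$2" > "$b_sorted"
    LC_ALL=C comm -13 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}

# Hypothetical usage, with one "rpm -qa" dump per builder:
#   only_in_second working-builder.txt failing-builder.txt
```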

And indeed, once the package got removed, the tests were working again.

Follow up
=========

So, I would like to know exactly what should be tested. Is gluster not
compatible with libtirpc on C7 (it works on C8), or is there some
weird issue? (From what I remember, the RPC format is supposed to be
compatible and covered by a specification.)

Should we test on C8 only?


-- 
Michael Scherer / He/Il/Er/Él
Sysadmin, Community Infrastructure


