On 31 July 2018 at 22:11, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
I just went through the nightly regression report of brick mux runs and here's what I can summarize.
===========================================================================
Fails only with brick-mux
===========================================================================
tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after 400 secs. Refer https://fstat.gluster.org/failure/209?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all , specifically the latest report https://build.gluster.org/job/regression-test-burn-in/4051/consoleText . Wasn't timing out as frequently as it was till 12 July, but since 27 July it has timed out twice. Beginning to believe commit 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400 secs isn't sufficient (Mohit?)
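If the 400-sec cap really is the culprit, one option is to raise the per-test timeout in the .t itself. A minimal sketch, assuming run-tests.sh still picks the value up from a SCRIPT_TIMEOUT= line near the top of the test (the 800 is just an example figure, not a recommendation):

    # tests/bugs/core/bug-1432542-mpx-restart-crash.t
    # assumption: run-tests.sh greps this variable to override the default per-test timeout
    SCRIPT_TIMEOUT=800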
One of the failed regression-test-burn-in runs was an actual failure, not a timeout.
The brick disconnects from glusterd:
[2018-07-27 16:28:42.882668] I [MSGID: 106005] [glusterd-handler.c:6129:__glusterd_brick_rpc_notify] 0-management: Brick builder103.cloud.gluster.org:/d/backends/vol01/brick0 has disconnected from glusterd.
[2018-07-27 16:28:42.891031] I [MSGID: 106143] [glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick /d/backends/vol01/brick0 on port 49152
[2018-07-27 16:28:42.892379] I [MSGID: 106143] [glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152
[2018-07-27 16:29:02.636027]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS --attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org --volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++++++++++
So the client cannot connect to the bricks after this as it never gets the port info from glusterd. From mnt-glusterfs-vol20.log:
[2018-07-27 16:29:02.769947] I [MSGID: 114020] [client.c:2329:notify] 0-patchy-vol20-client-1: parent translators are ready, attempting connect on transport
[2018-07-27 16:29:02.770677] E [MSGID: 114058] [client-handshake.c:1518:client_query_portmap_cbk] 0-patchy-vol20-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2018-07-27 16:29:02.770767] I [MSGID: 114018] [client.c:2255:client_rpc_notify] 0-patchy-vol20-client-0: disconnected from patchy-vol20-client-0. Client process will keep trying to connect to glusterd until brick's port is available
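(As the portmap error above suggests, the standard check here is 'gluster volume status'. Roughly what I would run on the builder to confirm whether glusterd still has a port registered for the brick -- volume name taken from the logs above:)

    # run on builder103 after the failure; "N/A" in the Port column for the brick
    # would confirm that glusterd has lost its pmap entry
    gluster volume status patchy-vol20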
From the brick logs:
[2018-07-27 16:28:34.729241] I [login.c:111:gf_auth] 0-auth/login: allowed user names: 2b65c380-392e-459f-b722-c130aac29377
[2018-07-27 16:28:34.945474] I [MSGID: 115029] [server-handshake.c:786:server_setvolume] 0-patchy-vol01-server: accepted client from CTX_ID:72dcd65e-2125-4a79-8331-48c0fe9abce7-GRAPH_ID:0-PID:8483-HOST:builder103.cloud.gluster.org-PC_NAME:patchy-vol06-client-2-RECON_NO:-0 (version: 4.2dev)
[2018-07-27 16:28:35.946588] I [MSGID: 101016] [glusterfs3.h:739:dict_to_xdr] 0-dict: key 'glusterfs.xattrop_index_gfid' is would not be sent on wire in future [Invalid argument] <--- Last Brick Log. It looks like the brick went down at this point.
[2018-07-27 16:29:02.636027]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS --attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org --volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++++++++++
[2018-07-27 16:29:12.021827]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 83 dd if=/dev/zero of=/mnt/glusterfs/vol20/a_file bs=4k count=1 ++++++++++
[2018-07-27 16:29:12.039248]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 87 killall -9 glusterd ++++++++++
[2018-07-27 16:29:17.073995]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 89 killall -9 glusterfsd ++++++++++
[2018-07-27 16:29:22.096385]:++++++++++ G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 95 glusterd ++++++++++
[2018-07-27 16:29:24.481555] I [MSGID: 100030] [glusterfsd.c:2728:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 4.2dev (args: /build/install/sbin/glusterfsd -s builder103.cloud.gluster.org --volfile-id patchy-vol01.builder103.cloud.gluster.org.d-backends-vol01-brick0 -p /var/run/gluster/vols/patchy-vol01/builder103.cloud.gluster.org-d-backends-vol01-brick0.pid -S /var/run/gluster/f4d6c8f7c3f85b18.socket --brick-name /d/backends/vol01/brick0 -l /var/log/glusterfs/bricks/d-backends-vol01-brick0.log --xlator-option *-posix.glusterd-uuid=0db25f79-8880-4f2d-b1e8-584e751ff0b9 --process-name brick --brick-port 49153 --xlator-option patchy-vol01-server.listen-port=49153)
From /var/log/messages:
Jul 27 16:28:42 builder103 kernel: [ 2902] 0 2902 3777638 200036 2322 0 0 glusterfsd
...
Jul 27 16:28:42 builder103 kernel: Out of memory: Kill process 2902 (glusterfsd) score 418 or sacrifice child
Jul 27 16:28:42 builder103 kernel: Killed process 2902 (glusterfsd) total-vm:15110552kB, anon-rss:800144kB, file-rss:0kB, shmem-rss:0kB
Jul 27 16:30:01 builder103 systemd: Created slice User Slice of root.
Possible OOM kill?
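To confirm, these standard kernel-log greps on the builder should show any OOM activity around the failure window (nothing gluster-specific here):

    # look for the OOM killer in the syslog and the kernel ring buffer
    grep -iE 'out of memory|oom-killer' /var/log/messages
    dmesg -T | grep -iE 'out of memory|killed process'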
Regards,
Nithya
tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (Ref - https://build.gluster.org/job/regression-test-with-multiplex/814/console ) - Test fails only in brick-mux mode, AI on Atin to look at and get back.
tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (https://build.gluster.org/job/regression-test-with-multiplex/813/console ) - Seems to have failed just twice in the last 30 days as per https://fstat.gluster.org/failure/251?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all . Need help from AFR team.
tests/bugs/quota/bug-1293601.t (https://build.gluster.org/job/regression-test-with-multiplex/812/console ) - Hasn't failed after 26 July and earlier it was failing regularly. Did we fix this test through any patch (Mohit?)
tests/bitrot/bug-1373520.t - (https://build.gluster.org/job/regression-test-with-multiplex/811/console ) - Hasn't failed after 27 July and earlier it was failing regularly. Did we fix this test through any patch (Mohit?)
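(One quick way to answer the "did we fix this through any patch" question, at least for changes that touched the tests themselves - core fixes elsewhere in the tree won't show up here:)

    # commits since late July touching the two tests that stopped failing
    git log --oneline --since=2018-07-20 -- \
        tests/bugs/quota/bug-1293601.t tests/bitrot/bug-1373520.t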
tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core; not sure whether brick mux is the culprit here. Ref - https://build.gluster.org/job/regression-test-with-multiplex/806/console . Seems to be a glustershd crash. Need help from AFR folks.
===========================================================================
Fails for non-brick mux case too
===========================================================================
tests/bugs/distribute/bug-1122443.t - Seems to be failing at my setup very often, without brick mux as well. Refer https://build.gluster.org/job/regression-test-burn-in/4050/consoleText . There's an email in gluster-devel and a BZ 1610240 for the same.
tests/bugs/bug-1368312.t - Seems to be a new failure (https://build.gluster.org/job/regression-test-with-multiplex/815/console ); however, it has also been seen for a non-brick-mux case - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText . Need some eyes from AFR folks.
tests/00-geo-rep/georep-basic-dr-tarssh.t - This isn't specific to brick mux; have seen this failing at multiple default regression runs. Refer https://fstat.gluster.org/failure/392?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all . We need help from the geo-rep devs to root cause this sooner rather than later.
tests/00-geo-rep/georep-basic-dr-rsync.t - This isn't specific to brick mux; have seen this failing at multiple default regression runs. Refer https://fstat.gluster.org/failure/393?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all . We need help from the geo-rep devs to root cause this sooner rather than later.
tests/bugs/glusterd/validating-server-quorum.t (https://build.gluster.org/job/regression-test-with-multiplex/810/console ) - Fails for non-brick-mux cases too: https://fstat.gluster.org/failure/580?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all . Atin has a patch https://review.gluster.org/20584 which resolves it, but the patch is failing regression for a different, unrelated test.
tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t (Ref - https://build.gluster.org/job/regression-test-with-multiplex/809/console ) - Fails for non-brick-mux case too - https://build.gluster.org/job/regression-test-burn-in/4049/consoleText . Need some eyes from AFR folks.
_______________________________________________
maintainers mailing list
maintainers@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/maintainers
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel