Re: test failure reports for last 15 days

While analysing the logs of the runs where uss.t failed, I made the following observations.

1) In the first iteration of uss.t, the time difference between the first test and the last test of the .t file is within 1 minute.

But I think it is the cleanup sequence that is taking more time. One reason I suspect this is that we don't see the brick process shutdown message in the logs.
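To double check this, the timestamps of the first and last G_LOG entries for uss.t in cmd_history.log can be compared, and the brick logs can be searched for the shutdown message. A rough sketch (the log paths are assumptions based on the excerpts further down in this mail):

# Compare timestamps of the first and last uss.t test in cmd_history.log,
# and look for the brick shutdown ("received signum ... shutting down") message.
CMD_HISTORY=/var/log/glusterfs/cmd_history.log
BRICK_LOGS=/var/log/glusterfs/bricks

# timestamp of the first and last logged uss.t test in this run
grep 'G_LOG:./tests/basic/uss.t' "$CMD_HISTORY" | head -n 1 | grep -o '^\[[^]]*\]'
grep 'G_LOG:./tests/basic/uss.t' "$CMD_HISTORY" | tail -n 1 | grep -o '^\[[^]]*\]'

# if cleanup ran to completion we would expect the bricks to have logged a shutdown
grep -l 'received signum' "$BRICK_LOGS"/*.log \
    || echo "no brick shutdown message found -- cleanup may not have completed"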


2) In the 2nd iteration of uss.t (run because the 1st iteration failed due to timeout), the test fails because something was not completed in the cleanup sequence of the previous iteration.

The volume start command itself fails in the 2nd iteration, and because of that the remaining tests also fail.

This is from cmd_history.log:

uster.org:/d/backends/2/patchy_snap_mnt builder202.int.aws.gluster.org:/d/backends/3/patchy_snap_mnt ++++++++++
[2019-04-10 19:54:09.145086]  : volume create patchy builder202.int.aws.gluster.org:/d/backends/1/patchy_snap_mnt builder202.int.aws.gluster.org:/d/backends/2/patchy_snap_mnt builder202.int.aws.gluster.org:/d/backends/3/patchy_snap_mnt : SUCCESS
[2019-04-10 19:54:09.156221]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable false ++++++++++
[2019-04-10 19:54:09.265138]  : volume set patchy nfs.disable false : SUCCESS
[2019-04-10 19:54:09.274386]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 42 gluster --mode=script --wignore volume start patchy ++++++++++
[2019-04-10 19:54:09.565086]  : volume start patchy : FAILED : Commit failed on localhost. Please check log file for details.
[2019-04-10 19:54:09.572753]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 44 _GFS --attribute-timeout=0 --entry-timeout=0 --volfile-server=builder202.int.aws.gluster.org --volfile-id=patchy /mnt/glusterfs/0 ++++++++++


And this is from the brick log, showing an issue with the contents of the export directory not being present (the .glusterfs directory is missing).

[2019-04-10 19:54:09.544476] I [MSGID: 100030] [glusterfsd.c:2857:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 7dev (args: /build/install/sbin/glusterfsd -s builder202.int.aws.gluster.org --volfile-id patchy.builder202.int.aws.gluster.org.d-backends-1-patchy_snap_mnt -p /var/run/gluster/vols/patchy/builder202.int.aws.gluster.org-d-backends-1-patchy_snap_mnt.pid -S /var/run/gluster/7ac65190b72da80a.socket --brick-name /d/backends/1/patchy_snap_mnt -l /var/log/glusterfs/bricks/d-backends-1-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=695c060d-74d3-440e-8cdb-327ec297f2d2 --process-name brick --brick-port 49152 --xlator-option patchy-server.listen-port=49152)
[2019-04-10 19:54:09.549394] I [socket.c:962:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2019-04-10 19:54:09.553190] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-04-10 19:54:09.553209] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0
[2019-04-10 19:54:09.556932] I [rpcsvc.c:2694:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2019-04-10 19:54:09.557859] E [MSGID: 138001] [index.c:2392:init] 0-patchy-index: Failed to find parent dir (/d/backends/1/patchy_snap_mnt/.glusterfs) of index basepath /d/backends/1/patchy_snap_mnt/.glusterfs/indices. [No such file or directory]        ============================> (.glusterfs is absent)
[2019-04-10 19:54:09.557884] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-patchy-index: Initialization of volume 'patchy-index' failed, review your volfile again
[2019-04-10 19:54:09.557892] E [MSGID: 101066] [graph.c:409:glusterfs_graph_init] 0-patchy-index: initializing translator failed
[2019-04-10 19:54:09.557900] E [MSGID: 101176] [graph.c:772:glusterfs_graph_activate] 0-graph: init failed
[2019-04-10 19:54:09.564154] I [io-stats.c:4033:fini] 0-patchy-io-stats: io-stats translator unloaded
[2019-04-10 19:54:09.564748] W [glusterfsd.c:1592:cleanup_and_exit] (-->/build/install/sbin/glusterfsd(mgmt_getspec_cbk+0x806) [0x411f32] -->/build/install/sbin/glusterfsd(glusterfs_process_volfp+0x272) [0x40b9b9] -->/build/install/sbin/glusterfsd(cleanup_and_exit+0x88) [0x4093a5] ) 0-: received signum (-1), shutting down
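
For reference, the log above shows the index xlator building its basepath as <brick>/.glusterfs/indices, so brick init fails like this whenever the .glusterfs directory under the export path is gone. A quick manual check on the builder could look like this (brick path taken from the log above):

# Does the export directory still have its .glusterfs metadata directory?
brick=/d/backends/1/patchy_snap_mnt
if [ -d "$brick/.glusterfs" ]; then
    echo "$brick/.glusterfs present"
else
    echo "$brick/.glusterfs absent -- index xlator init would fail as in the log above"
fi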


And this is from the cmd_history.log file of the 2nd iteration of uss.t, from another Jenkins run:

[2019-04-10 15:35:51.927343]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable false ++++++++++
[2019-04-10 15:35:52.038072]  : volume set patchy nfs.disable false : SUCCESS
[2019-04-10 15:35:52.057582]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 42 gluster --mode=script --wignore volume start patchy ++++++++++
[2019-04-10 15:35:52.104288]  : volume start patchy : FAILED : Failed to find brick directory /d/backends/1/patchy_snap_mnt for volume patchy. Reason : No such file or directory =========> (export directory is not present)
[2019-04-10 15:35:52.117735]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: 44 _GFS --attribute-timeout=0 --entry-timeout=0 --volfile-server=builder205.int.aws.gluster.org --volfile-id=patchy /mnt/glusterfs/0 ++++++++++


I suspect something is wrong with the cleanup sequence: it causes the timeout of the test in the 1st iteration, and the export directory issues it leaves behind cause the failure of uss.t in the 2nd iteration.
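
If the cleanup sequence is indeed the culprit, one possible (untested) mitigation is to make the retry wait until the previous iteration's cleanup has actually finished, i.e. no leftover brick processes and nothing still mounted under the backend directory, before the 2nd iteration starts. Just a sketch: wait_for_cleanup is a hypothetical helper, the 60-second timeout is arbitrary, and invoking run-tests.sh directly is only for illustration.

# Hypothetical helper: wait until no glusterfsd processes remain and nothing
# is still mounted under /d/backends before retrying the test.
wait_for_cleanup() {
    local timeout=60
    while [ "$timeout" -gt 0 ]; do
        if ! pgrep -x glusterfsd >/dev/null \
           && ! grep -q '/d/backends/' /proc/mounts; then
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "previous cleanup did not complete in time" >&2
    return 1
}

wait_for_cleanup && ./run-tests.sh ./tests/basic/uss.t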


Regards,
Raghavendra



On Wed, Apr 10, 2019 at 4:07 PM FNU Raghavendra Manjunath <rabhat@xxxxxxxxxx> wrote:


On Wed, Apr 10, 2019 at 9:59 AM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
And now for last 15 days:


./tests/bitrot/bug-1373520.t     18  ==> Fixed through https://review.gluster.org/#/c/glusterfs/+/22481/; I don't see this failing in brick mux after 5th April

The above patch has been sent to fix the failure with brick mux enabled.
 
./tests/bugs/ec/bug-1236065.t     17  ==> happens only in brick mux, needs analysis.
./tests/basic/uss.t             15  ==> happens in both brick mux and non brick mux runs, test just simply times out. Needs urgent analysis.

Nothing has changed in snapview-server and snapview-client recently. Looking into it.

./tests/basic/ec/ec-fix-openfd.t 13  ==> Fixed through https://review.gluster.org/#/c/22508/ , patch merged today.
./tests/basic/volfile-sanity.t      8  ==> Some race, though this succeeds on the second attempt every time.

There are plenty more tests with 5 instances of failure each. We need all maintainers/owners to look through these failures and fix them; we certainly don't want to get into a stage where master is unstable and we have to lock down the merges till all these failures are resolved. So please help.

(Please note that the fstat stats show the retries as failures too, which in a way is right.)


On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
[1] captures the test failure report for the last 30 days, and we'd need volunteers/component owners to see why the number of failures is so high for a few tests.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
