Hi Keithley,
Please find the details of the bug in the attached log file. I experienced this bug on both 3.4.0 and 3.4.1. There is no problem with GlusterFS 3.2.7.
I used IOR 3.0.1 for the test. The connection is IPoIB. OFED is from Mellanox (1.5.3). The OS is CentOS 6.4.
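For anyone who wants to approximate this kind of load without IOR and MPI, here is a rough sketch in Python. It is not IOR and not what I actually ran: threads stand in for the MPI ranks, the sizes are tiny by default, and the names write_file, run and the "repro.*" file names are made up for illustration. It only mimics the file-per-process pattern with 1 MiB transfers and counts partial writes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

XFER = 1 << 20  # 1 MiB per write() call, matching IOR's -t 1M


def write_file(path, total_bytes, xfer=XFER):
    """Write total_bytes to path in xfer-sized chunks; return the number of short writes."""
    buf = b"\xa5" * xfer
    short_writes = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < total_bytes:
            want = min(xfer, total_bytes - written)
            # os.write may accept fewer bytes than requested (a partial write);
            # a hard failure such as errno 107 surfaces as OSError.
            n = os.write(fd, buf[:want])
            if n < want:
                short_writes += 1
            written += n
        os.fsync(fd)  # flush before close
    finally:
        os.close(fd)
    return short_writes


def run(directory, tasks=8, bytes_per_task=4 * XFER):
    """Run `tasks` concurrent writers, one file each.

    Returns {path: short-write count, or the OSError that stopped the writer}.
    """
    paths = [os.path.join(directory, f"repro.{i:08d}") for i in range(tasks)]
    results = {}
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        futures = {p: pool.submit(write_file, p, bytes_per_task) for p in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except OSError as exc:
                results[path] = exc
    return results


if __name__ == "__main__":
    # Point this at a directory on the GlusterFS mount to exercise the volume;
    # a temporary directory is used here so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as scratch:
        for path, outcome in run(scratch).items():
            print(path, outcome)
```

To get anywhere near the real test, tasks and bytes_per_task would have to be raised a lot (the IOR run used 128 clients and 2 GiB per client), and the writers would have to be spread over several client nodes.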
About GlusterFS on RHEL 6.5: why are there no glusterfs-server and glusterfs-geo-replication packages?
On Tue, Dec 3, 2013 at 1:03 AM, Kaleb S. KEITHLEY <kkeithle@xxxxxxxxxx> wrote:
> On 12/02/2013 10:52 AM, Nguyen Viet Cuong wrote:
>> Hi,
>> Actually, I have had a very bad experience with GlusterFS 3.3.x and 3.4.x
>> under very high pressure (> 64 processes writing in parallel for more than
>> 10 minutes, for example).
>
> Have you filed a bug?
>
>> GlusterFS 3.2.7 from EPEL is really stable and
>> we use it for production.
>> Unfortunately, there is no official build of GlusterFS 3.2.x on
>> Gluster's repo.
>
> You can get the glusterfs RPMs that were built in the Fedora Koji build system at
> https://koji.fedoraproject.org/koji/packageinfo?packageID=5443
> and in particular the 3.2.7 el6 RPMs are at
> https://koji.fedoraproject.org/koji/buildinfo?buildID=323952
>
> --
> Kaleb
Nguyen Viet Cuong
IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Thu Nov 28 20:10:06 2013
Command line used: /opt/IOR/bin/IOR -a POSIX -F -m -t 1M -b 1G -s 2 -i 5 -k -r -R -w -W -e -o /mnt/IOR
Machine: Linux cn01.local

Test 0 started: Thu Nov 28 20:10:06 2013
Summary:
    api                = POSIX
    test filename      = /mnt/IOR
    access             = file-per-process
    ordering in a file = sequential offsets
    ordering inter file= no tasks offsets
    clients            = 128 (8 per node)
    repetitions        = 5
    xfersize           = 1 MiB
    blocksize          = 1 GiB
    aggregate filesize = 256 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     584.86     1048576    1024.00    43.68      448.19     11.28      448.22     0
read      885.50     1048576    1024.00    0.383256   295.97     0.016442   296.04     0
write     573.92     1048576    1024.00    60.53      456.69     11.56      456.76     1
read      956.56     1048576    1024.00    0.321138   274.03     0.031678   274.05     1
write     590.64     1048576    1024.00    65.95      443.77     11.66      443.83     2
read      1254.45    1048576    1024.00    0.383224   208.95     0.049748   208.97     2
write     585.15     1048576    1024.00    67.88      447.94     12.07      447.99     3
read      1298.00    1048576    1024.00    0.364708   201.94     0.067993   201.96     3
WARNING: Task 43, partial write(), 131072 of 1048576 bytes at offset 389021696
WARNING: Task 107, partial write(), 524288 of 1048576 bytes at offset 176160768
WARNING: Task 11, partial write(), 393216 of 1048576 bytes at offset 39845888
ior ERROR: write() failed, errno 107, Transport endpoint is not connected (aiori-POSIX.c:236)
ior ERROR: write() failed, errno 107, Transport endpoint is not connected (aiori-POSIX.c:236)
ior ERROR: write() failed, errno 107, Transport endpoint is not connected (aiori-POSIX.c:236)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 43 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
MXM: Got signal 15 (Terminated)
[... identical lines omitted ...]
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: cn04.local
PID:  2808

This process may still be running and/or consuming resources.

--------------------------------------------------------------------------
MXM: Got signal 15 (Terminated)
[... identical lines omitted ...]
[fs.local:17668] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[fs.local:17668] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[fs.local:17668] 1 more process has sent help message help-odls-default.txt / odls-default:could-not-kill
MXM: Got signal 15 (Terminated)
[... identical lines omitted ...]
--------------------------------------------------------------------------
mpirun has exited due to process rank 107 with PID 2725 on
node lustre04.local exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
MXM: Got signal 15 (Terminated)
[... identical lines omitted ...]
[fs.local:17668] 1 more process has sent help message help-odls-default.txt / odls-default:could-not-kill
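
To make the numbers in the log easier to compare, here is a small Python sketch that summarizes the completed iterations. The result rows are copied from the IOR output above; the helper name summarize is made up for illustration. It also checks the arithmetic behind the summary: 128 clients x 1 GiB blocks x 2 segments gives the 256 GiB aggregate size, and the reported bandwidth appears to be that aggregate size divided by the total time. Only iterations 0 to 3 appear in the table; judging by the log, the write errors hit during the fifth repetition.

```python
from statistics import mean

# Result rows copied from the IOR output. Columns:
# access, bw(MiB/s), block(KiB), xfer(KiB), open(s), wr/rd(s), close(s), total(s), iter
ROWS = """\
write 584.86 1048576 1024.00 43.68 448.19 11.28 448.22 0
read 885.50 1048576 1024.00 0.383256 295.97 0.016442 296.04 0
write 573.92 1048576 1024.00 60.53 456.69 11.56 456.76 1
read 956.56 1048576 1024.00 0.321138 274.03 0.031678 274.05 1
write 590.64 1048576 1024.00 65.95 443.77 11.66 443.83 2
read 1254.45 1048576 1024.00 0.383224 208.95 0.049748 208.97 2
write 585.15 1048576 1024.00 67.88 447.94 12.07 447.99 3
read 1298.00 1048576 1024.00 0.364708 201.94 0.067993 201.96 3
"""


def summarize(text):
    """Return {access: (count, min, mean, max)} of bandwidth in MiB/s."""
    bandwidth = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only nine-column result rows that start with an access type.
        if len(fields) == 9 and fields[0] in ("write", "read"):
            bandwidth.setdefault(fields[0], []).append(float(fields[1]))
    return {k: (len(v), min(v), mean(v), max(v)) for k, v in bandwidth.items()}


# clients * blocksize (GiB) * segments, from "clients = 128", -b 1G and -s 2
aggregate_gib = 128 * 1 * 2
aggregate_mib = aggregate_gib * 1024

if __name__ == "__main__":
    for access, (count, low, avg, high) in summarize(ROWS).items():
        print(f"{access}: n={count} min={low:.2f} mean={avg:.2f} max={high:.2f} MiB/s")
    print(f"aggregate size: {aggregate_gib} GiB")
    # Bandwidth of the first write iteration recomputed from size and total time.
    print(f"iter 0 write: {aggregate_mib / 448.22:.2f} MiB/s")
```

Write bandwidth stays within a narrow band across the four completed iterations, while read bandwidth rises from one iteration to the next.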