Hi,

I am still having this packed send/receive problem, but only intermittently: sometimes it fails and other times it works fine. Lately it works reliably only when I run under valgrind, with the following command:

mpirun -np 4 valgrind --leak-check=full -v --log-file=val3.out myprog myprogarguments

I think the problem is pointers being shifted while receiving the packed array, whether the receive is blocking or non-blocking (MPI_Recv or MPI_Irecv). I will need to run on a high-performance machine where I won't be able to use valgrind, so I need to make sure the program is stable and can handle large data sizes without problems.

Each process in my program is multi-threaded, but I also tried running everything sequentially within the process (no threads) and the problem is the same, so it is not about thread safety or synchronization.

I am copying the gcc list in case I can get some insight into the problem, and also suggestions for safer alternatives to the ANSI C atoi and sprintf, because some of the valgrind errors are caused by sprintf and so far I couldn't find a safe alternative. The way I use sprintf now is, for example:

#define SHORT_MESSAGE_SIZE 200
char msg[SHORT_MESSAGE_SIZE];
sprintf (msg, "%ld: add OC w %ld, pi %ld, ci %ld, cs %ld, dp %d af %d ", OCout_ub, waveNo, partIndex, cellIndex, cellScore, depProc, addflag);

Then I print msg to a debugging file corresponding to the process and thread it came from.

The valgrind output is shown below, if you are interested in having a look. Most of the errors are in the MPI library implementation rather than in my code; however, neither kind of problem seems to explain all this memory shifting.

==4138== Memcheck, a memory error detector.
==4138== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==4138== Using LibVEX rev 1732, a library for dynamic binary translation.
==4138== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==4138== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==4138== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==4138==
--4138-- Startup, with flags:
--4138--    --leak-check=full
--4138--    -v
--4138--    --log-file=val3.out
--4138-- Contents of /proc/version:
--4138--   Linux version 2.6.21-1.3194.fc7 (kojibuilder@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ) (gcc version 4.1.2 20070502 (Red Hat 4.1.2-12)) #1 SMP Wed May 23 22:35:01 EDT 2007
--4138-- Arch and hwcaps: X86, x86-sse1-sse2
--4138-- Page sizes: currently 4096, max supported 4096
--4138-- Valgrind library directory: /usr/lib/valgrind
--4138-- Reading syms from /home/mhelal/thesis/exp/ver2.1/mmDst (0x8048000)
--4138-- Reading syms from /usr/lib/valgrind/x86-linux/memcheck (0x38000000)
--4138--    object doesn't have a dynamic symbol table
--4138-- Reading syms from /lib/ld-2.6.so (0x46C44000)
--4138-- Reading suppressions file: /usr/lib/valgrind/default.supp
--4138-- REDIR: 0x46C596F0 (index) redirected to 0x38027EDF (vgPlain_x86_linux_REDIR_FOR_index)
--4138-- Reading syms from /usr/lib/valgrind/x86-linux/vgpreload_core.so (0x4001000)
--4138-- Reading syms from /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so (0x4003000)
==4138== WARNING: new redirection conflicts with existing -- ignoring it
--4138--     new: 0x46C596F0 (index ) R-> 0x040061F0 index
--4138-- REDIR: 0x46C59890 (strlen) redirected to 0x40062A0 (strlen)
--4138-- Reading syms from /lib/libm-2.6.so (0x4776B000)
--4138-- Reading syms from /lib/libpthread-2.6.so (0x479B7000)
--4138-- Reading syms from /home/mhelal/Install/mpi/lib/libmpich.so (0x4017000)
--4138-- Reading syms from /lib/librt-2.6.so (0x46CC5000)
--4138-- Reading syms from /lib/libc-2.6.so (0x47615000)
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBDB: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBE3: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4ED25: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4F01B: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4F4F0: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBDB: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBE3: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4ED25: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
--4138-- REDIR: 0x47684810 (memset) redirected to 0x4006600 (memset)
--4138-- REDIR: 0x47684D00 (memcpy) redirected to 0x4007030 (memcpy)
--4138-- REDIR: 0x47683930 (rindex) redirected to 0x40060D0 (rindex)
--4138-- REDIR: 0x4767EC90 (calloc) redirected to 0x400478D (calloc)
--4138-- REDIR: 0x47683590 (strlen) redirected to 0x4006280 (strlen)
--4138-- REDIR: 0x47683780 (strncmp) redirected to 0x40062E0 (strncmp)
--4138-- REDIR: 0x4767EF90 (malloc) redirected to 0x4005460 (malloc)
--4138-- REDIR: 0x476804F0 (free) redirected to 0x400507A (free)
--4138-- REDIR: 0x47684310 (memchr) redirected to 0x4006470 (memchr)
--4138-- REDIR: 0x47683880 (strncpy) redirected to 0x40068D0 (strncpy)
--4138-- REDIR: 0x47682EC0 (index) redirected to 0x40061C0 (index)
--4138-- REDIR: 0x476830A0 (strcpy) redirected to 0x4007290 (strcpy)
--4138-- REDIR: 0x47684870 (mempcpy) redirected to 0x4006B10 (mempcpy)
--4138-- REDIR: 0x47683030 (strcmp) redirected to 0x4006350 (strcmp)
==4138==
==4138== Syscall param writev(vector[...]) points to uninitialised byte(s)
==4138==    at 0x476DE118: writev (in /lib/libc-2.6.so)
==4138==    by 0x41056E8: MPIDU_Socki_handle_write (sock_wait.i:689)
==4138==    by 0x41044E3: MPIDU_Sock_wait (sock_wait.i:329)
==4138==    by 0x406E66E: MPIDI_CH3_Progress_wait (ch3_progress.c:189)
==4138==    by 0x40B52FF: MPIC_Wait (helper_fns.c:275)
==4138==    by 0x40B4C0B: MPIC_Sendrecv (helper_fns.c:121)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==    by 0x804B817: main (main.c:513)
==4138==  Address 0x41922E0 is 32 bytes inside a block of size 72 alloc'd
==4138==    at 0x40054E5: malloc (vg_replace_malloc.c:149)
==4138==    by 0x4071262: MPIDI_CH3I_Connection_alloc (ch3u_connect_sock.c:125)
==4138==    by 0x4073080: MPIDI_CH3I_VC_post_sockconnect (ch3u_connect_sock.c:1023)
==4138==    by 0x406F8C4: MPIDI_CH3I_VC_post_connect (ch3_progress.c:857)
==4138==    by 0x406D5E2: MPIDI_CH3_iSendv (ch3_isendv.c:194)
==4138==    by 0x4073A1C: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:460)
==4138==    by 0x40C66F4: MPID_Isend (mpid_isend.c:117)
==4138==    by 0x40B4BB0: MPIC_Sendrecv (helper_fns.c:117)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==
==4138== Syscall param writev(vector[...]) points to uninitialised byte(s)
==4138==    at 0x476DE118: writev (in /lib/libc-2.6.so)
==4138==    by 0x41033C2: MPIDU_Sock_writev (sock_immed.i:604)
==4138==    by 0x406D08A: MPIDI_CH3_iSendv (ch3_isendv.c:83)
==4138==    by 0x4073A1C: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:460)
==4138==    by 0x40C66F4: MPID_Isend (mpid_isend.c:117)
==4138==    by 0x40B4BB0: MPIC_Sendrecv (helper_fns.c:117)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==    by 0x804B817: main (main.c:513)
==4138==  Address 0xBEF02118 is on thread 1's stack
--4138-- REDIR: 0x476806E0 (realloc) redirected to 0x400550F (realloc)
==4138==
==4138== Thread 2:
==4138== Source and destination overlap in mempcpy(0x4C8BAA8, 0x4C8BAA8, 24)
==4138==    at 0x4006B94: mempcpy (mc_replace_strmem.c:116)
==4138==    by 0x47679314: _IO_default_xsputn (in /lib/libc-2.6.so)
==4138==    by 0x476544ED: vfprintf (in /lib/libc-2.6.so)
==4138==    by 0x4766E4CB: vsprintf (in /lib/libc-2.6.so)
==4138==    by 0x4765A0BD: sprintf (in /lib/libc-2.6.so)
==4138==    by 0x80589D5: getPrevCells (scoring.c:230)
==4138==    by 0x8058EF4: getScore (scoring.c:305)
==4138==    by 0x80599F3: ComputePartitionScores (scoring.c:470)
==4138==    by 0x804B215: ScoreCompThread (main.c:392)
==4138==    by 0x479BC2FA: start_thread (in /lib/libpthread-2.6.so)
==4138==    by 0x476E593D: clone (in /lib/libc-2.6.so)

On 17/05/07, Blankenship, David <David.Blankenship@xxxxxxxxxxxxxx> wrote:
I am doing the same type of thing with the blocking calls. Here is how I am doing it. This code uses the C++ MPI interface.

// Probe for a message from any source
MPI::COMM_WORLD.Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, cMPIStatus );
int iMessageLength = cMPIStatus.Get_count( MPI_CHAR );

// Here I resize my receive buffer if necessary

// Receive the message that was just probed
int iSource = cMPIStatus.Get_source();
MPI::COMM_WORLD.Recv( &(cBuffer[0]), cBuffer.size(), MPI_CHAR, iSource, MPI_ANY_TAG, cMPIStatus );

You could also use the tag to differentiate messages from a single source. This does eliminate the need to send two messages, one with the size and then one with the array. That is what I liked most about this solution.

I hope this helps.

David Blankenship

-----Original Message-----
From: owner-mpich-discuss@xxxxxxxxxxx [mailto:owner-mpich-discuss@xxxxxxxxxxx] On Behalf Of Manal Helal
Sent: Wednesday, May 16, 2007 2:44 AM
To: mpich-discuss-digest@xxxxxxxxxxx
Subject: [MPICH] non-blocking sending/receiving an array

Hi,

I am trying to send an array: I send its size first and then the array itself. However, I am sending in a loop and receiving in a loop, so I end up receiving in a different order. For example, I receive an array size and then receive, from the same sender, an array of a different size that was sent in another iteration. I am using non-blocking communication and testing with 3 processes for now, but there could be more later. So in the receive for the array I can only specify the sender (the one I received the array size from), but I can't specify the size. It is giving me:

rank 2 in job 4 localhost.localdomain_54476 caused collective abort of all ranks
  exit status of rank 2: killed by signal 9
2: MPI_Wait(140)..........................: MPI_Wait(request=0xb6b55198, status0xb6b5519c) failed
2: MPIDI_CH3U_Post_data_receive_found(163): Message from rank 0 and tag 92 truncated; 224 bytes received but buffer size is 56

Is there a way to probe for a specific size and receive only if the message has that size? MPI_Iprobe has no parameter for the count. Any ideas will greatly help.

Thank you very much,

Kind Regards,

Manal
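
On the sprintf and atoi question raised in the newest message of this thread, here is a minimal sketch of the usual standard-library replacements. snprintf (C99) takes the buffer size and never writes past it, and strtol reports malformed or out-of-range input, which atoi cannot do. The helper names format_debug_msg and parse_long and the shortened format string are illustrative assumptions, not code from the program discussed here.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define SHORT_MESSAGE_SIZE 200

/* Hypothetical helper: format a debug line into msg without ever writing
 * more than size bytes (terminator included).
 * Returns 0 if the whole line fit, 1 if it was truncated, -1 on error. */
int format_debug_msg(char *msg, size_t size, long waveNo, long partIndex,
                     int depProc)
{
    int n = snprintf(msg, size, "add OC w %ld, pi %ld, dp %d",
                     waveNo, partIndex, depProc);
    if (n < 0)
        return -1;                 /* output error reported by snprintf */
    return ((size_t)n >= size) ? 1 : 0;
}

/* Hypothetical helper: parse a base-10 long with the checks atoi lacks.
 * Returns 0 and stores the value on success; returns -1 on empty input,
 * trailing non-digit characters, or a value outside the range of long. */
int parse_long(const char *text, long *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return -1;
    *out = value;
    return 0;
}
```

With a buffer declared as char msg[SHORT_MESSAGE_SIZE], the call would be format_debug_msg(msg, sizeof msg, waveNo, partIndex, depProc), and a return value of 1 signals that the buffer needs to be larger. Note that snprintf only bounds the write: passing the destination buffer as one of its own arguments is still undefined behaviour. That pattern is one common cause of the overlap warning valgrind prints for sprintf, though whether it applies at scoring.c:230 cannot be told from the log alone.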