Fwd: [MPICH] non-blocking sending/receiving an array

Hi

I am still having this packed send/receive problem, but it is
intermittent: it fails on some runs and works fine on others. Lately it
works only if I launch the program with the following command:

mpirun -np 4 valgrind --leak-check=full -v --log-file=val3.out myprog
myprogarguments

That is, when I run under valgrind everything is fine. I think the
problem is pointers being shifted while receiving the packed array,
whether the receive is blocking or non-blocking (MPI_Recv or MPI_Irecv).
I will need to run on a high-performance machine where I won't be able
to use valgrind, so I need to make sure the program is stable and can
handle large data sizes without problems.

Each process in my program is multi-threaded, but I also tried running
everything sequentially within each process (no threads) and the
problem is the same, so it is not about thread safety or
synchronization.

I am copying the gcc list in case I can get some insight into the
problem, and also to ask about safer alternatives to the ANSI C atoi
and sprintf functions, because some of the valgrind errors are caused
by sprintf and so far I couldn't find a safe alternative. The way I use
sprintf now is, for example:

#define SHORT_MESSAGE_SIZE 200
char msg[SHORT_MESSAGE_SIZE];
sprintf (msg, "%ld: add OC w %ld, pi %ld, ci %ld, cs %ld, dp %d af %d ",
         OCout_ub, waveNo, partIndex, cellIndex, cellScore, depProc,
         addflag);

I then print msg to a debugging file corresponding to the process and
the thread it came from.
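
For reference, the usual standard-library answers are snprintf (C99) in
place of sprintf and strtol in place of atoi. The sketch below is only
an illustration: the helper names format_debug_msg and parse_long are
made up, and the shortened format string and argument types are
assumptions, not the real ones from the program.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define SHORT_MESSAGE_SIZE 200

/* Hypothetical helper: formats into msg without ever writing more than
 * size bytes (including the terminating NUL).
 * Returns 0 on success, -1 on an output error or if the text was truncated. */
int format_debug_msg(char *msg, size_t size, long id, long waveNo, int depProc)
{
    int n = snprintf(msg, size, "%ld: add OC w %ld, dp %d", id, waveNo, depProc);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: strtol-based replacement for atoi that reports
 * empty input, trailing junk, and out-of-range values instead of
 * silently returning a number. */
int parse_long(const char *text, long *out)
{
    char *end = NULL;
    long v;

    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return -1;
    *out = v;
    return 0;
}
```

A call such as format_debug_msg(msg, sizeof msg, OCout_ub, waveNo,
depProc) then either fills msg completely or reports that the message
did not fit, rather than writing past the end of the buffer.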

The valgrind output is shown below if you are interested in having a
look. Most of the reports come from the MPI library implementation
rather than from my code; however, neither the sprintf reports nor the
MPI ones seem to explain all this memory shifting.



==4138== Memcheck, a memory error detector.
==4138== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==4138== Using LibVEX rev 1732, a library for dynamic binary translation.
==4138== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==4138== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==4138== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==4138==
--4138-- Startup, with flags:
--4138--    --leak-check=full
--4138--    -v
--4138--    --log-file=val3.out
--4138-- Contents of /proc/version:
--4138--   Linux version 2.6.21-1.3194.fc7
(kojibuilder@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ) (gcc version 4.1.2
20070502 (Red Hat 4.1.2-12)) #1 SMP Wed May 23 22:35:01 EDT 2007
--4138-- Arch and hwcaps: X86, x86-sse1-sse2
--4138-- Page sizes: currently 4096, max supported 4096
--4138-- Valgrind library directory: /usr/lib/valgrind
--4138-- Reading syms from /home/mhelal/thesis/exp/ver2.1/mmDst (0x8048000)
--4138-- Reading syms from /usr/lib/valgrind/x86-linux/memcheck (0x38000000)
--4138--    object doesn't have a dynamic symbol table
--4138-- Reading syms from /lib/ld-2.6.so (0x46C44000)
--4138-- Reading suppressions file: /usr/lib/valgrind/default.supp
--4138-- REDIR: 0x46C596F0 (index) redirected to 0x38027EDF
(vgPlain_x86_linux_REDIR_FOR_index)
--4138-- Reading syms from
/usr/lib/valgrind/x86-linux/vgpreload_core.so (0x4001000)
--4138-- Reading syms from
/usr/lib/valgrind/x86-linux/vgpreload_memcheck.so (0x4003000)
==4138== WARNING: new redirection conflicts with existing -- ignoring it
--4138--     new: 0x46C596F0 (index     ) R-> 0x040061F0 index
--4138-- REDIR: 0x46C59890 (strlen) redirected to 0x40062A0 (strlen)
--4138-- Reading syms from /lib/libm-2.6.so (0x4776B000)
--4138-- Reading syms from /lib/libpthread-2.6.so (0x479B7000)
--4138-- Reading syms from /home/mhelal/Install/mpi/lib/libmpich.so (0x4017000)
--4138-- Reading syms from /lib/librt-2.6.so (0x46CC5000)
--4138-- Reading syms from /lib/libc-2.6.so (0x47615000)
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBDB: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBE3: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4ED25: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4F01B: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4F4F0: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBDB: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBE3: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4ED25: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
--4138-- REDIR: 0x47684810 (memset) redirected to 0x4006600 (memset)
--4138-- REDIR: 0x47684D00 (memcpy) redirected to 0x4007030 (memcpy)
--4138-- REDIR: 0x47683930 (rindex) redirected to 0x40060D0 (rindex)
--4138-- REDIR: 0x4767EC90 (calloc) redirected to 0x400478D (calloc)
--4138-- REDIR: 0x47683590 (strlen) redirected to 0x4006280 (strlen)
--4138-- REDIR: 0x47683780 (strncmp) redirected to 0x40062E0 (strncmp)
--4138-- REDIR: 0x4767EF90 (malloc) redirected to 0x4005460 (malloc)
--4138-- REDIR: 0x476804F0 (free) redirected to 0x400507A (free)
--4138-- REDIR: 0x47684310 (memchr) redirected to 0x4006470 (memchr)
--4138-- REDIR: 0x47683880 (strncpy) redirected to 0x40068D0 (strncpy)
--4138-- REDIR: 0x47682EC0 (index) redirected to 0x40061C0 (index)
--4138-- REDIR: 0x476830A0 (strcpy) redirected to 0x4007290 (strcpy)
--4138-- REDIR: 0x47684870 (mempcpy) redirected to 0x4006B10 (mempcpy)
--4138-- REDIR: 0x47683030 (strcmp) redirected to 0x4006350 (strcmp)
==4138==
==4138== Syscall param writev(vector[...]) points to uninitialised byte(s)
==4138==    at 0x476DE118: writev (in /lib/libc-2.6.so)
==4138==    by 0x41056E8: MPIDU_Socki_handle_write (sock_wait.i:689)
==4138==    by 0x41044E3: MPIDU_Sock_wait (sock_wait.i:329)
==4138==    by 0x406E66E: MPIDI_CH3_Progress_wait (ch3_progress.c:189)
==4138==    by 0x40B52FF: MPIC_Wait (helper_fns.c:275)
==4138==    by 0x40B4C0B: MPIC_Sendrecv (helper_fns.c:121)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==    by 0x804B817: main (main.c:513)
==4138==  Address 0x41922E0 is 32 bytes inside a block of size 72 alloc'd
==4138==    at 0x40054E5: malloc (vg_replace_malloc.c:149)
==4138==    by 0x4071262: MPIDI_CH3I_Connection_alloc (ch3u_connect_sock.c:125)
==4138==    by 0x4073080: MPIDI_CH3I_VC_post_sockconnect
(ch3u_connect_sock.c:1023)
==4138==    by 0x406F8C4: MPIDI_CH3I_VC_post_connect (ch3_progress.c:857)
==4138==    by 0x406D5E2: MPIDI_CH3_iSendv (ch3_isendv.c:194)
==4138==    by 0x4073A1C: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:460)
==4138==    by 0x40C66F4: MPID_Isend (mpid_isend.c:117)
==4138==    by 0x40B4BB0: MPIC_Sendrecv (helper_fns.c:117)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==
==4138== Syscall param writev(vector[...]) points to uninitialised byte(s)
==4138==    at 0x476DE118: writev (in /lib/libc-2.6.so)
==4138==    by 0x41033C2: MPIDU_Sock_writev (sock_immed.i:604)
==4138==    by 0x406D08A: MPIDI_CH3_iSendv (ch3_isendv.c:83)
==4138==    by 0x4073A1C: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:460)
==4138==    by 0x40C66F4: MPID_Isend (mpid_isend.c:117)
==4138==    by 0x40B4BB0: MPIC_Sendrecv (helper_fns.c:117)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==    by 0x804B817: main (main.c:513)
==4138==  Address 0xBEF02118 is on thread 1's stack
--4138-- REDIR: 0x476806E0 (realloc) redirected to 0x400550F (realloc)
==4138==
==4138== Thread 2:
==4138== Source and destination overlap in mempcpy(0x4C8BAA8, 0x4C8BAA8, 24)
==4138==    at 0x4006B94: mempcpy (mc_replace_strmem.c:116)
==4138==    by 0x47679314: _IO_default_xsputn (in /lib/libc-2.6.so)
==4138==    by 0x476544ED: vfprintf (in /lib/libc-2.6.so)
==4138==    by 0x4766E4CB: vsprintf (in /lib/libc-2.6.so)
==4138==    by 0x4765A0BD: sprintf (in /lib/libc-2.6.so)
==4138==    by 0x80589D5: getPrevCells (scoring.c:230)
==4138==    by 0x8058EF4: getScore (scoring.c:305)
==4138==    by 0x80599F3: ComputePartitionScores (scoring.c:470)
==4138==    by 0x804B215: ScoreCompThread (main.c:392)
==4138==    by 0x479BC2FA: start_thread (in /lib/libpthread-2.6.so)
==4138==    by 0x476E593D: clone (in /lib/libc-2.6.so)
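
A note on the last report: one common cause of "Source and destination
overlap" inside sprintf is a call that passes the destination buffer as
one of its own arguments, for example sprintf(msg, "%s ...", msg),
which the C standard leaves undefined. Whether scoring.c:230 does this
is only a guess, since that line is not shown here. A minimal sketch of
appending to a buffer without any overlap, using a made-up helper named
append_long:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: appends " <value>" to the NUL-terminated string
 * already in buf, writing only into the unused tail of the buffer.
 * Returns 0 on success, -1 if the buffer is full or the text would not fit. */
int append_long(char *buf, size_t size, long value)
{
    size_t used = strlen(buf);
    int n;

    if (used >= size)
        return -1;
    /* The source arguments never alias the destination region. */
    n = snprintf(buf + used, size - used, " %ld", value);
    return (n < 0 || (size_t)n >= size - used) ? -1 : 0;
}
```

Writing at buf + used keeps the existing text as it is and avoids
reading and writing the same bytes in one call.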


On 17/05/07, Blankenship, David  <David.Blankenship@xxxxxxxxxxxxxx> wrote:
  I am doing the same type of thing with the blocking calls. Here is how
I am doing it. This code uses the C++ MPI interface.

// Probe for a message from any source
MPI::COMM_WORLD.Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, cMPIStatus );
int iMessageLength = cMPIStatus.Get_count( MPI_CHAR );
// Here I resize my receive buffer if necessary

// Receive the message that was just probed
int iSource = cMPIStatus.Get_source();
MPI::COMM_WORLD.Recv( &(cBuffer[0]), cBuffer.size(), MPI_CHAR, iSource,
MPI_ANY_TAG, cMPIStatus );


You could also use the tag to differentiate messages from a single
source. This does eliminate the need to send 2 messages, one with the
size and then one with the array. That is what I liked most about this
solution.

I hope this helps.

David Blankenship



-----Original Message-----
From: owner-mpich-discuss@xxxxxxxxxxx
[mailto: owner-mpich-discuss@xxxxxxxxxxx] On Behalf Of Manal Helal
Sent: Wednesday, May 16, 2007 2:44 AM
To: mpich-discuss-digest@xxxxxxxxxxx
Subject: [MPICH] non-blocking sending/receiving an array

Hi

I am trying to send an array: I send its size first and then the array
itself. However, I am sending in a loop and receiving in a loop, so
messages arrive in a different order than I expect: I receive an array
size, and then receive from the same sender an array of a different
size that was sent in another iteration. I am using non-blocking
communication and testing with 3 processes for now, though there could
be more later. In the receive for the array I can only specify the
sender (the one I received the array size from), but I can't specify
the size. It is giving me:

rank 2 in job 4  localhost.localdomain_54476   caused collective abort
of all ranks
  exit status of rank 2: killed by signal 9
2:  MPI_Wait(140)..........................:
MPI_Wait(request=0xb6b55198, status0xb6b5519c) failed
2:  MPIDI_CH3U_Post_data_receive_found(163): Message from rank 0 and
tag 92 truncated; 224 bytes received but buffer size is 56


Is there a way to probe for a specific size and receive only if the
message has that size? MPI_Iprobe has no argument for specifying the
count.

Any ideas would greatly help.

Thank you very much, Kind Regards,

Manal


