Here is the stack trace on a successful run, borrowed from the unit tests, to confirm the code path: http://tracker.ceph.com/issues/7914#note-27

On 02/04/2014 19:51, Loic Dachary wrote:
> Given the parameters to jerasure_matrix_dotprod the code path should be:
>
> https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L338 (because nbytes == 2048)
> https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L332
> https://github.com/ceph/gf-complete/blob/v1-ceph/src/gf_w32.c#L569 (because INTEL_SSE4_PCLMUL has been used at compile time and the CPUID detected at runtime has the required features as selected in https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L49 )
>
> What should happen after that? h->prim_poly will select something, but what exactly... Could it be that the lack of stack means https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L332 references a NULL or invalid gfp_array[32]? Or could it be that the src/dest pointers are pointing to invalid memory?
>
> Bugs that can't be reproduced are the best ;-)
>
> On 02/04/2014 19:35, Loic Dachary wrote:
>> Hi Kevin,
>>
>> In the context of http://tracker.ceph.com/issues/7914 we're trying to figure out why jerasure dumps core. We don't know how to reproduce it yet (we ran dozens of identical test suites with no such crash in the past few days, which is to be expected for rare bugs because the test suite introduces random errors / failures on purpose).
>>
>> The full stack trace is at http://tracker.ceph.com/issues/7914#note-24 but the relevant part is here:
>>
>> #0  0x00007f4756779b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>> #1  0x0000000000981b4e in reraise_fatal (signum=11) at global/signal_handler.cc:59
>> #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
>> #3  <signal handler called>
>> #4  0x0000000000000000 in ?? ()
>> #5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:607
>> #6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:310
>> ...
>>
>> Note that this jerasure/gf-complete combination has been compiled with the SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2 and SSE flags activated. These are jerasure v2 and gf-complete v1, only slightly modified, as found in https://github.com/ceph/jerasure/tree/v2-ceph and https://github.com/ceph/gf-complete/tree/v1-ceph (all commits there have a pending pull request under https://bitbucket.org/jimplank/gf-complete and https://bitbucket.org/jimplank/jerasure, nothing you've not seen before).
>>
>> #5 is https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607
>>
>> From there it dives into gf-complete, where it most probably destroyed part of the stack when corrupting memory. I'll be chasing this tomorrow. If you have a brilliant idea on why that happens, I'll take it ;-)
>>
>> Cheers
>>
>
-- 
Loïc Dachary, Artisan Logiciel Libre