Re: Binary compatibility between an old static libstdc++ and a new dynamic one

Guilherme Quentel Melo <gqmelo@xxxxxxxxx> · Wed, 17 May 2017 18:34:36 -0300

On 2 May 2017 at 16:04, Guilherme Quentel Melo <gqmelo@xxxxxxxxx> wrote:
> On 13 April 2017 at 15:06, Jonathan Wakely <jwakely.gcc@xxxxxxxxx> wrote:
>> On 13 April 2017 at 14:26, Guilherme Quentel Melo wrote:
>>> Thanks for replying Jonathan
>>>
>>>
>>>> It's not supported to mix C++11 code compiled with 4.x and GCC 5+, in
>>>> any way, whether linking dynamically or statically.
>>>
>>> OK, but this is true even when the API is C? In this case no c++ structure
>>> is ever passed to mesa. If mesa was compiled with the new ABI, I should
>>> still be fine, right?
>>
>> Right.
>>
>>>> If you're only using C++98 (and of course only using the old COW
>>>> std::string in the code compiled with GCC 5+)
>>>
>>> Yeah, my gcc 5.x has _GLIBCXX_USE_CXX11_ABI=0 in the specs
>>>
>>>> This of course assumes both GCC versions are configured to be
>>>> compatible, i.e. you're not using --enable-fully-dynamic-string
>>>
>>> I'm not using many configure options, only
>>> --enable-version-specific-runtime-libs --disable-multilib
>>>
>>>
>>> So if this should work I will try to investigate it further, but I'm not sure
>>> what else I can do.
>>> gdb did not help much because if I recompile mesa without optimizations
>>> the crash does not happen.
>>>
>>> Actually disabling only inline optimization also makes the crash go away.
>>> Given that all invalid free stacks shown by valgrind contain inline functions
>>> from basic_string.h does that ring you any bells?
>>>
>>> Any other tips for debugging this?
>>
>> I'm not sure what to check. If the symbols are equivalent then it
>> shouldn't matter whether a given symbol is inlined using the GCC 4.8.5
>> code or comes from the 5.4.0 shared library. But apparently it does,
>> so either the new library is not backwards compatible, or something
>> else is going on.
>
>
> So I finally got some time to investigate this issue further and I
> (hopefully) found the problem. In case someone runs into a similar
> problem, this is what I did:
>
> - Rebuilt stock gcc 4.8.5 and 5.1.0 on CentOS 6 without stripping binaries
> - Created a dummy FooEngineBuilder in llvm/ExecutionEngine/ExecutionEngine.h
> - Rebuilt both mesa and llvm-mesa-private on CentOS 7 with gcc 4.8.5 and debug
> symbols
>
> FooEngineBuilder is just a class with a std::string member and two methods to
> set the string:
>
>     class FooEngineBuilder {
>     private:
>       std::string MCPU;
>     public:
>       FooEngineBuilder &setMCPUFromHeader() {
>         std::string mymcpu;
>         mymcpu = "my_mcpu";
>         MCPU.assign(mymcpu.begin(), mymcpu.end());
>         return *this;
>       }
>       FooEngineBuilder &setMCPUFromSource();
>     };
>
> Using this class in mesa, this crashes:
>
>     FooEngineBuilder foo_builder;
>     foo_builder.setMCPUFromHeader();
>
> and this does not:
>
>     FooEngineBuilder foo_builder;
>     foo_builder.setMCPUFromSource();
>
> What happens is that MCPU is an empty string pointing to
> std::string::_Rep::_S_empty_rep_storage defined on the static libstdc++
> (gcc 4.8.5). When assigning MCPU from the header, the _M_dispose method
> from the dynamic library (gcc 5.1.0) is called.
>
> _M_dispose only destroys the string if it's not a reference to
> std::string::_Rep::_S_empty_rep_storage:
>
>     if (__builtin_expect(this != &_S_empty_rep(), false))
>
> The problem is that *this* is pointing to a different
> std::string::_Rep::_S_empty_rep_storage than &_S_empty_rep(), making
> _M_dispose try to delete a static std::string member.
>
> In summary, the problem is that static variables are being defined twice,
> which is exactly why STB_GNU_UNIQUE was created:
>
> https://www.redhat.com/archives/posix-c++-wg/2009-August/msg00002.html
>
> The llvm library is correctly defining the symbols as unique:
>
>     $ objdump -C -T /usr/lib64/libLLVM-3.8-mesa.so | grep _S_empty_rep_st>
>     000000000405be20 u    DO .bss   0000000000000020  Base
> std::string::_Rep::_S_empty_rep_storage
>     000000000405bde0 u    DO .bss   0000000000000020  Base
> std::basic_string<wchar_t, std::char_traits<wchar_t>,
> std::allocator<wchar_t> >::_Rep::_S_empty_rep_storage
>
> But the libstdc++ compiled on CentOS 6 is not:
>
>     $ objdump -C -T $LIBSTDCXX5 | grep _S_empty_rep_storage
>     000000000038c300 g    DO .bss   0000000000000020  GLIBCXX_3.4
> std::string::_Rep::_S_empty_rep_storage
>     000000000038c320 g    DO .bss   0000000000000020  GLIBCXX_3.4
> std::basic_string<wchar_t, std::char_traits<wchar_t>,
> std::allocator<wchar_t> >::_Rep::_S_empty_rep_storage
>
> So in conclusion when building gcc I need to make sure that libstdc++.so is
> defining STB_GNU_UNIQUE symbols.
>
> Maybe this should be mentioned on some gcc/libstdc++ docs related to binary
> compatibility?

Hi Jonathan,

Me again. I thought I had solved the problem by making sure that my
libstdc++ was using STB_GNU_UNIQUE.
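
For anyone reading only this message, the earlier failure boils down to a
pointer comparison against the wrong copy of a sentinel. Below is a minimal
sketch of that idea. Rep, would_release and the two sentinel variables are
made-up names for illustration and are not the libstdc++ definitions.

```cpp
// Hypothetical stand-in for std::string::_Rep; not the libstdc++ definition.
struct Rep {
    int refcount;
};

// Two copies of the "empty string" sentinel, standing in for the copy inside
// the statically linked libstdc++ and the copy inside libstdc++.so.
static Rep empty_rep_in_static_lib = {0};
static Rep empty_rep_in_shared_lib = {0};

// Models the guard in _M_dispose: a library skips the release path only for
// the sentinel whose address it knows, i.e. its own copy.
constexpr bool would_release(const Rep* rep, const Rep* own_empty_rep) {
    return rep != own_empty_rep;
}
```

With these definitions,
would_release(&empty_rep_in_static_lib, &empty_rep_in_shared_lib) is true.
That corresponds to the shared library treating the static library's
sentinel as an ordinary heap-allocated string and trying to free it.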

But now I'm facing another crash with an invalid pointer being freed, this
time related to std::locale. The crash happens in
locale::_Impl::_M_install_facet.

After debugging I have no idea what the right behaviour would be.
I attached the whole gdb output, but here are some highlights
(I omitted some output so the lines don't break).

The locale stuff is first initialized in libstdc++.so.6 (gcc 5.4.0).
This is part of the stack with a breakpoint on
"if (__index > _M_facets_size - 1)" :

  #0  std::locale::_Impl::_M_install_facet at locale.cc:321
  #1  in std::locale::_Impl::_M_init_facet<... > at locale_classes.h:602
  #2  in std::locale::_Impl::_Impl at locale_init.cc:479
  #3  in std::locale::_S_initialize_once () at locale_init.cc:307
  #4  in pthread_once () from /lib64/libpthread.so.0
  #5  in __gthread_once at gthr-default.h:699
  #6  in std::locale::_S_initialize () at locale_init.cc:316
  #7  in std::locale::locale at locale_init.cc:250

All of this happens on libstdc++.so.6.
Adding a breakpoint on _M_install_facet to print some info:

  b gcc-5.4.0/libstdc++-v3/src/c++98/locale.cc:321
  command 1
    print __index
    print _M_facets_size
    continue
  end

shows _M_facets_size = 46 and __index goes from 0 to 29.

But at some point, when boost::lexical_cast is used,
std::locale from libLLVM-3.8-mesa.so (gcc 4.8.5) is used,
which causes the locale stuff to be initialized again:

  #0  std::locale::_Impl::_M_install_facet at locale.cc:319
  #1  in std::locale::_Impl::_M_init_facet<... > at locale_classes.h:564
  #2  in std::locale::_Impl::_Impl at locale_init.cc:397
  #3  in std::locale::_S_initialize_once at locale_init.cc:267
  #4  in pthread_once () from /lib64/libpthread.so.0
  #5  in __gthread_once gthr-default.h:699
  #6  in std::locale::_S_initialize () at locale_init.cc:276
  #7  in std::locale::locale at locale_init.cc:210
  #8  in boost::detail::lcast_put_unsigned<std::char_traits<char>,
         unsigned long, char>::convert  lcast_unsigned_converters.hpp:95

All of the above C++ execution happens on libLLVM-3.8-mesa.so.
Adding a breakpoint like the previous one:

  b gcc-4.8.5/libstdc++-v3/src/c++98/locale.cc:319
  command 2
    print __index
    print _M_facets_size
    continue
  end

shows _M_facets_size = 28, with __index going from 0 to 1 and then
suddenly jumping to 30. That's when the crash happens: it allocates
a new facet vector and tries to delete the old one.

locale.cc:352 causes the crash:

  delete [] __oldf;
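
To make the growth path concrete, here is a simplified model of what I
understand that branch to be doing. This is a sketch and not the libstdc++
source: FacetTable and its members are hypothetical names, and growing to
index + 4 is an arbitrary choice for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a facet table; not the libstdc++ source.
class FacetTable {
public:
    explicit FacetTable(std::size_t size)
        : facets_(new const void*[size]()), size_(size) {
        assert(size > 0);  // keeps "size_ - 1" below from wrapping around
    }
    ~FacetTable() { delete[] facets_; }
    FacetTable(const FacetTable&) = delete;
    FacetTable& operator=(const FacetTable&) = delete;

    // Stores a facet at index; returns true if the table had to be reallocated.
    bool install(std::size_t index, const void* facet) {
        bool grew = false;
        if (index > size_ - 1) {
            const std::size_t new_size = index + 4;  // arbitrary growth step
            const void** old = facets_;
            const void** fresh = new const void*[new_size]();
            for (std::size_t i = 0; i < size_; ++i)
                fresh[i] = old[i];
            facets_ = fresh;
            size_ = new_size;
            delete[] old;  // only valid if "old" really came from new[]
            grew = true;
        }
        facets_[index] = facet;
        return grew;
    }

    std::size_t size() const { return size_; }
    const void* at(std::size_t i) const { return facets_[i]; }

private:
    const void** facets_;
    std::size_t size_;
};
```

In this model a table built with 28 slots takes indices 0 and 1 without
reallocating, while index 30 forces a reallocation and a delete[] of the old
array. If the old array was not obtained from new[], that delete[] is
undefined behaviour, which would be consistent with the crash I'm seeing,
although the model does not prove that this is the cause.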

So it seems a lot of things are going wrong:

1 - Should it be safe to call _S_initialize_once in both libraries?

2 - Is the "if (__index > _M_facets_size - 1)" branch executed
under normal circumstances?

3 - In another test, with only a main.cpp linked only against
libstdc++.so, I tried to force the code inside this "if" to be
executed by doing "set __index = _M_facets_size" in gdb, and
the result is the same crash.
Should delete [] __oldf even work?

PS.: Unfortunately I couldn't come up with a simple example
that reproduces this. All the examples I tried only executed locale
code from libstdc++.so.

Attachment: gdb_output (binary data)
Attachment: trace_locale.gdb (binary data)