On Tue, Jun 20, 2023 at 01:25:23PM +0200, Michal Prívozník wrote: > On 6/19/23 17:02, Daniel P. Berrangé wrote: > > On Mon, Jun 19, 2023 at 02:45:57PM +0200, Michal Prívozník wrote: > >> On 6/19/23 14:31, Daniel P. Berrangé wrote: > >>> On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote: > >>>> It's weird that in 2023 there's no reliable and portable way to > >>>> parse strings, but okay. > >>>> > >>>> We started with strtol(). Except, it doesn't work the same on > >>>> Linux and Windows. On Windows it behaves a bit different when it > >>>> comes to parsing strings with base = 0. So we've switched to > >>>> g_ascii_strtoll() which promised to solve this. And it did. > >>>> Except, it introduced another problem. I don't really understand > >>>> why, but I've seen random test failures and all of them claimed > >>>> inability to parse a number (specifically <memory/> from the > >>>> hard coded domain XML from test driver, which IMO is just a > >>>> coincidence because it's the first number we parse). > >>> > >>> What platforms are you seeing the failures on ? All platforms or just > >>> on Windows, or just some other specific one? > >> > >> I've tried only Linux so far. Windows is out of question. > > > > snip > > > >>> On the Linux case get_C_locale is > >>> > >>> static locale_t > >>> get_C_locale (void) > >>> { > >>> static gsize initialized = FALSE; > >>> static locale_t C_locale = NULL; > >>> > >>> if (g_once_init_enter (&initialized)) > >>> { > >>> C_locale = newlocale (LC_ALL_MASK, "C", NULL); > >>> g_once_init_leave (&initialized, TRUE); > >>> } > >>> > >>> return C_locale; > >>> } > >>> > >>> and only way that could fail is if newlocale isn't threadsafe, and > >>> we have other code in libvirt calling newlocale at exact same time > >>> as the *first* call to get_C_locale. > >> > >> Yeah, that's why I don't understand what's going on. Anyway, try running > >> tests repeatedly (you can use oneliner from the commit message) and > >> you'll run into the problem. > > > > Yes, I've hit the problem with the error message > > > > error: XML error: Invalid value '8388608' for element or attribute './memory[1]' > > > > interestingly, I've also hit a few other error messages, some failures > > without error message (I suspect virsh simply crashed with no output), > > and most intriguing is that it 'virshtest' simply hangs when I run it > > directly. > > <snip/> > > > > > So the cause of the hang is exceedingly obvious. > > > > Very few glib APIs are safe to uses in between fork & exec, and we are > > (yet again) being burnt. > > Yep, and switching to plain strtol() fixes this hang. But it does not > solve ... > > > > > What I can't explain, however, is why we sometimes get an error message > > instead of a hang. > > .. this. I bet it has something to do with fork() + exec() because when > I set up logging and run those tests I can see which one is failing > (with that "unable to parse NNN" message), but when I run it manually > with the exact arguments I don't see any hiccups. Ok, I've found the root cause. g_mutex_lock() sometimes splatters errno on contended locks: https://gitlab.gnome.org/GNOME/glib/-/issues/3034 most of the time this won't matter, as the APIs have a return value that identifies failure, and only then does the caller look at errno. The strtoll/g_ascii_strtoll functions are special though because we have to directly inspect errno to identify failure. The splattering of errno breaks this :-( IOW, as of today, the g_ascii_strtoll functions are all broken and unusable on Linux when GLib is built to natively use futex() instead of pthread_mutex_t. We can avoid this, however, by making virInitialize invoke g_ascii_strtoll() on a dummy string. This triggers the glib one-time initializer, and thereafter we shouldn't hit the locking bug because g_once_init_enter will go via a fast-path that avoids the futex syscall. > > Third, we should move virExec and virFork and related helpers into a > > standalone file, so that it is very clear which bits of code are running > > in between fork+exec and thus need careful design to avoid anything > > except async safe POSIX. > > I've tried to do this, but it's bigger task than I thought. Plenty of > dependencies across the file. No problem, not an urgent thing. Just something I think we ought to do at somepoint With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|