On 6/19/23 17:02, Daniel P. Berrangé wrote: > On Mon, Jun 19, 2023 at 02:45:57PM +0200, Michal Prívozník wrote: >> On 6/19/23 14:31, Daniel P. Berrangé wrote: >>> On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote: >>>> It's weird that in 2023 there's no reliable and portable way to >>>> parse strings, but okay. >>>> >>>> We started with strtol(). Except, it doesn't work the same on >>>> Linux and Windows. On Windows it behaves a bit different when it >>>> comes to parsing strings with base = 0. So we've switched to >>>> g_ascii_strtoll() which promised to solve this. And it did. >>>> Except, it introduced another problem. I don't really understand >>>> why, but I've seen random test failures and all of them claimed >>>> inability to parse a number (specifically <memory/> from the >>>> hard coded domain XML from test driver, which IMO is just a >>>> coincidence because it's the first number we parse). >>> >>> What platforms are you seeing the failures on ? All platforms or just >>> on Windows, or just some other specific one? >> >> I've tried only Linux so far. Windows is out of question. > > snip > >>> On the Linux case get_C_locale is >>> >>> static locale_t >>> get_C_locale (void) >>> { >>> static gsize initialized = FALSE; >>> static locale_t C_locale = NULL; >>> >>> if (g_once_init_enter (&initialized)) >>> { >>> C_locale = newlocale (LC_ALL_MASK, "C", NULL); >>> g_once_init_leave (&initialized, TRUE); >>> } >>> >>> return C_locale; >>> } >>> >>> and only way that could fail is if newlocale isn't threadsafe, and >>> we have other code in libvirt calling newlocale at exact same time >>> as the *first* call to get_C_locale. >> >> Yeah, that's why I don't understand what's going on. Anyway, try running >> tests repeatedly (you can use oneliner from the commit message) and >> you'll run into the problem. > > Yes, I've hit the problem with the error message > > error: XML error: Invalid value '8388608' for element or attribute './memory[1]' > > interestingly, I've also hit a few other error messages, some failures > without error message (I suspect virsh simply crashed with no output), > and most intriguing is that it 'virshtest' simply hangs when I run it > directly. <snip/> > > So the cause of the hang is exceedingly obvious. > > Very few glib APIs are safe to uses in between fork & exec, and we are > (yet again) being burnt. Yep, and switching to plain strtol() fixes this hang. But it does not solve ... > > > What I can't explain, however, is why we sometimes get an error message > instead of a hang. .. this. I bet it has something to do with fork() + exec() because when I set up logging and run those tests I can see which one is failing (with that "unable to parse NNN" message), but when I run it manually with the exact arguments I don't see any hiccups. > > If I modify 'virGlobalInit' to call g_ascii_strtoll("0", NULL, 10); then > neither the hang or error messages occur, but the hang *should* still > occurr as the mutex locking race fundamentally stil exists. I think it > just changes the timing enough to avoid it in our case. > > My only thought with the error messages is that somehow the 'locale_t' > is getting its initialization missed somehow. > >> I've tried to debug the problem over weekend but was really >> unsuccessful. Firstly, I suspected that glib version of pthread_once() >> was broken. So I've rewritten get_C_locale() to use pthread_once() but >> that didn't help. Which tends to point into direction of glibc, >> supported by the fact that glibc impl of newlocale() does indeed have a >> global RW lock. But I haven't found anything obviously wrong there either. > > If glib was actually using pthread_once() we wouldn't have a problem. > Their g_once_init_enter/leave stuff doesn't use pthread_once() and their > impl is not safe due to its mutex usage. > > > Anyway, I understand why your proposed change here is avoiding problems, > even if I don't understand the full root cause. > > Ultimately I think our virExec() impl is verging on broken by design. We > have got too much functionality in it to be safe. Especially since we > switched from gnulib to glib, we're not so concious of what higher level > constructs we're accidentally relying on - eg no one would guess g_ascii_strtol > would acquire locks. > > Rather than change our virstring.c impl, I'm inclined to think we need to > be more aggressive and eliminate as much use of glib as possible from the > virExec() and virFork() functions. Limit them to pure POSIX only. > > Basically g_new/g_free are the only bits of glib I would entirely > trust in between fork/exec, without analyznig the code of other glib > APIs. > > Specifically for this virStrToLong_i problem, I would suggest that we > just directly call strtol from virCommandMassCloseGetFDsLinux(), and > leave the main virStrToLong_i impl unchanged. I've done this locally and it helped with deadlocks, but not with the number parsing problem. > > Second, we should modify the FreeBSD impl of virCommandMassClose so > that it works on Linux, when close_range() is available. That would > avoid calling the problem code in the first place for modern linux. Yeah, I remember I wanted to do this when the syscall was introduced to the Linux kernel, but for some reason didn't. BTW: glibc has closefrom() which is just a wrapper over close_range(N, ~0U, 0); so we might get the FreeBSD implementation for nothing. > > > Third, we should move virExec and virFork and related helpers into a > standalone file, so that it is very clear which bits of code are running > in between fork+exec and thus need careful design to avoid anything > except async safe POSIX. I've tried to do this, but it's bigger task than I thought. Plenty of dependencies across the file. Michal