On Thu, Sep 08, 2016 at 09:29:58AM +0200, Johannes Schindelin wrote:

> sorry for the late answer, I was really busy trying to come up with a new
> and improved version of the patch series, and while hunting a bug I
> introduced got bogged down with other tasks.

No problem. I am not in a hurry.

> > I always assumed the _point_ of re_search taking a ptr/len pair was
> > exactly to handle this case. The documentation[1] says:
> >
> >   `string` is the string you want to match; it can contain newline and
> >   null characters. `size` is the length of that string.
> >
> > Which seems pretty definitive to me (that's for re_match(), but
> > re_search() is defined in the docs in terms of re_match()).
>
> Right. The problem is: I *really* want to avoid using GNU-isms.

I don't think GNU-isms are a problem if we wrap them to give a nice
interface, and if we rely on having compat/regex. But if you mean "I do
not want to rely on using compat/regex everywhere", then OK. I can see
arguments both for and against using a consistent regex library, but I
do not care that much either way myself.

> > We can contain this to the existing compat/regexec/regexec.c, and just
> > provide a wrapper that is similar to regexec but takes a ptr/len pair.
>
> But we can do even better than that: we can provide a wrapper that uses
> REG_STARTEND where available (which is really the majority of platforms we
> care about: Linux, MacOSX, Windows, and even the *BSDs). Where it is not
> available, we simply malloc(), memcpy() and append a NUL.

Doesn't that make things much _worse_ for people on systems without
REG_STARTEND? If we imagine that most regexec calls would operate on a
NUL-terminated buffer, then they are now paying the extra malloc and
copy for each call to regexec_buf(), even if the buffer was already
NUL-terminated (because they have no idea whether it was or not).

I think I'd rather just have:

  #ifndef REG_STARTEND
  #error "Your regex library sucks. Compile with NO_REGEX=NeedsStartEnd"
  #endif

(or you could just use REG_STARTEND and let the compiler complain, but
then the user has to figure out the right knob to twiddle).

One other question about REG_STARTEND is: what does it do with NULs
inside the buffer? Certainly glibc (and our compat/regex) treat it as a
buffer with a particular length and ignore embedded NULs, as we want.
But the NetBSD documentation says only:

  REG_STARTEND
      The string is considered to start at string + pmatch[0].rm_so
      and to have a terminating NUL located at string + pmatch[0].rm_eo
      (there need not actually be a NUL at that location),

Besides avoiding a segfault, one of the benefits of regcomp_buf() is
that we will now find pickaxe-regex strings inside mixed binary/text
files. But it's not clear to me that NetBSD's implementation does this.

I guess we can assume it is fine (it is certainly no _worse_ than the
current behavior), and if people's platforms do not handle it, they can
build with NO_REGEX.

-Peff
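
For illustration, here is a minimal sketch of the kind of wrapper discussed above, assuming a regex library that defines REG_STARTEND. The name regexec_buf() comes from this thread, but the signature and the nmatch check are assumptions made for the example, not necessarily what the patch series ends up with. Whether NULs embedded in the buffer are searched past is left to the underlying implementation, as noted above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#ifndef REG_STARTEND
#error "Your regex library sucks. Compile with NO_REGEX=NeedsStartEnd"
#endif

/*
 * Sketch only: match `preg` against the `size` bytes starting at `buf`,
 * which need not be NUL-terminated. The signature is hypothetical.
 *
 * REG_STARTEND reads the search range from pmatch[0] on input, so the
 * caller must supply at least one regmatch_t even if it does not care
 * about the match offsets.
 */
static int regexec_buf(const regex_t *preg, const char *buf, size_t size,
		       size_t nmatch, regmatch_t pmatch[], int eflags)
{
	assert(nmatch > 0 && pmatch != NULL);
	pmatch[0].rm_so = 0;               /* start of the range to search */
	pmatch[0].rm_eo = (regoff_t)size;  /* one past the last byte to search */
	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
}
```

A caller would compile the pattern with regcomp() as usual and pass the buffer and its length instead of relying on a terminating NUL; there is no malloc() or copy on any platform, at the cost of refusing to build where REG_STARTEND is missing.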