This proposal was originally written by Kyle Lippincott, but he's on
vacation for the next two weeks, so I'm helping start this discussion
for him (from here on out, Kyle is the "I").

TL;DR: I'm proposing that when creating a library for code from the Git
codebase, we have two interfaces to this library: the "internal" one
that the rest of the Git codebase uses, and the "external" one for use
by other projects. The external interface will have a different coding
style and platform support than the rest of the codebase.

When thinking about potential issues and complications with
libification, I encountered a few broad categories of issues, and I'd
like to list them briefly (edit: turns out I can't be brief to save my
life) and float a proposal that may help minimize them.

Definitions
-----------

- When I say "Git" or "the git executable/binary" or whatever, I'm
  referring to "the collection of binaries, tests, etc. that are part
  of the main git repo" unless I say otherwise.
- Similarly, when I say "internal" I mean "for use by <that collection
  of programs>". When I say "external" I mean for use by stuff that's
  not part of the Git repository.

Assumptions
-----------

- Libraries that we're providing can be either statically or
  dynamically linked. Git will link statically to its own Git
  libraries. External projects may use either.
- Git must continue to be compilable and usable on all platforms it's
  currently supported on. Libification can't take that away. However,
  since libification is producing new interfaces for new use cases,
  there is no requirement that we make these new interfaces usable on
  all platforms, especially at first.
- We'd like as little churn and "uglification" of the main codebase as
  possible.

Issues
------

- Symbol name collisions: Since C doesn't have namespacing or other
  official name mangling mechanisms, all of the symbols inside of the
  library that aren't static are going to be at risk of colliding with
  symbols in the external project. This is especially a problem for
  common symbols like "error()".
- Header files: This is actually several related problems:
  - Git codebase's header files assume that anything that's brought in
    via <git-compat-util.h> is available; this includes system header
    files, but also macro definitions, including ones that change how
    various headers behave. Example: _GNU_SOURCE and
    _FILE_OFFSET_BITS=64 cause headers like <unistd.h> to change
    behavior; _GNU_SOURCE makes it provide different/additional
    functionality, and _FILE_OFFSET_BITS=64 makes types like `off_t`
    be 64-bit (on some platforms it might be 32-bit without this
    define).
  - <git-compat-util.h> is expected to be included as the first header
    file in the translation unit, so as to make _GNU_SOURCE and similar
    #defines have the desired effect. If a translation unit (in an
    external library consumer) has already included <unistd.h>, we
    can't rely on them having had _GNU_SOURCE defined ahead of time.
    - We can't just `#include <git-compat-util.h>` at the top of our
      external interface headers.
  - Git's header files make regular use of inlining. We can't assume
    that external projects are going to use static linking, and we
    can't assume that external projects are going to use a
    C-compatible language (they might not use our header files at
    all), so inline functions seem risky at the interface layer.
- Compatibility: Using code from the git codebase as a library is a new
  use case, so we do not have the backwards compatibility requirements
  that we do for Git itself. We should take full advantage of this, and
  explicitly state what compatibility guarantees we are providing (or
  not providing).

Proposal
--------

Let's have a distinction between the "internal" interface (used by
Git), and the "external" interface (used by everyone else). The
"external" interface has several differences from the rest of the git
codebase:

- Minimal. Only include symbols and types that we explicitly want to be
  part of the interface.
  - This is both for API evolution abilities and providing a "well-lit
    path" to usage. Internal header files may have a lot of similar but
    slightly different functions that can be very confusing, or are
    highly specialized.
  - Most languages will not be able to include our headers. Reducing
    the interface to the minimal necessary means it's easier to
    identify when the interfaces change and update the
    non-C-compatible-language bindings.
  - The external interface should have as little code/new functionality
    as possible. All actual functionality should be in the internal
    interface(s).
- No inline functions. This is similar to minimal. We should put as
  little as possible in the header files, especially since many use
  cases involve using the library from a language that can't even
  #include them at all.
- Self-contained. The header files must work if they are the first/only
  #include in the external project. They must include everything they
  need, and not assume it was already handled for them.
- Tolerant. The header files probably won't be the first/only #include
  in the external project's translation unit, and they should still
  work. This means not using types like `off_t` or `struct stat` in the
  interfaces provided, since their sizes are dependent on the execution
  environment (what's been included, #defines, CFLAGS, etc.).
- Non-interfering. Our header files must not change fundamental things
  about the execution environment. This means they must not do things
  like `#define _GNU_SOURCE` or `#define _FILE_OFFSET_BITS 64`.
- Limited Platform Compatibility. The external interfaces may assume
  that <stdint.h> and other C99 (or maybe even C11+) functionality
  exists and use it immediately, without weather balloons or #ifdefs.
  If some platform requires special handling, that platform isn't
  supported, at least initially.
- Non-colliding. Symbol names in these external interface headers
  should have a standard, obvious prefix as a manual namespacing
  solution. Something like `gitlib_`. (Prior art: cURL uses a `curl_`
  prefix for externally-visible symbols, libgit2 uses
  `git_<module_name>_foo` names; for internal symbols, they use `Curl_`
  and `git_<module>__foo` (with a double underscore), respectively.)
- Translating. The external interface provides "external" symbol names,
  and potentially more compatible function interfaces than the internal
  interface does, and exists to translate from one domain to another.
  Most functions in the external interface will be just a single call
  to the internal interface. Examples:
  - Internal interface is `void foo();`; external interface would be
    `void gitlib_foo() { foo(); }`
  - Internal interface is `void foo(off_t val);`; external interface
    could be `void gitlib_foo(int64_t val) { foo(val); }` -- here we
    accept int64_t instead of off_t due to the issues around the size
    of off_t
  - Internal interface is `void foo(struct strbuf *s);`; the external
    interface might be

        void gitlib_foo(const char *s, size_t s_len)
        {
                struct strbuf sb;
                strbuf_init(&sb, s_len + 1);
                strbuf_add(&sb, s, s_len);
                foo(&sb);
                strbuf_release(&sb);
        }

    -- since strbufs own the memory they hold, strings that come via
    the external interface might need to be copied to be memory safe.
  - Internal interface was `void foo();` but gained a new parameter. We
    don't need to expose this parameter in the external interface, and
    instead can just use a sensible default. External interface can
    remain `void gitlib_foo() { foo(NULL); }`

Proof of Concept
----------------

I think we should continue with the git-std-lib work as a manual
separation of the .c files and associated header files that comprise
the very lowest level of functionality in the git codebase. This manual
separation would only produce a library with an "internal" interface.

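To make the Translating examples from the Proposal a bit more concrete,
here is a minimal single-file sketch of an internal function and its
external wrappers. Every name in it (internal_set_offset(),
internal_greeting(), and the gitlib_* wrappers) is hypothetical and made
up for this example; nothing here exists in the Git tree. The range
check in gitlib_set_offset() is my own addition on top of the one-line
wrapper shown in the Proposal, and it assumes a POSIX <sys/types.h>.

```c
/* Sketch only: all function names below are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>  /* off_t: used on the internal side only */

/* ---- internal side: free to use off_t, short names, etc. ---- */
static off_t current_offset;

static void internal_set_offset(off_t val)
{
    current_offset = val;
}

/* Gained a parameter at some point; NULL means "use the default". */
static const char *internal_greeting(const char *override)
{
    return override ? override : "hello";
}

/* ---- external side: fixed-width types, gitlib_ prefix, no inlines ---- */

/* Returns 0 on success, -1 if val does not fit in this build's off_t. */
int gitlib_set_offset(int64_t val)
{
    off_t narrowed = (off_t)val;

    if ((int64_t)narrowed != val)
        return -1;
    internal_set_offset(narrowed);
    return 0;
}

int64_t gitlib_get_offset(void)
{
    return (int64_t)current_offset;
}

/* The new internal parameter is not exposed; a default is passed. */
const char *gitlib_greeting(void)
{
    return internal_greeting(NULL);
}
```

A caller only ever sees int64_t and the gitlib_ names, so the size of
off_t in the caller's translation unit no longer matters.
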
We should also start to apply these ideas by defining an "external"
interface which has a subset of the functionality in git-std-lib.

Automatic symbol hiding
-----------------------

One of the main driving forces behind my proposal above is avoiding
significant churn in the git codebase, for example needing to rename
every function in the codebase that's not static. While many function
names are unlikely to collide, such as `parse_oid_hex`, others are
significantly more likely, like `error` or `hex_to_bytes`. Needing to
rename all "plausible" collisions to things that are unlikely to
collide, like `GIT_error` or `GIT_hex_to_bytes`, is tedious, error
prone, and unpleasant.

I have possibly discovered a truly remarkable solution, but this
footnote is too small to contain it. Wait, no it's not. This isn't
fully tested yet, but has shown promise in my initial tests using clang
on a Linux machine.

- Compile the "internal" interface(s) and all supporting code with
  `-fvisibility=hidden` to produce .o files for each .c file
- Compile the "external" interface(s) without hiding the symbols
- Produce a .a file that contains that code, for use by git itself
- "Partially link" everything, using `ld -r`, to produce a single .o
  file
- Use `objcopy --localize-hidden` to actually hide the internal symbols
  from the "partially linked" .o file

This should leave us with two static libraries: one that has the
symbols marked as "hidden" but still usable, for use in git itself, and
one that contains the external interface, but doesn't expose the hidden
symbols.

There may be similar solutions possible on other platforms, or there
may not, and we may need to do the great renaming (either in the code
itself, or via something like a giant set of linker scripts).

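The build steps listed above can be sketched roughly as follows. This
is an illustration under assumptions, not a tested recipe: the file
names, the GITLIB_HIDDEN/GITLIB_EXPORT macros, gitlib_hex_to_bytes(),
and the simplified hex_to_bytes() body are all made up for the example.
It also swaps in per-symbol visibility attributes so that both halves
fit in one file; the proposal itself uses the `-fvisibility=hidden`
compiler flag on separate .c files instead.

```c
/*
 * Sketch only; names and file layout are hypothetical.
 *
 * With the two halves split into internal.c and external.c, the steps
 * from the list above would look something like:
 *
 *   cc -fvisibility=hidden -c internal.c -o internal.o
 *   cc -c external.c -o external.o
 *   ld -r internal.o external.o -o combined.o
 *   objcopy --localize-hidden combined.o
 *
 * Afterwards, `nm combined.o` should list hex_to_bytes as a local
 * symbol ('t') and gitlib_hex_to_bytes as a global one ('T').
 */
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define GITLIB_HIDDEN __attribute__((visibility("hidden")))
#define GITLIB_EXPORT __attribute__((visibility("default")))
#else
#define GITLIB_HIDDEN
#define GITLIB_EXPORT
#endif

/* ---- "internal.c": the attribute stands in for -fvisibility=hidden ---- */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode `len` pairs of hex digits; 0 on success, -1 on bad input. */
GITLIB_HIDDEN int hex_to_bytes(unsigned char *binary, const char *hex,
                               size_t len)
{
    for (; len; len--, hex += 2) {
        int hi = hex_digit(hex[0]);
        int lo = hi < 0 ? -1 : hex_digit(hex[1]);

        if (lo < 0)
            return -1;
        *binary++ = (unsigned char)((hi << 4) | lo);
    }
    return 0;
}

/* ---- "external.c": default visibility, prefixed name ---- */
GITLIB_EXPORT int gitlib_hex_to_bytes(unsigned char *out, const char *hex,
                                      size_t n_bytes)
{
    return hex_to_bytes(out, hex, n_bytes);
}
```

The point of the exercise is that an external project can define its
own `hex_to_bytes` without clashing, because only the gitlib_ name
remains visible after the objcopy step.
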
While the internal/external split is a requirement for making this
automatic symbol hiding work, the reverse does not hold: if automatic
symbol hiding turns out not to work, that doesn't mean we shouldn't
have the internal/external split. Symbol hiding is only one
contributing point in favor of the split.