On Mon, Apr 8, 2019 at 6:26 AM Philip Oakley <philipoakley@xxxxxxx> wrote: > > On 08/04/2019 02:23, Duy Nguyen wrote: > > On Mon, Apr 8, 2019 at 5:52 AM Christian Couder > > <christian.couder@xxxxxxxxx> wrote: > >>> Git has a very optimized mechanism to compactly store > >>> objects (blobs, trees, commits, etc.) in packfiles[2]. These files are > >>> created by[3]: > >>> > >>> 1. listing objects; > >>> 2. sorting the list with some good heuristics; > >>> 3. traversing the list with a sliding window to find similar objects in > >>> the window, in order to do delta decomposing; > >>> 4. compress the objects with zlib and write them to the packfile. > >>> > >>> What we are calling pack access code in this document, is the set of > >>> functions responsible for retrieving the objects stored at the > >>> packfiles. This process consists, roughly speaking, in three parts: > >>> > >>> 1. Locate and read the blob from packfile, using the index file; > >>> 2. If the blob is a delta, locate and read the base object to apply the > >>> delta on top of it; > >>> 3. Once the full content is read, decompress it (using zlib inflate). > >>> > >>> Note: There is a delta cache for the second step so that if another > >>> delta depends on the same base object, it is already in memory. This > >>> cache is global; also, the sliding windows, are global per packfile. > >> Yeah, but the sliding windows are used only when creating pack files, > >> not when reading them, right? > > These windows are actually for reading. We used to just mmap the whole > > pack file in the early days but that was impossible for 4+ GB packs on > > 32-bit platforms, which was one of the reasons, I think, that sliding > > windows were added, to map just the parts we want to read. > > Another "32-bit problem" should also be expressly considered during the > GSoC work because of the MS Windows definition of uInt / long to be only > 32 bits, leading to much of the Git code failing on the Git for Windows > port and on the Git LFS (for Windows) for packs and files greater than > 4Gb. https://github.com/git-for-windows/git/issues/1063 Thanks for pointing it out. I didn't get it, thought, if your suggestion was to also propose tackling this issue in this GSoC project. Was it that? I read the link but it seems to be a kind of unrelated problem from what I'm planing to do with the pack access code (which is tread-safety). I may have understood this wrongly, though. Please, let me know if that's the case :) > Mainly it is just substitution of size_t for long, but there can be > unexpected coercions when mixed data types get coerced down to a local > 32-bit long. This is made worse by it being implementation defined, so > one needs to be explicit about some casts up to pointer/memsized types. > >>> # Points to work on > >>> > >>> * Investigate pack access call chains and look for non-thread-safe > >>> operations on then. > >>> * Protect packfile.c read-and-write global variables, such as > >>> pack_open_windows, pack_open_fds and etc., using mutexes. > >> Do you want to work on making both packfile reading and packfile > >> writing thread safe? Or just packfile reading? > > Packfile writing is probably already or pretty close to thread-safe > > (at least the main writing code path in git-pack-objects; the > > streaming blobs to a pack, i'm not so sure). > -- > Philip