Hi, everyone This is my proposal for GSoC with the subject "Make pack access code thread-safe". I'm late in schedule but I would like to ask for your comments on it. Any feedback will be highly appreciated. The "rendered" version can be seen here: https://docs.google.com/document/d/1QXT3iiI5zjwusplcZNf6IbYc04-9diziVKdOGkTHeIU/edit?usp=sharing I kindly ask you to read the text at the google docs link, because in the conversion to plain text I noticed it discards some information :( But for those who prefer to comment by email, here it is: Thanks, Matheus Tavares ======================================= Making pack access code thread-safe April, 2019 #Contact Info Name Matheus Tavares Bernardino Timezone GMT-3 Email matheus.bernardino@xxxxxx IRC Nick matheustavares on #git-devel Telefone [...] Postal address [...] Github https://github.com/MatheusBernardino/ Gitlab https://gitlab.com/MatheusTavares # About me I’m a senior student at the University of São Paulo (USP), attending the Bachelor’s degree in Computer Science course. Currently, I’m at the end of a one year undergraduate research in High-Performance Computing. The goal of this project was to accelerate astrophysical software for black hole studies using GPUs. Also, I’m working as a teaching assistant on IME-USP’s Concurrent and Parallel Programming course, giving lectures and developing/grading programming assignments. Besides parallel and high-performance computing I’m very passionate about software development in general, but especially low-level coding, and FLOSS. # About me and FLOSS ## Linux Kernel Last year, I started contributing to the Linux Kernel in the IIO subsystem, together with a group of colleagues. I worked with another student, to move the ad2s90 module out of staging area to Kernel’s mainline, which we accomplished by the end of the year. In total, I authored 11 patches and co-authored 3 (all of which are already at Torvald’s repo). If you want to know more about my contributions to Linux Kernel, take a look at the Appendix section. ## FLUSP: FLOSS at USP After the amazing experience contributing to the Linux Kernel, we decided to found FLUSP: FLOSS at USP, a group opened to undergraduate and graduate students that aims to contribute to FLOSS software. Since then, the group has grown and evolved a lot: Currently, we have members contributing to the Kernel, GCC, IGT GPU Tools, Git and some projects of our own such as KernelWorkflow. And as a recognition of our endeavor with free software, we received some donations from AnalogDevices and DigitalOcean. Besides administrative questions and contributions to FLOSS projects, at FLUSP, I’ve been mentoring people who want to start contributing to the Linux Kernel and now, to Git, as well. # About me and Git I joined Git community in February and, so far, I have sent the following patches: clone: test for our behavior on odd objects/* content clone: better handle symlinked files at .git/objects/ dir-iterator: add flags parameter to dir_iterator_begin clone: copy hidden paths at local clone clone: extract function from copy_or_link_directory clone: use dir-iterator to avoid explicit dir traversal clone: Replace strcmp by fspathcmp And three more patches for git.github.io: rn-50: Add git-send-email links to light readings SoC-2019-Microprojects: Remove git-credential-cache SoC-2019-Microprojects: Remove all trailing spaces Participating at FLUSP, I’ve also been part of some Git related activities: * I actively helped to organize a Git workshop for newcomer students. * I’ve written an article at our website to help people configure and use git-send-email to send patches. * I’ve been writing a ‘First steps at Git’ article (not finished yet), in which I’m registering what I’ve learned in the Git community so far, since downloading the source, subscribing to the mailings list and joining the channel at IRC until how to use travis-ci and begin sending patches. # The Project As direct as possible, the goal with this project is to make more of Git’s codebase thread-safe, so that we can improve parallelism in various commands. The motivation behind this are the complaints from developers experiencing slow Git commands when working with large repositories[1], such as chromium and Android. And since nowadays, most personal computers have multi-core CPUs, it is a natural step trying to improve parallel support so that we can better use the available resources. With this in mind, pack access code is a good target for improvement, since it’s used by many Git commands (e.g., checkout, grep, blame, diff, log, etc.). This section of the codebase is still sequential and has many global states, which should be protected before we can work to improve parallelism. ## The Pack Access Code To better describe what the pack access code is, we must talk about Git’s object storing (in a simplified way): Besides what are called loose objects, Git has a very optimized mechanism to compactly store objects (blobs, trees, commits, etc.) in packfiles[2]. These files are created by[3]: 1. listing objects; 2. sorting the list with some good heuristics; 3. traversing the list with a sliding window to find similar objects in the window, in order to do delta decomposing; 4. compress the objects with zlib and write them to the packfile. What we are calling pack access code in this document, is the set of functions responsible for retrieving the objects stored at the packfiles. This process consists, roughly speaking, in three parts: 1. Locate and read the blob from packfile, using the index file; 2. If the blob is a delta, locate and read the base object to apply the delta on top of it; 3. Once the full content is read, decompress it (using zlib inflate). Note: There is a delta cache for the second step so that if another delta depends on the same base object, it is already in memory. This cache is global; also, the sliding windows, are global per packfile. If these steps were thread-safe, the ability to perform the delta reconstruction (together with the delta cache lookup) and zlib inflation in parallel could bring a good speedup. At git-blame, for example, 24%[4] of the time is spent in the call stack originated at read_object_file_extended. Not only this but once we have this big section of the codebase thread-safe, we can work to parallelize even more work at higher levels of the call stack. Therefore, with this project, we aim to make room for many future optimizations in many Git commands. # Plan I will probably be working mainly with packfile.c, sha1-file.c, object-store.h, object.c and pack.h, however, I may also need to tackle other files. I will be focusing on the following three pack access call chains, found in git-grep and/or git-blame: read_object_file → repo_read_object_file → read_object_file_extended → read_object → oid_object_info_extended → find_pack_entry → fill_pack_entry → find_pack_entry_one → bsearch_pack and nth_packed_object_offset oid_object_info → oid_object_info_extended → <same as previous> read_object_with_reference → read_object_file → <same as previous> Ideally, at the end of the project, it will be possible to call read_object_file, oid_object_info and read_object_with_reference with thread-safety, so that these operations can be, latter, performed in parallel. Here are some threads on Git’s mailing list where I started discussing my project: * https://public-inbox.org/git/CAHd-oW7onvn4ugEjXzAX_OSVEfCboH3-FnGR00dU8iaoc+b8=Q@xxxxxxxxxxxxxx/ * https://public-inbox.org/git/20190402005245.4983-1-matheus.bernardino@xxxxxx/#t And also, a previous attempt to make part of the pack access code thread-safe which I may use as a base: * https://public-inbox.org/git/20140212015727.1D63A403D3@xxxxxxxxxxxxxxxxxxxxxxxxx/#Z30builtin:gc.c # Points to work on * Investigate pack access call chains and look for non-thread-safe operations on then. * Protect packfile.c read-and-write global variables, such as pack_open_windows, pack_open_fds and etc., using mutexes. * Just like the previous item, protect sha1-file.c global states such as the object cache used by read_object_file(). (The object cache may be thread-local thought. This should still be studied.) * Investigate the delta cache, sliding pack window, and maybe other states that should be protected as well. * Use GDB or GPROF to follow call chains inside pack access looking for functions with static variables in their scopes. These variables are thread-shared and should be protected or the functions to which they belong be refactored. * Make sure tests cover functions I’ll be working on and refactor/add tests as needed * [Bonus] Once pack access is thread-safe, refactor the critical sections at git-grep to use more fine-grained mutexes. This will hopefully increase git-grep performance, especially in large repositories. * [Bonus] Check other mutex protected functions git-grep uses, not related with pack access, to see if we can implement a more fined-grained parallelism there. This functions are: fill_textconv, is_submodule_active, repo_submodule_init, repo_read_gitmodules and add_to_alternates_memory. * [Bonus] Once pack access is thread-safe, ensure xdiff code used by git-blame has thread-safety. I expect this to be easier. * [Bonus] If the previous bonus get completed, start discussing a possible parallel git-blame implementation with the community. We could work a producer-consumer mechanism at blame.c’s assign_blame() function, for a very good work sharing assignment (90% of git-blame’s time is spent here[5]). Or try threading at lower functions on the call stack that still uses a lot of execution time such as the libxdiff ones. # Schedule This is the planned schedule in which I should be working on. But I would like to highlight that since there’s still a significant investigation period from now until early May, this can have some changes or additions during the process. Timeline: Investigation Time (Now - May, 5) * Gather information of global states. * Trace pack access call chain used by git commands like blame and checkout. * Try to classify which global variables are updated during pack access call stack and, therefore, should be protected. * Adjust the schedule as needed. Community Bounding and work on sha1-file.c global states (May, 6 - 26[6]) * Talk with the community about my then refined plan and ask for comments. * Protect object cache at sha1-file.c (or make it thread-local). * Work on other sha1-file.c global states and non-thread-safe functions. Work on packfile.c global states (not including delta cache) (May, 27 - June, 23) * Conclude any unfinished work on sha1-file.c from the previous period. * Protect packfile.c variables (pack_open_windows, pack_open_fds and etc.) and work on its non-protected functions. Work on delta cache and other global states (June, 24 - July, 21) * Protect delta base cache operations (here we should study whether to add mutexes to the cache itself or to the underlying hashmap). * Protect oid_* functions. * Work on sliding window (this is the section I have less knowledge yet, so should be studied) Work on bonus and leftovers (July, 22 - August 19) * This time will be reserved to finish any leftovers from the other periods and, if we still have some spare time, work on the bonus items. * Note: I also plan to attend DebConf from july 21th to 28th # Availability My university vacations start on June 29, but since this is my last year and I’m attending just two courses plus the teaching assistance, I think it won’t be a problem. Also, the classes start back in early August, but I won’t have any more courses to attend in this next semester. I don’t have any schedule trips besides DebConf, from July 21th to 28th (let me know if any other Git community members plan to attend too, please). All changes in availability will be communicated to the mentors in advance. # Project Relevance and after GSoC plans As already pointed out, this project will make it feasible to improve (or add) parallelism in many Git commands. And that’s what I plan to do after (or even during) GSoC, mainly with git-blame and git-grep. I’m also trying to form a local community at FLUSP to keep contributing to Git. Appendix Patches at Linux Kernel staging: iio: ad2s1210: fix 'assignment operator' style checks staging:iio:ad2s90: Make read_raw return spi_read's error code staging:iio:ad2s90: Make probe handle spi_setup failure staging:iio:ad2s90: Remove always overwritten assignment staging:iio:ad2s90: Move device registration to the end of probe staging:iio:ad2s90: Add IIO_CHAN_INFO_SCALE to channel spec and read_raw staging:iio:ad2s90: Check channel type at read_raw staging:iio:ad2s90: Add device tree support staging:iio:ad2s90: Remove spi setup that should be done via dt staging:iio:ad2s90: Add max frequency check at probe dt-bindings:iio:resolver: Add docs for ad2s90 staging:iio:ad2s90: Replace license text w/ SPDX identifier staging:iio:ad2s90: Add comment to device state mutex staging:iio:ad2s90: Move out of staging ________________ [1] Some of them can be seen here: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/oYe69KzyG_U https://bugs.chromium.org/p/git/issues/detail?id=18 https://bugs.chromium.org/p/git/issues/detail?id=16 https://code.fb.com/core-data/scaling-mercurial-at-facebook/ https://public-inbox.org/git/CA+TurHgyUK5sfCKrK+3xY8AeOg0t66vEvFxX=JiA9wXww7eZXQ@xxxxxxxxxxxxxx/ https://public-inbox.org/git/20140213014229.GE4582@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ https://public-inbox.org/git/CACBZZX6A+35wGBYAYj7E9d4XwLby21TLbTh-zRX+fkSt_e2zeg@xxxxxxxxxxxxxx/ [2] https://git-scm.com/book/en/v2/Git-Internals-Packfiles [3] https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt [4] https://i.imgur.com/XmyJMuE.png [5] https://i.imgur.com/XmyJMuE.png [6] GSoC’s official start date