Hi Shubham, On Sun, Apr 10, 2022 at 12:13:50AM +0530, Shubham Mishra wrote: > Initial draft proposal for reachability bitmap Thanks so much for your interest in this project and the time you've spent putting together this proposal. Sorry that it has taken me a few days to get back to you. Let's take a look... (I should mention that this shouldn't be reviewed to be picked up by Junio, so it would be fine to just send the contents of "proposal.txt" as an email to the list directly instead of a patch. But it's fine to review it either way, just stating this in case other readers aren't sure). > About Me > > I am Shubham, I am currently working as a Software Engineer at > Microsoft India. I am a 2021 graduate from Delhi University. I am > passionate about core engineering and backend technologies. I love to > see beyond all abstractions and how things really work under the hood. > So, I can work from their roots and make things better. I feel > engineering is all about the tradeoffs that we make and I am trying to > learn them to become a better Engineer. > > I am passionate about open source technologies and have quite a good > amount of contribution to them, I participated in GSoC 2020 with > [KDE](https://summerofcode.withgoogle.com/archive/2020/projects/6473982317953024), > did Internship with Linux Foundation - > [HDV](https://wiki.lfnetworking.org/display/LN/HDV), Season of KDE - > [report](https://community.kde.org/SoK/2020/StatusReport/Shubham), and > I am doing voluntary contributions to VSperf, BoostC++ and some other > open-source projects. Great, that all sounds like terrific experience! Typically GSoC only allows accepted contributors to participate once, but an exception has been made for participants in the 2020 and 2021 cycles, so you would be eligible here under those rules. https://developers.google.com/open-source/gsoc/faq#what_are_the_eligibility_requirements_for_participation > Project Abstract > > During repository clones, the Git server needs to find out all the > objects which clients do not have and need to be sent to the client. > > To make the process faster, Git uses bitmaps to quickly find all the > related objects from an object. Bitmap approach is a performance > optimization over the legacy "Counting Objects" - the process in which > the git server used to iterate through the graph from branch tips to > the beginning of history to list down all objects that need to be > sent. > > bitmap made reachability faster but uncompressed bitmaps can cost a > lot of extra storage. Git uses a C ported version of > [EWAHBoolArray](https://github.com/lemire/EWAHBoolArray) to compress > bitmaps which get stored in the ".bitmap" file with the same prefix > "sha" as ".pack" and ".idx". Right; there are more than 8.5M objects in my (very outdated copy of linux.git), so it would be impractical to store an 8.5M-bit array for each selected commit. > The aim of the project is to design a performance test suite as well > as do the necessary changes to improve bitmap performance by trying > out a new compression scheme that can make read operations along with > other common operations like intersect, union and negate faster. Well stated. Like I mentioned in another similar proposal, I think validating our assumption that the performance of bitmap decompression is worth pursuing is a good first step. We may find that using a different compression scheme: - makes it faster to read bitmaps (by taking less time to decompress them) - makes it faster to write them (by taking less time to compress the bit-arrays) - use less memory for one, the other, or both of the above , alternatively, we may find that switching EWAH for something else doesn't make a meaningful dent in any of those areas, in which case we can find something else to work on, too ;-). > Me & Git: > > ## Microproject: > > I worked on the microproject "Avoid pipes in git related commands in > test scripts", the patches for it has been merged to master now > > * https://public-inbox.org/git/20220224054720.23996-3-shivam828787@xxxxxxxxx/ > * https://public-inbox.org/git/20220224054720.23996-3-shivam828787@xxxxxxxxx/ > > I run a pattern matching grep to find all git commands on LHS of pipes > and fix all of them from file t001-t050. > > As an outcome of this process, I got to learn the code review process > at git work, which is quite cool and different from other > organization's I contributed to before. > > I learned about building source code, running tests, using email to > send patches, communicating with reviewers and sending the next patch > version process. Terrific; I was glad to see that you were able to work on modernizing those test scripts by removing git invocations on the left-hand side of a pipe. Hopefully that gave you some good hands-on experience with our development process, which it sounds like it did. > ## Current understanding: > > - I have gone through git internals, and I well understood about the > pack files as well as the difference between git objects (tree, blob, > commit). > > - I have gone through some documentations - "MyFirstObjectWalk", etc. > it was a good hands-on to get some glimpse of general object related > tasks. > > - I understand how bitmap works in general, I have got some idea how > EWAH compression works and also I have gone through the research paper > on roaring run. > > - I played with commands of pack-object - "git pack-objects dir > --progress < obj_lists.txt" and read the code of related files > "pack-bitmap.c" and parts of "pack-object.c" Great; if you are looking for more areas to dig in, I'd recommend reading: - pack-bitmap-writer.c - midx.c (the bitmap-related parts) - and spending extra time with pack-objects.c (as you mention you already have, but this is the best place to look for how pack-objects uses bitmaps to accelerate its work). There are other parts of the code that use bitmaps, too, for some more straightforward tasks. See for e.g., the parts of builtin/rev-list.c that match a search for "bitmap". > Execution plan: > > I am interested in keeping my primary focus on "building a performance > suite and improving bitmaps performance by finding a better > compression scheme" project and if I finish this early or even after > the GSoC timeline, I will be happy to contribute to other tasks too. I think that's a good strategy. Like I mentioned in the other proposal, too, the goal behind having a few different sub-projects to pick from is that it gives you some additional flexibility to choose what is most interesting to you. It definitely isn't meant to suggest that we should do all of the sub-projects within, since I think any one of them done right could easily fill up the entire GSoC contribution period. > [...] I do not have enough knowledge yet to figure out how > compatible croaring is with the current .bitmap format. We might need > to make changes in the current .bitmap format accordingly. Good summary in the parts I elided with the [...] in the quoted section above. I think that we will likely want to introduce a new bitmap format. We *could* use the existing format, and indicate to older readers that we are using a newer format by incrementing the version identifier and otherwise leaving the format unchanged. But I would like to get away from the existing .bitmap format, since the way we read the hash-cache extension makes it difficult to add things like a variable-width table of contents. So it would be nice to make the .bitmap format compatible with the new-ish chunk-format API that Stolee worked on a year or two ago. Of course, this will be a not-insignificant project in and of itself. It also carries some notable downside that JGit won't be able to read the newer format of bitmaps. We could consider expanding the GSoC project to help JGit add support for that, but I don't think it's strictly necessary for you to take that on, since JGit should gracefully ignore the unknown version number. > Steps I will be following to accomplish the task- > > 1\. Get a better understanding of bitmap related functionalities/ > codebase, EWAH, roaring+run techniques. > > 2\. try to build out an initial draft version implementing only > minimal required core changes, I will try to get a review on it from a > wider audience (including mentors). > > 3\. make changes according to the comment and repeat the review > process. > > 4\. build a performance suite for the changes I have done in the above > steps. It may be worth making this occur after the first step in your proposal, since it removes bias from when you design the performance tests and keeps us honest about whether or not our changes are worth making. (Please don't worry about the possibility of replacing EWAH compression not helping bitmap performance; if it doesn't, there is no shortage of other things that we can work on in this area ;-)). > 6\. Until this time, I will also get a good understanding of the > bitmap related projects, so if we will be able to make good progress > on roaring+run. I can start picking other subprojects too like 'table > of contents' for the .bitmap file where past work - > [https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/](https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/) > can act as a good reference to me or/and 'append-only bitmap > generation' subproject. I think it makes sense to pick one or the other, since both will be fairly involved. Like I said earlier, I think we'll have to investigate a new .bitmap format for the "table of contents" project, so we may want to dedicate some time to sketching out what that would look like. > Timeline > > I am available to dedicate around 30-35 hours every week to the project. That seems reasonable to me. > Community bonding periods - > > 1. Exploring code base mostly related to bitmaps > 2. research on other bitmap compression techniques > 3. reading technical documents > 4. interacting with mentors to understand them and project in more detail. > > 12 June - 25 July: > > 1. Writing the first version with the new compression technique > 2. Get the initial version reviewed by reviewers and make changes accordingly. > 3. write performance test for these changes > > 25 July - 4 sept > > 1. if tests result well, extending the above functionality to completely move to a new technique. > 2. writing more performance tests. > 3. start picking up other tasks if time left > > Sept 5 - Sept 12 > > 1. I will make sure to get all changes merged before this week including tests > 2. if not, make a decision with mentior on extending the project This looks reasonable-ish to me, though I would suggest a couple of changes: - Dedicate some time to investigating whether or not a new .bitmap format is required up front, adjusting the rest of the schedule accordingly if it is. - Move the writing the performance tests up much earlier, since I think it makes sense to start there, rather than to wait until the end to start evaluating our changes (and it removes some degree of bias, like I mentioned earlier). I'm not too attached to the particular dates, TBH, and much more interested in the general pace and priority of things. > Blog > > I want to make blog writing a habit so I planned to publish biweekly > blogs at [https://shubham828.github.io/](https://shubham828.github.io/) > during this GSoC and after that. This is something I started during my > last GSoC too but unfortunately couldn't continue post-gsoc. This GSoC > gives me another opportunity to become a regular blogger. Wonderful, I'll look forward to reading your blog posts in this area if we end up working on it. > After GSoC > > I would love to be a regular contributor of Git. After GSoC, I can > pick any left out subprojects of bitmap reachability and I would also > be happy to extend my knowledge beyond bitmaps to learn and contribute > to other parts of git too. That's great! Thanks, Taylor