On Feb 19, 2015 5:42 PM, David Turner <dturner@xxxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote:
> > > * 'git push'?
> >
> > This one is not affected by how deep your repo's history is, or how
> > wide your tree is, so should be quick..
> >
> > Ah the number of refs may affect both git-push and git-pull. I think
> > Stefan knows better than I in this area.
>
> I can tell you that this is a bit of a problem for us at Twitter. We
> have over 100k refs, which adds ~20MiB of downstream traffic to every
> push.
>
> I added a hack to improve this locally inside Twitter: The client sends
> a bloom filter of shas that it believes that the server knows about; the
> server sends only the sha of master and any refs that are not in the
> bloom filter. The client uses its local version of the servers' refs
> as if they had just been sent. This means that some packs will be
> suboptimal, due to false positives in the bloom filter leading some new
> refs to not be sent. Also, if there were a repack between the pull and
> the push, some refs might have been deleted on the server; we repack
> rarely enough and pull frequently enough that this is hopefully not an
> issue.
>
> We're still testing to see if this works. But due to the number of
> assumptions it makes, it's probably not that great an idea for general
> use.

Good to hear that others are starting to experiment with solutions to
this problem! I hope to hear more updates on this.

I have a prototype of a simpler and, I believe, more robust solution,
though I think it is aimed at a narrower use case. On connecting, the
client sends a refspec together with a single sha computed over all of
its refs/shas that match that refspec, i.e. the refs for which it
believes the server might have the same values.
The server can then calculate the same sha over its own refs/shas that
match the refspec. If the "verification" sha matches, it omits those
refs and instead sends only a confirmation that they matched (along
with any refs outside of the refspec). On a match, the client can
inject its local values for the refs matching the refspec and be
guaranteed that they match the server's values.

This optimization is aimed at the worst case scenario (and is thus
potentially the best case for "compression"), when the client and
server match for all refs (a refs/* refspec). This is something that
happens often on Gerrit server startup, when it verifies that its
mirrors are up-to-date.

One reason I chose this as a starting optimization is that I think it
is one use case which will not actually benefit from "fixing" the git
protocol to only send relevant refs, since all the refs are in fact
relevant here! So something like this will likely be needed in any
future git protocol in order for it to be efficient for this use case.
And I believe this use case is likely to stick around.

With a minor tweak, this optimization should also work when replicating
actual expected updates: exclude the refs that are expected to update
from the verification, so that the server always sends their values,
since they will likely not match and would otherwise wreck the
optimization. However, for this use case it is not clear whether it is
even worth caring about the non-updating refs. In theory, knowledge of
the non-updating refs can reduce the amount of data transmitted, but I
suspect that as the ref count increases this has diminishing returns
and mostly ends up chewing up CPU and memory in a vain attempt to
reduce network traffic.

Please do keep us up-to-date on your results,

-Martin

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc.
is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
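
For illustration, the verification-sha exchange described in the message above could be sketched roughly as follows. This is a minimal sketch and not the prototype's code. The function names (`refs_digest`, `server_advertise`, `client_view`), the use of SHA-1 over sorted name/sha pairs, and the use of Python's `fnmatch` as a stand-in for git refspec matching are all assumptions made for this example; the actual wire format is not described in the message.

```python
import fnmatch
import hashlib


def refs_digest(refs, pattern):
    """Hash the sorted (name, sha) pairs of all refs matching `pattern`.

    `refs` maps ref names to sha strings. Sorting makes the digest
    independent of dict order. fnmatch is only an approximation of
    real refspec matching (assumption for this sketch).
    """
    h = hashlib.sha1()
    for name in sorted(refs):
        if fnmatch.fnmatchcase(name, pattern):
            h.update(name.encode() + b"\0" + refs[name].encode() + b"\n")
    return h.hexdigest()


def server_advertise(server_refs, pattern, client_digest):
    """Return (matched, refs_to_send) for one connection.

    On a match, only refs outside the pattern are sent; otherwise the
    server falls back to advertising everything.
    """
    if refs_digest(server_refs, pattern) == client_digest:
        outside = {n: s for n, s in server_refs.items()
                   if not fnmatch.fnmatchcase(n, pattern)}
        return True, outside
    return False, dict(server_refs)


def client_view(local_refs, pattern, matched, sent):
    """Rebuild the server's ref list on the client side.

    `local_refs` is the client's own record of the server's refs. On a
    match, the matching local values are injected next to whatever the
    server sent.
    """
    view = dict(sent)
    if matched:
        view.update({n: s for n, s in local_refs.items()
                     if fnmatch.fnmatchcase(n, pattern)})
    return view
```

With a `refs/*` pattern and a mirror that is fully up to date, `server_advertise` returns only the refs outside the pattern (for example `HEAD`), which is the startup verification case mentioned above. The "minor tweak" for expected updates would amount to filtering the updating refs out of both digests before comparing.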