On Fri, Sep 29, 2017 at 10:36 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> On Wed, 27 Sep 2017 18:46:30 +0200
> Christian Couder <christian.couder@xxxxxxxxx> wrote:
>> I don't think single-shot processes would be a huge burden, because
>> the code is simpler, and because for example for filters we already
>> have single-shot and long-running processes and no one complains about
>> that. It's code that is useful as it makes it much easier for people
>> to do some things (see the clone bundle example).
>>
>> In fact in Git development we usually start by first implementing
>> simpler single-shot solutions, before thinking, when the need arises,
>> about making them faster. So perhaps an equally valid opinion could be
>> to first only submit the patches for the single-shot protocol and later
>> submit the rest of the series when we start getting feedback about how
>> external odbs are used.
>
> My concern is that, as far as I understand the Microsoft use case,
> we already know that we need the faster solution, so the need has
> already arisen.

Yeah, some people need the faster solution, but my opinion is that many
other people would prefer the single-shot protocol. If all you want to
do is a simple resumable clone using bundles, for example, then the
long-running process solution is very much overkill.

For example, with filters, there are people using them for keyword
expansion (maybe to emulate the way Subversion and CVS substitute
keywords like $Id$, $Author$ and so on). It would be really bad to
deprecate single-shot filters and tell those people that they now have
to use long-running processes because we don't want to maintain the
small amount of code that makes single-shot filters work.

The Microsoft GVFS use case is just one use case, and it is very far
from what most people need. My opinion is that many more people could
benefit from the single-shot protocol. For example, many people and
admins could benefit from resumable clones using bundles, and if I
remove the single-shot protocol, this use case becomes unnecessarily
harder to implement, in the same way that keyword expansion would become
unnecessarily harder to implement if we removed single-shot filters.

See the first article in
https://git.github.io/rev_news/2016/03/16/edition-13/ about resumable
clone if you are not convinced that resumable clones are an old and
important problem.
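
To make this a bit more concrete, here is roughly the kind of
bundle-based workflow I have in mind. This is only a sketch: the URLs
and the branch name are made up, and in the end the client side would of
course be automated by Git and a helper rather than typed by hand:

  # Server side: create a bundle with all the refs and publish it on
  # something static like a plain HTTP server or a CDN:
  git bundle create repo.bundle --all
  # (then upload repo.bundle to, say, https://example.com/repo.bundle)

  # Client side: download the bundle resumably, clone from it, then
  # point the remote at the real repository and catch up:
  curl -C - -O https://example.com/repo.bundle
  git clone -b master repo.bundle myrepo
  cd myrepo
  git remote set-url origin https://example.com/repo.git
  git fetch origin

All of the above are one-shot operations, so there is no need for a
long-running process anywhere to get a resumable clone this way.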
>> And yeah, I could change the order of the patch series to implement
>> the long-running processes first and the single-shot process last, so
>> that it would be possible to first get feedback about the long-running
>> processes before we decide whether or not to merge the single-shot
>> stuff, but I don't think that would be the most logical order.
>
> My thinking was that we would just implement the long-running process
> and not implement the single-shot process at all (besides maybe a script
> in contrib/). If we are going to do both anyway, I agree that we should
> do the single-shot process first.

Nice to hear that!

>> > And I think that my design can be extended to support a use case in
>> > which, for example, blobs corresponding to a certain type of filename
>> > (defined by a glob like in gitattributes) can be excluded during
>> > fetch/clone, much like --blob-max-bytes, and they can be fetched
>> > either through the built-in mechanism or through a custom hook.
>>
>> Sure, we could probably rebuild something equivalent to what I did on
>> top of your design.
>>
>> My opinion though is that if we want to eventually get to the same
>> goal, it is better to first merge something that gets us very close to
>> the end goal and then add some improvements on top of it.
>
> I agree

So are you ok to rebase your patch series on top of my patch series?

My opinion is that my patch series is trying to get to the end goal, and
succeeding to a very large extent, with as few deep technical changes as
possible, and that it is the right way to approach this problem, for the
following reasons:

1) The root problem is that the current object stores (packfiles and
loose object files) are not good ways to store some objects, especially
some blobs.

2) This root problem cannot be dealt with by Git itself without any help
from external programs, because Git cannot realistically implement many
different object stores (like HTTP servers, artifact stores, etc). So
Git must be improved so that it becomes capable of communicating with
external object stores.

3) As the Git protocol uses packfiles to send objects and is not very
flexible, it might be better if external stores can also be used to
transfer objects that are no longer stored in the current object stores.
(As packfiles are not a good format for storing some objects, they are
probably also not a good format for sending them. Also, as the Git
protocol is not resumable, we might easily be able to implement
resumable clones if we let external stores handle some of the transfer.)

4) Making it easy and flexible to exchange objects (and maybe
meta-information) with the external stores is very important.

5) Protocol changes are more difficult than many other code changes, so
we should care a lot about the protocol between Git and external stores.

> - I mentioned that because I personally prefer to review smaller
> patch sets at a time,

I am ok with that, and I will send smaller patch sets about this from
now on.

> and my patch set already includes a lot of the
> same infrastructure needed by yours - for example, the places in the
> code to dynamically fetch objects, exclusion of objects when fetching or
> cloning, configuring the cloned repo when cloning, fsck, and gc.

I agree that your patch set already includes some infrastructure that
could be used by my work, and your patches perhaps implement some of
this infrastructure better than my work does (I haven't taken a deep
look).

But I really think that the right approach is to focus first on
designing a flexible protocol between Git and external stores. The
infrastructure work should then be about improving or enabling that
flexible protocol and the communication between Git and external stores.

Doing the infrastructure work first, and improving things on top of this
new infrastructure without first designing the protocol between Git and
external stores, is not the best approach, as I think we might
over-engineer some of the infrastructure, or base some user interfaces
on the infrastructure rather than on the end goal. For example, if we
improve the current protocol, which is not necessarily a bad thing in
itself, we might forget that for resumable clone it is much better if we
just let external stores and helpers handle the transfer.

I am not saying that doing infrastructure work is bad or that it will
not in the end let us reach our goals, but I see it as something that
could distract or mislead us from focusing first on the protocol between
Git and external stores.
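
Just to illustrate why I think the single-shot side is cheap to support
once the protocol is flexible enough, here is more or less the kind of
helper I imagine an admin could write. This is only a sketch: the
"get <sha1>" calling convention and the URL are invented for the example
and are not the actual interface used in my series:

  #!/bin/sh
  # Hypothetical single-shot odb helper: the idea is that Git would run
  # it once per missing object, passing an instruction and a sha1 on the
  # command line, and would read the object content from its stdout.
  case "$1" in
  get)
      # fetch the raw content of the object from an external store,
      # here a plain HTTP server indexed by sha1 (made-up URL)
      curl -f -s "https://blobs.example.com/$2"
      ;;
  *)
      exit 1
      ;;
  esac

Compare that with a long-running process, which has to speak a pkt-line
based protocol with a version and capability handshake, like the
long-running filter processes do. That is significantly more work for
someone who just wants to plug a simple external store into Git.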
I think >> > '#include "config.h"' is needed in some places. >> >> It's strange because I get no compile errors even after a "make clean" >> from my branch. >> Could you show the actual errors? > > I don't have the error messages with me now, but it was something about > a function being implicitly declared. You will probably get these errors > if you sync past commit e67a57f ("config: create config.h", 2017-06-15). I am past this commit and I get no errors. I rebased on top of: ea220ee40c "The eleventh batch for 2.15" >> > Any reason why you prefer to update the loose object functions than to >> > update the generic one (sha1_object_info_extended)? My concern with just >> > updating the loose object functions was that a caller might have >> > obtained the path by iterating through the loose object dirs, and in >> > that case we shouldn't query the external ODB for anything. >> >> You are thinking about fsck or gc? >> Otherwise I don't think it would be clean to iterate through loose object dirs. > > Yes, fsck and gc (well, prune, I think) do that. I agree that Git > typically doesn't do that (except for exceptional cases like fsck and > gc), but I was thinking about supporting existing code that does that > iteration, not introducing new code that does that. I haven't taken a look at how fsck and prune work and this is still code that Peff wrote (though a long time ago), so I tend to trust it. But I will take a look, and if it is indeed better for them, I am ok to update sha1_object_info_extended() instead of loose object functions. Thanks.