On 8/16/2017 5:35 PM, Jonathan Tan wrote:
On Wed, 16 Aug 2017 13:32:23 -0700
Junio C Hamano <gitster@xxxxxxxxx> wrote:
Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
Also, let me know if there's a better way to send out these patches for
review. Some of the code here has been reviewed before, for example.
[1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@xxxxxxxxxx/
[2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@xxxxxxxxxx/
[3] https://public-inbox.org/git/20170807161031.7c4eae50@xxxxxxxxxxxxxxxxxxxxxxxxxxx/
... and some of the code exists only in the list archive, so we
don't know which other topic if any we may want to eject tentatively
if we wanted to give precedence to move this topic forward over
others. I'll worry about it later but help from others is also
appreciated.
Thanks - I can help take a look when it is time to move the code in.
I agree that having this depend on patches elsewhere in the list archive
makes it more difficult to review. I know I like to see things in
context to get a better picture.
I think the issue here is whether we want to move this topic forward or
not, that is, if this (special ".imported" objects) is the best way to
solve (at least partially) the connectivity check part of tolerating
missing objects. I hope that we can continue to talk about it.
I think this topic should continue to move forward so that we can
provide reasonable connectivity tests for fsck and check_connected in
the face of partial clones. I'm not sure the prototype implementation
of reading/parsing all imported objects to build the promised oidset is
the most performant model but we can continue to investigate the best
options.
This collects names of the objects that are _directly_ referred to
by imported objects. An imported pack may have a commit, whose
top-level tree may or may not appear in the same pack, or the tree
may exist locally but not in the same pack. Or the tree may not be
locally available at all. In any of these four cases, the top-level
tree is listed in the "promises" set. Same for trees and tags.
I wonder if all of the calls to oidset_insert() in this function
want to be guarded by "mark it as promised only when the referrent
is *not* locally available" to keep the promises set minimally
populated. The only change needed to fsck in order to make it
refrain from treating a missing but promised object as an error
would be:
- if (object is missing)
+ if (object is missing && object is not promised)
error("that object must be there but missing");
so there is no point in throwing something that we know we locally
have in this oidset, right?
On the other hand, cost of such additional checks in this function
may outweigh the savings of both memory pressure and look-up cost,
so I do not know how the tradeoff would turn out.
I also don't know how the tradeoff would turn out, so I leaned towards
the slightly simpler solution of not doing the check. In the future,
maybe a t/perf test can be done to decide between the two.
+static int is_promise(const struct object_id *oid)
+{
+ if (!promises_prepared) {
+ if (repository_format_lazy_object)
+ for_each_packed_object(add_promise, NULL,
+ FOR_EACH_OBJECT_IMPORTED_ONLY);
+ promises_prepared = 1;
+ }
+ return oidset_contains(&promises, oid);
+}
Somehow I'm tempted to call this function "is_promised()" but that
is a minor naming issue.
Given all we need is an existance check for a given oid, I wonder if it
would be faster overall to do a binary search through the list of
imported idx files + an existence test for an imported loose object.
Especially in the check_connected case which isn't verifying every
object, that should be a lot less IO than loading all the imported
commits, trees and blobs and pre-computing an oidset of all possible
objects. The lookup for each object would be slower than a simple call
to oidset_contains but we avoid the up front cost.
With some caching of idx files and threading, I suspect this could be
made pretty fast.
I was trying to be consistent in using the name "promise" instead of
"promised object/tag/commit/tree/blob" everywhere, but we can switch if
need be (for example, if we don't want to limit the generic name
"promise" to merely objects).
static const char *describe_object(struct object *obj)
{
static struct strbuf buf = STRBUF_INIT;
@@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
xstrfmt("%s@{%"PRItime"}", refname, timestamp));
obj->used = 1;
mark_object_reachable(obj);
- } else {
+ } else if (!is_promise(oid)) {
error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
}
This is about certainly is one place we want to check if the missing
object is OK, but I'm a bit surprised if this were the only place.
Don't we need "while trying to follow all the outgoing links from
this tree object, and we found this object is not available locally;
normally we would mark it as an error but it turns out that the
missing one is in the promised set of objects, so it is OK" for the
normal connectivity traversal codepaths, for example?
That's right. The places to make this change are the same as those in
some earlier patches I sent (patches 2-4 in [1]).
[1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@xxxxxxxxxx/