Re: [PATCH] hash-object --no-filters

On Sat, Aug 02, 2008 at 10:28:13AM -0700, Junio C Hamano wrote:
> Dmitry Potapov <dpotapov@xxxxxxxxx> writes:
> 
> > The --no-filters option makes git hash-object to work as there were no
> > input filters. This option is useful for importers such as git-svn to
> > put new version of files as is even if autocrlf is set.
> 
> I think this is going in the right direction, but I have to wonder a few
> things.
> 
> First, on hash-object.
> 
>  (1) "hash-object --stdin" always hashes literally.  We may want to be
>      able to say "The contents is this but pretend it came from this path
>      and apply the usual input rules", perhaps with "--path=" option;

It makes sense.

> 
>  (2) "hash-object temporaryfile" may want to honor the same "--path"
>      option;

Agreed.

> 
>  (3) "hash-object --stdin-paths" may want to get pair of paths (i.e. two
>      lines per entry) to do the same.

I cannot come up with a good name for this option.

> 
> If we want to do the above, the existing low-level interface needs to be
> adjusted.
> 
> index_pipe() and index_fd() can learn to take an additional string
> parameter for attribute lookup to implement (1) and (2) above.

index_fd() already has the 'path' parameter, which is used as a hint
for blob conversion.

> Perhaps
> the string can be NULL to signal --no-filter behaviour, in which case the
> HASH_OBJECT_LITERALLY change may not be necessary for this codepath.

Sounds like a good idea :)

> 
> By the way, why do we have index_pipe() and index_fd() to begin with?  Is
> it because users of index_pipe() do not know what the path it is hashing
> and also the fd being a pipe we cannot mmap it?

index_fd() does not need the path for anything but to choose filters.
So, if index_pipe supported filters, it would have the same parameter.

There is one more parameter that index_fd() has and index_pipe() does
not: 'struct stat'. So I looked at what this parameter is used for in
index_fd(), and it turned out to be two things:
- to determine the size of the region to mmap
- to check whether the file is regular and, if it is not, to skip
  convert_to_git().

That made me wonder whether index_fd() can ever be called for a non-
regular file. I studied the source code: with the exception of git
hash-object, which can pass anything that can be opened, in all
other cases we always call it for what is known to be a regular file.
In fact, it could not be otherwise, because it won't work for
non-regular files. It is quite obvious that git hash-object for a
directory will fail, but I wondered what would happen if I gave it
something different, for instance a named pipe (FIFO):

 $ mkfifo fifofile
 $ git hash-object fifofile
 <waits until another process starts writing to it>
 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

i.e. the same SHA-1 as for an empty file, and here is why: index_fd()
tries to mmap the file descriptor and that obviously fails, but xmmap()
has this particular code:

	if (ret == MAP_FAILED) {
		if (!length)
			return NULL;

Apparently this was a workaround for empty files, but because st_size
is 0 for pipes, index_fd() treats any pipe as an empty file!
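
To make the observation above easy to reproduce outside git, here is a
minimal sketch. probe_pipe() is a hypothetical helper, not git code: it
puts a few bytes into an anonymous pipe and reports what fstat() says
about the read end. I assume a POSIX system. As far as I know POSIX
leaves st_size unspecified for pipes and Linux reports 0, so only the
S_ISREG() result should be relied upon.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of git): write `len` bytes into a fresh
 * pipe and report what fstat() says about the read end.
 * `len` should be small enough to fit in the pipe buffer.
 * Returns 0 on success, -1 on error.
 */
static int probe_pipe(const char *data, size_t len,
		      int *is_regular, off_t *size)
{
	int fds[2];
	struct stat st;
	int ret = -1;

	assert(is_regular != NULL && size != NULL);
	if (pipe(fds) < 0)
		return -1;
	if (write(fds[1], data, len) == (ssize_t)len &&
	    fstat(fds[0], &st) == 0) {
		/* A pipe is never a regular file, whatever st_size says. */
		*is_regular = S_ISREG(st.st_mode) ? 1 : 0;
		*size = st.st_size;
		ret = 0;
	}
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

Calling probe_pipe("hello", 5, &is_regular, &size) should leave
is_regular at 0 even though five bytes are waiting in the pipe, which
is why a length taken from st_size cannot be trusted for such
descriptors.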

> 
> If these two are the only reasons, then I wonder if we can:
> 
>  - accept NULL as path and stat parameters for callers without a filename
>    (which automatically implies we are doing a regular blob and we hash
>    literally); and

I like this idea.

> 
>  - first try to mmap(), and if it fails fall back to the "read once into
>    strbuf" codepath to solve mmap-vs-pipe issue.

I have an alternative proposal:

Because the stat structure is given as a parameter, we can always
check whether the file is regular or not. If it is regular, we can use
mmap(); if it is not, we use the "read once into strbuf" approach.
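
The dispatch described above can be sketched as a standalone function.
This is only an illustration under my own assumptions, not the patch
itself: slurp_fd() is a hypothetical name, it uses plain malloc() and
read() instead of git's strbuf, and it copies the mapped data so that
the caller always gets a buffer it can free().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not git's index_fd): return a malloc()ed,
 * NUL-terminated copy of everything readable from fd and store its
 * length in *len. Regular files go through mmap(); anything else
 * (pipe, FIFO, terminal) is read until EOF. Returns NULL on error.
 */
static char *slurp_fd(int fd, size_t *len)
{
	struct stat st;
	size_t used = 0, cap = 4096;
	char *out;

	assert(len != NULL);
	if (fstat(fd, &st) < 0)
		return NULL;

	if (S_ISREG(st.st_mode)) {
		/* st_size is meaningful here, so it can size the mapping. */
		size_t size = (size_t)st.st_size;

		out = malloc(size + 1);
		if (!out)
			return NULL;
		if (size) {
			void *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE,
					 fd, 0);
			if (map == MAP_FAILED) {
				free(out);
				return NULL;
			}
			memcpy(out, map, size);
			munmap(map, size);
		}
		out[size] = '\0';
		*len = size;
		return out;
	}

	/* Not a regular file: ignore st_size and read until EOF. */
	out = malloc(cap + 1);
	if (!out)
		return NULL;
	for (;;) {
		ssize_t n = read(fd, out + used, cap - used);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			free(out);
			return NULL;
		}
		if (n == 0)
			break;
		used += (size_t)n;
		if (used == cap) {
			/* Buffer is full: double it, keeping room for NUL. */
			char *tmp = realloc(out, cap * 2 + 1);

			if (!tmp) {
				free(out);
				return NULL;
			}
			out = tmp;
			cap *= 2;
		}
	}
	out[used] = '\0';
	*len = used;
	return out;
}
```

The caller does not need to know what kind of descriptor it has: an
empty regular file yields a zero-length buffer, and a pipe yields
whatever was written before the writer closed its end.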

> I am not sure if such a unification of these two functions is useful,
> though.

I have implemented this unification. It reduces the code size, makes
git-hash-object work with named pipes, and makes it easier to add the
--path and --no-filters options, because there is no need to modify
the index_fd() interface anymore and there is a single place where
convert_to_git() is invoked. So it looks like a good idea.

Here is the patch:

-- >8 --
From: Dmitry Potapov <dpotapov@xxxxxxxxx>
Date: Sun, 3 Aug 2008 08:39:16 +0400
Subject: [PATCH] teach index_fd to work with pipes

index_fd can now work with any readable file descriptor, not only
with regular files. If the given file descriptor refers to a regular
file, mmap() is used; for other files, strbuf_read is used.

The path parameter, which has been used as a hint for filters, can now
be NULL to indicate that the file should be hashed literally, without
any filter.

The index_pipe function is removed as redundant.

Signed-off-by: Dmitry Potapov <dpotapov@xxxxxxxxx>
---
 cache.h       |    1 -
 hash-object.c |   29 +++++++++++--------------
 sha1_file.c   |   64 +++++++++++++++++++++++++++-----------------------------
 3 files changed, 44 insertions(+), 50 deletions(-)

git-hash-object before
   text    data     bss     dec     hex filename
 148751    1332   93164  243247   3b62f git-hash-object

and after patch
   text    data     bss     dec     hex filename
 148687    1332   93164  243183   3b5ef git-hash-object


diff --git a/cache.h b/cache.h
index 2475de9..68ce6e6 100644
--- a/cache.h
+++ b/cache.h
@@ -391,7 +391,6 @@ extern int ie_modified(const struct index_state *, struct cache_entry *, struct
 
 extern int ce_path_match(const struct cache_entry *ce, const char **pathspec);
 extern int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, enum object_type type, const char *path);
-extern int index_pipe(unsigned char *sha1, int fd, const char *type, int write_object);
 extern int index_path(unsigned char *sha1, const char *path, struct stat *st, int write_object);
 extern void fill_stat_cache_info(struct cache_entry *ce, struct stat *st);
 
diff --git a/hash-object.c b/hash-object.c
index 46c06a9..ce027b9 100644
--- a/hash-object.c
+++ b/hash-object.c
@@ -8,28 +8,25 @@
 #include "blob.h"
 #include "quote.h"
 
-static void hash_object(const char *path, enum object_type type, int write_object)
+static void hash_fd(int fd, const char *type, int write_object, const char *path)
 {
-	int fd;
 	struct stat st;
 	unsigned char sha1[20];
-	fd = open(path, O_RDONLY);
-	if (fd < 0 ||
-	    fstat(fd, &st) < 0 ||
-	    index_fd(sha1, fd, &st, write_object, type, path))
+	if (fstat(fd, &st) < 0 ||
+	    index_fd(sha1, fd, &st, write_object, type_from_string(type), path))
 		die(write_object
 		    ? "Unable to add %s to database"
 		    : "Unable to hash %s", path);
 	printf("%s\n", sha1_to_hex(sha1));
 	maybe_flush_or_die(stdout, "hash to stdout");
 }
-
-static void hash_stdin(const char *type, int write_object)
+static void hash_object(const char *path, const char *type, int write_object)
 {
-	unsigned char sha1[20];
-	if (index_pipe(sha1, 0, type, write_object))
-		die("Unable to add stdin to database");
-	printf("%s\n", sha1_to_hex(sha1));
+	int fd;
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		die("Cannot open %s", path);
+	hash_fd(fd, type, write_object, path);
 }
 
 static void hash_stdin_paths(const char *type, int write_objects)
@@ -45,7 +42,7 @@ static void hash_stdin_paths(const char *type, int write_objects)
 				die("line is badly quoted");
 			strbuf_swap(&buf, &nbuf);
 		}
-		hash_object(buf.buf, type_from_string(type), write_objects);
+		hash_object(buf.buf, type, write_objects);
 	}
 	strbuf_release(&buf);
 	strbuf_release(&nbuf);
@@ -116,13 +113,13 @@ int main(int argc, char **argv)
 			}
 
 			if (hashstdin) {
-				hash_stdin(type, write_object);
+				hash_fd(0, type, write_object, NULL);
 				hashstdin = 0;
 			}
 			if (0 <= prefix_length)
 				arg = prefix_filename(prefix, prefix_length,
 						      arg);
-			hash_object(arg, type_from_string(type), write_object);
+			hash_object(arg, type, write_object);
 			no_more_flags = 1;
 		}
 	}
@@ -131,6 +128,6 @@ int main(int argc, char **argv)
 		hash_stdin_paths(type, write_object);
 
 	if (hashstdin)
-		hash_stdin(type, write_object);
+		hash_fd(0, type, write_object, NULL);
 	return 0;
 }
diff --git a/sha1_file.c b/sha1_file.c
index e281c14..765a7e7 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2353,51 +2353,22 @@ int has_sha1_file(const unsigned char *sha1)
 	return has_loose_object(sha1);
 }
 
-int index_pipe(unsigned char *sha1, int fd, const char *type, int write_object)
+static int index_mem(unsigned char *sha1, void *buf, size_t size,
+		     int write_object, enum object_type type, const char *path)
 {
-	struct strbuf buf;
-	int ret;
-
-	strbuf_init(&buf, 0);
-	if (strbuf_read(&buf, fd, 4096) < 0) {
-		strbuf_release(&buf);
-		return -1;
-	}
-
-	if (!type)
-		type = blob_type;
-	if (write_object)
-		ret = write_sha1_file(buf.buf, buf.len, type, sha1);
-	else
-		ret = hash_sha1_file(buf.buf, buf.len, type, sha1);
-	strbuf_release(&buf);
-
-	return ret;
-}
-
-int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
-	     enum object_type type, const char *path)
-{
-	size_t size = xsize_t(st->st_size);
-	void *buf = NULL;
 	int ret, re_allocated = 0;
 
-	if (size)
-		buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
-	close(fd);
-
 	if (!type)
 		type = OBJ_BLOB;
 
 	/*
 	 * Convert blobs to git internal format
 	 */
-	if ((type == OBJ_BLOB) && S_ISREG(st->st_mode)) {
+	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf;
 		strbuf_init(&nbuf, 0);
 		if (convert_to_git(path, buf, size, &nbuf,
 		                   write_object ? safe_crlf : 0)) {
-			munmap(buf, size);
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -2411,8 +2382,35 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 		free(buf);
 		return ret;
 	}
-	if (size)
+	return ret;
+}
+
+int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
+	     enum object_type type, const char *path)
+{
+	size_t size = xsize_t(st->st_size);
+	int ret;
+
+	if (!S_ISREG(st->st_mode))
+	{
+		struct strbuf sbuf;
+		strbuf_init(&sbuf, 0);
+		if (strbuf_read(&sbuf, fd, 4096) >= 0)
+			ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
+					type, path);
+		else
+			ret = -1;
+		strbuf_release(&sbuf);
+	}
+	else if (size)
+	{
+		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+		ret = index_mem(sha1, buf, size, write_object, type, path);
 		munmap(buf, size);
+	}
+	else
+		ret = index_mem(sha1, NULL, size, write_object, type, path);
+	close(fd);
 	return ret;
 }
 
-- 
1.6.0.rc1.53.gaeaa.dirty
