> On 01 Aug 2016, at 00:19, Jakub Narębski <jnareb@xxxxxxxxx> wrote: > > W dniu 30.07.2016 o 01:38, larsxschneider@xxxxxxxxx pisze: > [...] >> +Please note that you cannot use an existing filter.<driver>.clean >> +or filter.<driver>.smudge command as filter.<driver>.process >> +command. > > I think it would be more readable and easier to understand to write: > > ... you cannot use an existing ... command with > filter.<driver>.process > > About the style: wouldn't `filter.<driver>.process` be better? OK, changed it! >> As soon as Git would detect a file that needs to be >> +processed by this filter, it would stop responding. > > This is quite convoluted, and hard to understand. I would say > that because `clean` and `smudge` filters are expected to read > first, while Git expects `process` filter to say first, using > `clean` or `smudge` filter without changes as `process` filter > would lead to git command deadlocking / hanging / stopping > responding. How about this: Please note that you cannot use an existing `filter.<driver>.clean` or `filter.<driver>.smudge` command with `filter.<driver>.process` because the former two use a different inter process communication protocol than the latter one. As soon as Git would detect a file that needs to be processed by such an invalid "process" filter, it would wait for a proper protocol handshake and appear "hanging". >> + >> + >> Interaction between checkin/checkout attributes >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> >> diff --git a/convert.c b/convert.c >> index 522e2c5..be6405c 100644 >> --- a/convert.c >> +++ b/convert.c >> @@ -3,6 +3,7 @@ >> #include "run-command.h" >> #include "quote.h" >> #include "sigchain.h" >> +#include "pkt-line.h" >> >> /* >> * convert.c - convert a file when checking it out and checking it in. >> @@ -481,11 +482,355 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd, >> return ret; >> } >> >> +static int multi_packet_read(int fd_in, struct strbuf *sb, size_t expected_bytes, int is_stream) > > About name of this function: `multi_packet_read` is fine, though I wonder > if `packet_read_in_full` with nearly the same parameters as `packet_read`, > or `packet_read_till_flush`, or `read_in_full_packetized` would be better. I like `multi_packet_read` and will rename! > Also, the problem is that while we know that what packet_read() stores > would fit in memory (in size_t), it is not true for reading whole file, > which might be very large - for example huge graphical assets like raw > images or raw videos, or virtual machine images. Isn't that the goal > of git-LFS solutions, which need this feature? Shouldn't we have then > both `multi_packet_read_to_fd` and `multi_packet_read_to_buf`, > or whatever? Git LFS works well with the current clean/smudge mechanism that uses the same on in memory buffers. I understand your concern but I think this improvement is out of scope for this patch series. > Also, if we have `fd_in`, then perhaps `sb_out`? Agreed! > I am also unsure if `expected_bytes` (or `expected_size`) should not be > just a size hint, leaving handing mismatch between expected size and > real size of output to the caller; then the `is_stream` would be not > needed. As mentioned in a previous email... I will drop the "size" support in this patch series as it is not really needed. >> +{ >> + int bytes_read; >> + size_t total_bytes_read = 0; > > Why `bytes_read` is int, while `total_bytes_read` is size_t? Ah, I see > that packet_read() returns an int. It should be ssize_t, just like > read(), isn't it? But we know that packet size is limited, and would > fit in an int (or would it?). Yes, it is limited but I agree on ssize_t! > Also, total_bytes_read could overflow size_t, but then we would have > problems storing the result in strbuf. Would that check be ok? if (total_bytes_read > SIZE_MAX - bytes_read) return 1; // `total_bytes_read` would overflow and is not representable >> + if (expected_bytes == 0 && !is_stream) >> + return 0; > > So in all cases *except* size = 0 we expect flush packet after the > contents, but size = 0 is a corner case without flush packet? I agree that is inconsistent... I will change it! >> + >> + if (is_stream) >> + strbuf_grow(sb, LARGE_PACKET_MAX); // allocate space for at least one packet >> + else >> + strbuf_grow(sb, st_add(expected_bytes, 1)); // add one extra byte for the packet flush >> + >> + do { >> + bytes_read = packet_read( >> + fd_in, NULL, NULL, >> + sb->buf + total_bytes_read, sb->len - total_bytes_read - 1, >> + PACKET_READ_GENTLE_ON_EOF >> + ); >> + if (bytes_read < 0) >> + return 1; // unexpected EOF > > Don't we usually return negative numbers on error? Ah, I see that the > return is a bool, which allows to use boolean expression with 'return'. > But I am still unsure if it is good API, this return value. According to Peff zero for success is the usual style: http://public-inbox.org/git/20160728133523.GB21311%40sigill.intra.peff.net/ > If we move handling of size mismatch to the caller, then the function > can simply return the size of data read (probably off_t or uint64_t). > Then the caller can check if it is what it expected, and react accordingly. True, but as discussed previously I will remove the size. >> + >> + if (is_stream && >> + bytes_read > 0 && >> + sb->len - total_bytes_read - 1 <= 0) >> + strbuf_grow(sb, st_add(sb->len, LARGE_PACKET_MAX)); >> + total_bytes_read += bytes_read; >> + } >> + while ( >> + bytes_read > 0 && // the last packet was no flush >> + sb->len - total_bytes_read - 1 > 0 // we still have space left in the buffer > > Ah, so buffer is resized only in the 'is_stream' case. Perhaps then > use an "int options" instead of 'is_stream', and have one of flags > tell if we should resize or not, that is if size parameter is hint > or a strict limit. Obsolete >> + ); >> + strbuf_setlen(sb, total_bytes_read); >> + return (is_stream ? 0 : expected_bytes != total_bytes_read); >> +} >> + >> +static int multi_packet_write_from_fd(const int fd_in, const int fd_out) > > Is it equivalent of copy_fd() function, but where destination uses pkt-line > and we need to pack data into pkt-lines? Correct! >> +{ >> + int did_fail = 0; >> + ssize_t bytes_to_write; >> + while (!did_fail) { >> + bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_LEN); > > Using global variable packet_buffer makes this code thread-unsafe, isn't it? > But perhaps that is not a problem, because other functions are also > using this global variable. Correct! > It is more of PKTLINE_DATA_MAXLEN, isn't it? Agreed, will change! > >> + if (bytes_to_write < 0) >> + return 1; >> + if (bytes_to_write == 0) >> + break; >> + did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1); >> + } >> + if (!did_fail) >> + did_fail = packet_flush_gentle(fd_out); > > Shouldn't we try to flush even if there was an error? Or is it > that if there is an error writing, then there is some problem > such that we know that flush would not work? Right, that's what I though. >> + return did_fail; > > Return true on fail? Shouldn't we follow example of copy_fd() > from copy.c, and return COPY_READ_ERROR, or COPY_WRITE_ERROR, > or PKTLINE_WRITE_ERROR? OK. How about this? static int multi_packet_write_from_fd(const int fd_in, const int fd_out) { int did_fail = 0; ssize_t bytes_to_write; while (!did_fail) { bytes_to_write = xread(fd_in, PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_MAXLEN); if (bytes_to_write < 0) return COPY_READ_ERROR; if (bytes_to_write == 0) break; did_fail |= direct_packet_write(fd_out, packet_buffer, PKTLINE_HEADER_LEN + bytes_to_write, 1); } if (!did_fail) did_fail = packet_flush_gently(fd_out); return (did_fail ? COPY_WRITE_ERROR : 0); } >> +} >> + >> +static int multi_packet_write_from_buf(const char *src, size_t len, int fd_out) > > It is equivalent of write_in_full(), with different order of parameters, > but where destination file descriptor expects pkt-line and we need to pack > data into pkt-lines? True. Do you suggest to reorder parameters? I also would like to rename `src` to `src_in`, OK? > > NOTE: function description comments? What do you mean here? >> +{ >> + int did_fail = 0; >> + size_t bytes_written = 0; >> + size_t bytes_to_write; > > Note to self: bytes_to_write should fit in size_t, as it is limited to > PKTLINE_DATA_LEN. bytes_written should fit in size_t, as it is at most > len, which is of type size_t. > >> + while (!did_fail) { >> + if ((len - bytes_written) > PKTLINE_DATA_LEN) >> + bytes_to_write = PKTLINE_DATA_LEN; >> + else >> + bytes_to_write = len - bytes_written; >> + if (bytes_to_write == 0) >> + break; >> + did_fail |= direct_packet_write_data(fd_out, src + bytes_written, bytes_to_write, 1); >> + bytes_written += bytes_to_write; > > Ah, I see now why we need both direct_packet_write() and > direct_packet_write_data(). Nice abstraction, makes for > clear code. > > The last parameter of '1' means 'gently', isn't it? Correct. Thanks :) >> + } >> + if (!did_fail) >> + did_fail = packet_flush_gentle(fd_out); >> + return did_fail; >> +} > > I think all three/four of those functions should be added in a separate > commit, separate patch in patch series. OK > Namely: > > - for git -> filter: > * read from fd, write pkt-line to fd (off_t) > * read from str+len, write pkt-line to fd (size_t, ssize_t) > - for filter -> git: > * read pkt-line from fd, write to fd (off_t) This one does not exist. > * read pkt-line from fd, write to str+len (size_t, ssize_t) > > Perhaps some of those can be in one overloaded function, perhaps it would > be easier to keep them separate. I would like to keep them separate as it is easier to comprehend. > > Also, I do wonder how the fetch / push code spools pack file received > over pkt-lines to disk. Can we reuse that code? I haven't found any. > Or maybe that code > could use those new functions? I think so, but this would be out of scope for this series :) >> + >> +#define FILTER_CAPABILITIES_STREAM 0x1 >> +#define FILTER_CAPABILITIES_CLEAN 0x2 >> +#define FILTER_CAPABILITIES_SMUDGE 0x4 >> +#define FILTER_CAPABILITIES_SHUTDOWN 0x8 >> +#define FILTER_SUPPORTS_STREAM(type) ((type) & FILTER_CAPABILITIES_STREAM) >> +#define FILTER_SUPPORTS_CLEAN(type) ((type) & FILTER_CAPABILITIES_CLEAN) >> +#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE) >> +#define FILTER_SUPPORTS_SHUTDOWN(type) ((type) & FILTER_CAPABILITIES_SHUTDOWN) >> + >> +struct cmd2process { >> + struct hashmap_entry ent; /* must be the first member! */ >> + const char *cmd; >> + int supported_capabilities; > > I wonder if switching from int (perhaps with field width of 1 to denote > that it is boolean-like flag) to mask makes it more readable, or less. > But I think it is. > > > Reading Documentation/technical/api-hashmap.txt I found the following > recommendation: > > `struct hashmap_entry`:: > > An opaque structure representing an entry in the hash table, which must > be used as first member of user data structures. Ideally it should be > followed by an int-sized member to prevent unused memory on 64-bit > systems due to alignment. > > Therefore it "int supported_capabilities" should precede > "const char *cmd", I think. Though it is not strictly necessary; it > is not as if this hash table were large (maximum size is limited by > the number of filter drivers configured), so we don't waste much space > due to internal padding / due to alignment. Thanks! I will change it to your suggestion anyway! > >> + struct child_process process; >> +}; >> + >> +static int cmd_process_map_initialized = 0; >> +static struct hashmap cmd_process_map; > > Reading Documentation/technical/api-hashmap.txt I see that: > > `tablesize` is the allocated size of the hash table. A non-0 value indicates > that the hashmap is initialized. > > So cmd_process_map_initialized is not really needed, is it? I copied that from config.c: https://github.com/git/git/blob/f8f7adce9fc50a11a764d57815602dcb818d1816/config.c#L1425-L1428 `git grep "tablesize"` reveals that the check for `tablesize` is only used in hashmap.c ... so what approach should we use? >> + >> +static int cmd2process_cmp(const struct cmd2process *e1, >> + const struct cmd2process *e2, >> + const void *unused) >> +{ >> + return strcmp(e1->cmd, e2->cmd); >> +} > > Well, to be exact (which is decidely not needed!) two commands might > be equivalent not being identical as strings (e.g. extra space between > parameters). But it is something the user should care about, not Git. > >> + >> +static struct cmd2process *find_protocol2_filter_entry(struct hashmap *hashmap, const char *cmd) > > I'm not sure if *_protocol2_* is needed; those functions are static, > local to convert.c. I want to make sure that the reader understands that these functions are related to the filter protocol version 2. Not OK? >> +{ >> + struct cmd2process k; > > Does this name of variable 'k' follow established convention? > 'key' would be more descriptive, but it's not as if this function > was long; so 'k' is all right, I think. I agree on "key". > >> + hashmap_entry_init(&k, strhash(cmd)); >> + k.cmd = cmd; >> + return hashmap_get(hashmap, &k, NULL); >> +} >> + >> +static void kill_protocol2_filter(struct hashmap *hashmap, struct cmd2process *entry) { > > Programming style: the opening brace should be on separate line, > that is: > > +static void kill_protocol2_filter(struct hashmap *hashmap, struct cmd2process *entry) > +{ Agreed! >> + if (!entry) >> + return; >> + sigchain_push(SIGPIPE, SIG_IGN); >> + close(entry->process.in); >> + close(entry->process.out); >> + sigchain_pop(SIGPIPE); >> + finish_command(&entry->process); >> + child_process_clear(&entry->process); >> + hashmap_remove(hashmap, entry, NULL); >> + free(entry); >> +} > > All those, from #define FILTER_CAPABILITIES_ to here could be put > in a separate patch, to reduce size of this one. But I am less > sure that it is worth it for this case. > >> + >> +void shutdown_protocol2_filter(pid_t pid) >> +{ > [...] > > In my opinion this should be postponed to a separate commit. Agreed! > >> +} >> + >> +static struct cmd2process *start_protocol2_filter(struct hashmap *hashmap, const char *cmd) > > This has some parts in common with existing filter_buffer_or_fd(). > I wonder if it would be worth to extract those common parts. > > But perhaps it would be better to leave such refactoring for later. > >> +{ >> + int did_fail; >> + struct cmd2process *entry; >> + struct child_process *process; >> + const char *argv[] = { cmd, NULL }; >> + struct string_list capabilities = STRING_LIST_INIT_NODUP; >> + char *capabilities_buffer; >> + int i; >> + >> + entry = xmalloc(sizeof(*entry)); >> + hashmap_entry_init(entry, strhash(cmd)); >> + entry->cmd = cmd; >> + entry->supported_capabilities = 0; >> + process = &entry->process; >> + >> + child_process_init(process); > > filter_buffer_or_fd() uses instead > > struct child_process child_process = CHILD_PROCESS_INIT; > > But I see that you need to access &entry->process anyway, so you > need to have it here, and in this case child_process_init() is > equivalent. > > I wonder if it would be worth it to use strbuf for cmd. What do you mean by "worth it to use strbuf for cmd"? Why would we need a strbuf? >> + process->argv = argv; >> + process->use_shell = 1; >> + process->in = -1; >> + process->out = -1; >> + process->clean_on_exit = 1; >> + process->clean_on_exit_handler = shutdown_protocol2_filter; > > These two lines are new, and related to the "shutdown" capability, isn't it? Yes. > >> + >> + if (start_command(process)) { >> + error("cannot fork to run external filter '%s'", cmd); >> + kill_protocol2_filter(hashmap, entry); > > I guess the alternative solution of adding filter to the hashmap only > after starting the process would be racy? > > Ah, disregard that. I see that this pattern is a common way to error > out in this function (for process-related errors). > >> + return NULL; >> + } >> + >> + sigchain_push(SIGPIPE, SIG_IGN); >> + did_fail = strcmp(packet_read_line(process->out, NULL), "git-filter-protocol"); >> + if (!did_fail) >> + did_fail |= strcmp(packet_read_line(process->out, NULL), "version 2"); >> + if (!did_fail) >> + capabilities_buffer = packet_read_line(process->out, NULL); >> + else >> + capabilities_buffer = NULL; >> + sigchain_pop(SIGPIPE); >> + >> + if (!did_fail && capabilities_buffer) { >> + string_list_split_in_place(&capabilities, capabilities_buffer, ' ', -1); >> + if (capabilities.nr > 1 && >> + !strcmp(capabilities.items[0].string, "capabilities")) { >> + for (i = 1; i < capabilities.nr; i++) { >> + const char *requested = capabilities.items[i].string; >> + if (!strcmp(requested, "stream")) { >> + entry->supported_capabilities |= FILTER_CAPABILITIES_STREAM; >> + } else if (!strcmp(requested, "clean")) { >> + entry->supported_capabilities |= FILTER_CAPABILITIES_CLEAN; >> + } else if (!strcmp(requested, "smudge")) { >> + entry->supported_capabilities |= FILTER_CAPABILITIES_SMUDGE; >> + } else if (!strcmp(requested, "shutdown")) { >> + entry->supported_capabilities |= FILTER_CAPABILITIES_SHUTDOWN; >> + } else { >> + warning( >> + "external filter '%s' requested unsupported filter capability '%s'", >> + cmd, requested >> + ); >> + } >> + } >> + } else { >> + error("filter capabilities not found"); >> + did_fail = 1; >> + } >> + string_list_clear(&capabilities, 0); >> + } > > I wonder if the above conditional wouldn't be better to be put in > a separate function, parse_filter_capabilities(capabilities_buffer), > returning a mask, or having mask as an out parameter, and returning > an error condition. Agreed. >> + >> + if (did_fail) { >> + error("initialization for external filter '%s' failed", cmd); > > More detailed information not needed, because one can use GIT_PACKET_TRACE. > Would it be worth add this information as a kind of advice, or put it > in the documentation of the `process` option? I will put it into the docs. > >> + kill_protocol2_filter(hashmap, entry); >> + return NULL; >> + } >> + >> + hashmap_add(hashmap, entry); >> + return entry; >> +} >> + >> +static int apply_protocol2_filter(const char *path, const char *src, size_t len, >> + int fd, struct strbuf *dst, const char *cmd, >> + const int wanted_capability) > > apply_protocol2_filter, or apply_process_filter? Or rather, > s/_protocol2_/_process_/g ? Mh. I wanted to convey that this functions is protocol V2 related... > > This is equivalent to > > static int apply_filter(const char *path, const char *src, size_t len, int fd, > struct strbuf *dst, const char *cmd) > > Could we have extended that one instead? Initially I had one function but that got kind of long ... I prefer two for now. >> +{ >> + int ret = 1; >> + struct cmd2process *entry; >> + struct child_process *process; >> + struct stat file_stat; >> + struct strbuf nbuf = STRBUF_INIT; >> + size_t expected_bytes = 0; >> + char *strtol_end; >> + char *strbuf; >> + char *filter_type; >> + char *filter_result = NULL; >> + > >> + if (!cmd || !*cmd) >> + return 0; >> + >> + if (!dst) >> + return 1; > > This is the same as in apply_filter(). > >> + >> + if (!cmd_process_map_initialized) { >> + cmd_process_map_initialized = 1; >> + hashmap_init(&cmd_process_map, (hashmap_cmp_fn) cmd2process_cmp, 0); >> + entry = NULL; >> + } else { >> + entry = find_protocol2_filter_entry(&cmd_process_map, cmd); >> + } > > Here we try to find existing process, rather than starting new > as in apply_filter() > >> + >> + fflush(NULL); > > This is the same as in apply_filter(), but I wonder what it is for. "If the stream argument is NULL, fflush() flushes all open output streams." http://man7.org/linux/man-pages/man3/fflush.3.html > >> + >> + if (!entry) { >> + entry = start_protocol2_filter(&cmd_process_map, cmd); >> + if (!entry) { >> + return 0; >> + } > > Style; we prefer: > > + if (!entry) > + return 0; Agreed. > This is very similar to apply_filter(), but the latter uses start_async() > from "run-command.h", with filter_buffer_or_fd() as asynchronous process, > which gets passed command to run in struct filter_params. In this > function start_protocol2_filter() runs start_command(), synchronous API. > > Why the difference? The protocol V2 requires a sequential processing of the packets. See discussion with Junio here: http://public-inbox.org/git/xmqqbn1th5qn.fsf%40gitster.mtv.corp.google.com/ [LONG SNIP] I will answer the second half in a separate email. Thanks for the review, Lars -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html