Hi Matheus, On Sun, 24 Jul 2022, Matheus Tavares wrote: > This script is currently used by three test files: t0021-conversion.sh, > t2080-parallel-checkout-basics.sh, and > t2082-parallel-checkout-attributes.sh. To avoid the need for the PERL > dependency at these tests, let's convert the script to a C test-tool > command. Great! > - Squashed the two patches together. I see why this might have been suggested, but it definitely made it more challenging for me to review. You see, it is easy to just fly over a patch that simply removes the `PERL` prereq, but it is much harder to jump back and forth over all of these removals when the `.c` version of the filter is added before them and the `.pl` version is removed after them. So I find that it was bad advice, but I do not fault you for following it (we all want reviews to just be over already and therefore sometimes pander to the reviewers, no matter how much or little sense their feedback makes). It just would have been easier for me to review if the chaff was separated from the wheat, so to say. To illustrate my point: it was a bit of a challenge to find the adjustment of the "smudge write error at" needle in all of that cruft. It would have made my life as a reviewer substantially easier had the patch series been organized this way (which I assume you had before the feedback you received demanded to squash everything in one hot pile): 1/3 adjust the needle for the error message 2/3 implement the rot13-filter in C 3/3 use the test-tool in the tests and remove the PERL prereq, and remove rot13-filter.pl > [...] > diff --git a/pkt-line.c b/pkt-line.c > index 8e43c2def4..4425bdae36 100644 > --- a/pkt-line.c > +++ b/pkt-line.c > @@ -309,9 +309,10 @@ int write_packetized_from_fd_no_flush(int fd_in, int fd_out) > return err; > } > > -int write_packetized_from_buf_no_flush(const char *src_in, size_t len, int fd_out) > +int write_packetized_from_buf_no_flush_count(const char *src_in, size_t len, > + int fd_out, int *count_ptr) > { > - int err = 0; > + int err = 0, count = 0; > size_t bytes_written = 0; > size_t bytes_to_write; > > @@ -324,10 +325,18 @@ int write_packetized_from_buf_no_flush(const char *src_in, size_t len, int fd_ou > break; > err = packet_write_gently(fd_out, src_in + bytes_written, bytes_to_write); > bytes_written += bytes_to_write; > + count++; > } > + if (count_ptr) > + *count_ptr = count; This is not just a counter, but a packet counter, right? In any case, it would probably make more sense to increment the value directly: if (count_ptr) (*count_ptr)++; More on that below, where you use it. > return err; > } > > +int write_packetized_from_buf_no_flush(const char *src_in, size_t len, int fd_out) > +{ > + return write_packetized_from_buf_no_flush_count(src_in, len, fd_out, NULL); > +} Have you considered making this a `static inline` in `pkt-line.h`? > [...] > diff --git a/t/helper/test-rot13-filter.c b/t/helper/test-rot13-filter.c > new file mode 100644 > index 0000000000..536111f272 > --- /dev/null > +++ b/t/helper/test-rot13-filter.c > @@ -0,0 +1,393 @@ > +/* > + * Example implementation for the Git filter protocol version 2 > + * See Documentation/gitattributes.txt, section "Filter Protocol" > + * > + * Usage: test-tool rot13-filter [--always-delay] <log path> <capabilities> > + * > + * Log path defines a debug log file that the script writes to. The > + * subsequent arguments define a list of supported protocol capabilities > + * ("clean", "smudge", etc). > + * > + * When --always-delay is given all pathnames with the "can-delay" flag > + * that don't appear on the list bellow are delayed with a count of 1 > + * (see more below). > + * > + * This implementation supports special test cases: > + * (1) If data with the pathname "clean-write-fail.r" is processed with > + * a "clean" operation then the write operation will die. > + * (2) If data with the pathname "smudge-write-fail.r" is processed with > + * a "smudge" operation then the write operation will die. > + * (3) If data with the pathname "error.r" is processed with any > + * operation then the filter signals that it cannot or does not want > + * to process the file. > + * (4) If data with the pathname "abort.r" is processed with any > + * operation then the filter signals that it cannot or does not want > + * to process the file and any file after that is processed with the > + * same command. > + * (5) If data with a pathname that is a key in the delay hash is > + * requested (e.g. "test-delay10.a") then the filter responds with > + * a "delay" status and sets the "requested" field in the delay hash. > + * The filter will signal the availability of this object after > + * "count" (field in delay hash) "list_available_blobs" commands. > + * (6) If data with the pathname "missing-delay.a" is processed that the > + * filter will drop the path from the "list_available_blobs" response. > + * (7) If data with the pathname "invalid-delay.a" is processed that the > + * filter will add the path "unfiltered" which was not delayed before > + * to the "list_available_blobs" response. > + */ > + > +#include "test-tool.h" > +#include "pkt-line.h" > +#include "string-list.h" > +#include "strmap.h" > + > +static FILE *logfile; > +static int always_delay; > +static struct strmap delay = STRMAP_INIT; > +static struct string_list requested_caps = STRING_LIST_INIT_NODUP; > + > +static int has_capability(const char *cap) > +{ > + return unsorted_string_list_has_string(&requested_caps, cap); > +} > + > +static char *rot13(char *str) > +{ > + char *c; > + for (c = str; *c; c++) { > + if (*c >= 'a' && *c <= 'z') > + *c = 'a' + (*c - 'a' + 13) % 26; > + else if (*c >= 'A' && *c <= 'Z') > + *c = 'A' + (*c - 'A' + 13) % 26; That's quite verbose, but it _is_ correct (if a bit harder than necessary to validate, I admit that I had to look up whether `%`'s precedence is higher than `+` in https://en.cppreference.com/w/c/language/operator_precedence). A conciser way (also easier to reason about): for (c = str; *c; c++) if (isalpha(*c)) *c += tolower(*c) < 'n' ? 13 : -13; For fun, you could also look at https://hea-www.harvard.edu/~fine/Tech/rot13.html whether you want to use yet another approach. > + } > + return str; > +} > + > +static char *skip_key_dup(const char *buf, size_t size, const char *key) > +{ > + struct strbuf keybuf = STRBUF_INIT; > + strbuf_addf(&keybuf, "%s=", key); > + if (!skip_prefix_mem(buf, size, keybuf.buf, &buf, &size) || !size) > + die("bad %s: '%s'", key, xstrndup(buf, size)); > + strbuf_release(&keybuf); > + return xstrndup(buf, size); This does what we want it to do, but it looks as if it was code translated from a language that does not care one bit about allocations to a language that cares a lot. For example, instead of allocating a `strbuf` just to append `=` to the key, in idiomatic C this code would read like this: static char *get_value(char *buf, size_t size, const char *key) { const char *orig_buf = buf; int orig_size = (int)size; if (!skip_prefix_mem(buf, size, key, &buf, &size) || !skip_prefix_mem(buf, size, "=", &buf, &size) || !size) die("expected key '%s', got '%.*s'", key, orig_size, orig_buf); return xstrndup(buf, size); } I was tempted, even, to suggest returning a `const char *` after NUL-terminating the line (via `buf[size] = '\0';`) instead of `xstrndup()`ing it, but `packet_read_line()` reads into the singleton `packet_buffer` and we use e.g. the `command` that is returned from this function after reading the next packet, so the command would most likely be overwritten. > +} > + > +/* > + * Read a text packet, expecting that it is in the form "key=value" for > + * the given key. An EOF does not trigger any error and is reported > + * back to the caller with NULL. Die if the "key" part of "key=value" does > + * not match the given key, or the value part is empty. > + */ > +static char *packet_key_val_read(const char *key) > +{ > + int size; > + char *buf; > + if (packet_read_line_gently(0, &size, &buf) < 0) > + return NULL; > + return skip_key_dup(buf, size, key); > +} > + > +static void packet_read_capabilities(struct string_list *caps) > +{ > + while (1) { In Git's source code, I think we prefer `for (;;)`. But not by much: $ git grep 'while (1)' \*.c | wc 128 508 3745 $ git grep 'for (;;)' \*.c | wc 156 614 4389 > + int size; > + char *buf = packet_read_line(0, &size); > + if (!buf) > + break; > + string_list_append_nodup(caps, > + skip_key_dup(buf, size, "capability")); It is tempting to use unsorted string lists for everything because Perl makes that relatively easy. However, in this instance I would strongly recommend using something more akin to Perl's "hash" data structure, in this instance a `strset`. > + } > +} > + > +/* Read remote capabilities and check them against capabilities we require */ > +static void packet_read_and_check_capabilities(struct string_list *remote_caps, > + struct string_list *required_caps) > +{ > + struct string_list_item *item; > + packet_read_capabilities(remote_caps); > + for_each_string_list_item(item, required_caps) { > + if (!unsorted_string_list_has_string(remote_caps, item->string)) { > + die("required '%s' capability not available from remote", > + item->string); > + } > + } > +} This is a pretty literal translation from Perl to C, and a couple of years ago, I would have done the same. However, these days I would recommend against it. In this instance, we are really only interested in three capabilities: clean, smudge and delay. It is much, much simpler to read in the capabilities and then manually verify that the three required ones were included: static void read_capabilities(struct strset *remote_caps) { char *cap while ((cap = packet_key_val_read("capability"))) strset_add(remote_caps, cap); if (!strset_contains(remote_caps, "clean")) die("required 'clean' capability not available from remote"); if (!strset_contains(remote_caps, "smudge")) die("required 'smudge' capability not available from remote"); if (!strset_contains(remote_caps, "delay")) die("required 'delay' capability not available from remote"); } > + > +/* > + * Check our capabilities we want to advertise against the remote ones > + * and then advertise our capabilities > + */ > +static void packet_check_and_write_capabilities(struct string_list *remote_caps, > + struct string_list *our_caps) The list of "our caps" comes from the command-line. In C, this means we get a `const char **argv` and an `int argc`. So: static void check_and_write_capabilities(struct strset *remote_caps, const char **caps, int caps_count) { int i; for (i = 0; i < caps_count; i++) { if (!strset_contains(remote_caps, caps[i])) die("our capability '%s' is not available from remote", caps[i]); packet_write_fmt(1, "capability=%s\n", caps[i]); } packet_flush(1); } And then we would call it via check_and_write_capabilities(remote_caps, argv + 1, argc - 1); > + > +struct delay_entry { > + int requested, count; > + char *output; > +}; Since you declare this here, it makes most sense to define `free_delay_hash()` (which should really be named `free_delay_entries()`) and `add_delay_entry()` here. > + > +static void command_loop(void) > +{ > + while (1) { > + char *command = packet_key_val_read("command"); > + if (!command) { > + fprintf(logfile, "STOP\n"); > + break; > + } > + fprintf(logfile, "IN: %s", command); We will also need to `fflush(logfile)` here, to imitate the Perl script's behavior more precisely. > + > + if (!strcmp(command, "list_available_blobs")) { > + struct hashmap_iter iter; > + struct strmap_entry *ent; > + struct string_list_item *str_item; > + struct string_list paths = STRING_LIST_INIT_NODUP; > + > + /* flush */ > + if (packet_read_line(0, NULL)) > + die("bad list_available_blobs end"); > + > + strmap_for_each_entry(&delay, &iter, ent) { > + struct delay_entry *delay_entry = ent->value; > + if (!delay_entry->requested) > + continue; > + delay_entry->count--; > + if (!strcmp(ent->key, "invalid-delay.a")) { > + /* Send Git a pathname that was not delayed earlier */ > + packet_write_fmt(1, "pathname=unfiltered"); > + } > + if (!strcmp(ent->key, "missing-delay.a")) { > + /* Do not signal Git that this file is available */ > + } else if (!delay_entry->count) { > + string_list_insert(&paths, ent->key); > + packet_write_fmt(1, "pathname=%s", ent->key); > + } > + } > + > + /* Print paths in sorted order. */ The Perl script does not order them specifically. Do we really have to do that here? In any case, it is more performant to append the paths in an unsorted way and then sort them once in the end (that's O(N log(N)) instead of O(N^2)). > + for_each_string_list_item(str_item, &paths) > + fprintf(logfile, " %s", str_item->string); > + string_list_clear(&paths, 0); > + > + packet_flush(1); > + > + fprintf(logfile, " [OK]\n"); > + packet_write_fmt(1, "status=success"); > + packet_flush(1); I know the Perl script uses an else here, but I'd much rather insert a `continue` at the end of the `list_available_blobs` clause and de-indent the remainder of the loop body. > + } else { > + char *buf, *output; > + int size; > + char *pathname; > + struct delay_entry *entry; > + struct strbuf input = STRBUF_INIT; > + > + pathname = packet_key_val_read("pathname"); > + if (!pathname) > + die("unexpected EOF while expecting pathname"); > + fprintf(logfile, " %s", pathname); Again, let's `fflush(logfile)` here. > + > + /* Read until flush */ > + buf = packet_read_line(0, &size); > + while (buf) { Let's write this in more idiomatic C: while ((buf = packet_read_line(0, &size))) { > + if (!strcmp(buf, "can-delay=1")) { > + entry = strmap_get(&delay, pathname); > + if (entry && !entry->requested) { > + entry->requested = 1; > + } else if (!entry && always_delay) { > + entry = xcalloc(1, sizeof(*entry)); > + entry->requested = 1; > + entry->count = 1; > + strmap_put(&delay, pathname, entry); I guess here is our chance to extend the signature of `add_delay_entry()` to accept a `requested` parameter, and to call that here. > + } > + } else if (starts_with(buf, "ref=") || > + starts_with(buf, "treeish=") || > + starts_with(buf, "blob=")) { > + fprintf(logfile, " %s", buf); > + } else { > + /* > + * In general, filters need to be graceful about > + * new metadata, since it's documented that we > + * can pass any key-value pairs, but for tests, > + * let's be a little stricter. > + */ > + die("Unknown message '%s'", buf); > + } > + buf = packet_read_line(0, &size); > + } > + > + > + read_packetized_to_strbuf(0, &input, 0); > + fprintf(logfile, " %"PRIuMAX" [OK] -- ", (uintmax_t)input.len); This reads _so much nicer_ than the Perl version! > + > + entry = strmap_get(&delay, pathname); > + if (entry && entry->output) { > + output = entry->output; > + } else if (!strcmp(pathname, "error.r") || !strcmp(pathname, "abort.r")) { > + output = ""; > + } else if (!strcmp(command, "clean") && has_capability("clean")) { > + output = rot13(input.buf); > + } else if (!strcmp(command, "smudge") && has_capability("smudge")) { > + output = rot13(input.buf); > + } else { > + die("bad command '%s'", command); > + } > + > + if (!strcmp(pathname, "error.r")) { > + fprintf(logfile, "[ERROR]\n"); > + packet_write_fmt(1, "status=error"); > + packet_flush(1); > + } else if (!strcmp(pathname, "abort.r")) { > + fprintf(logfile, "[ABORT]\n"); > + packet_write_fmt(1, "status=abort"); > + packet_flush(1); > + } else if (!strcmp(command, "smudge") && > + (entry = strmap_get(&delay, pathname)) && > + entry->requested == 1) { > + fprintf(logfile, "[DELAYED]\n"); > + packet_write_fmt(1, "status=delayed"); > + packet_flush(1); > + entry->requested = 2; > + entry->output = xstrdup(output); We need to call `free(entry->output)` before that lest we leak memory, but only if `output` is not identical anyway: if (entry->output != output) { free(entry->output); entry->output = xstrdup(output); } > + } else { > + int i, nr_packets; > + size_t output_len; > + struct strbuf sb = STRBUF_INIT; > + packet_write_fmt(1, "status=success"); > + packet_flush(1); > + > + strbuf_addf(&sb, "%s-write-fail.r", command); > + if (!strcmp(pathname, sb.buf)) { We can easily avoid allocating the string just for comparing it: const char *p; if (skip_prefix(pathname, command, &p) && !strcmp(p, "-write-fail.r")) { > + fprintf(logfile, "[WRITE FAIL]\n"); fflush(logfile) ;-) > + die("%s write error", command); > + } > + > + output_len = strlen(output); > + fprintf(logfile, "OUT: %"PRIuMAX" ", (uintmax_t)output_len); > + > + if (write_packetized_from_buf_no_flush_count(output, > + output_len, 1, &nr_packets)) > + die("failed to write buffer to stdout"); > + packet_flush(1); > + > + for (i = 0; i < nr_packets; i++) > + fprintf(logfile, "."); That's not quite the same as the Perl script does: it prints a '.' (without flushing, though) _every_ time it wrote a packet. If you want to emulate that, you will have to copy/edit that loop (and in that case, the insanely long-named function `write_packetized_from_buf_no_flush_count()` is unnecessary, too). > + fprintf(logfile, " [OK]\n"); > + > + packet_flush(1); > + strbuf_release(&sb); > + } > + free(pathname); > + strbuf_release(&input); > + } > + free(command); > + } > +} > + > +static void free_delay_hash(void) > +{ > + struct hashmap_iter iter; > + struct strmap_entry *ent; > + > + strmap_for_each_entry(&delay, &iter, ent) { > + struct delay_entry *delay_entry = ent->value; > + free(delay_entry->output); > + free(delay_entry); > + } > + strmap_clear(&delay, 0); > +} > + > +static void add_delay_entry(char *pathname, int count) > +{ > + struct delay_entry *entry = xcalloc(1, sizeof(*entry)); > + entry->count = count; > + if (strmap_put(&delay, pathname, entry)) > + BUG("adding the same path twice to delay hash?"); > +} > + > +static void packet_initialize(const char *name, int version) > +{ > + struct strbuf sb = STRBUF_INIT; > + int size; > + char *pkt_buf = packet_read_line(0, &size); > + > + strbuf_addf(&sb, "%s-client", name); > + if (!pkt_buf || strncmp(pkt_buf, sb.buf, size)) We do not need the flexibility of the Perl package, where `name` is a parameter. We can hard-code `git-filter-client` here. I.e. something like this: if (!pkt_buf || size != 17 || strncmp(pkt_buf, "git-filter-client", 17)) > + die("bad initialize: '%s'", xstrndup(pkt_buf, size)); > + > + strbuf_reset(&sb); > + strbuf_addf(&sb, "version=%d", version); Same here. We do not need to allocate a string just to compare it to the packet's payload. > + pkt_buf = packet_read_line(0, &size); > + if (!pkt_buf || strncmp(pkt_buf, sb.buf, size)) > + die("bad version: '%s'", xstrndup(pkt_buf, size)); > + > + pkt_buf = packet_read_line(0, &size); > + if (pkt_buf) > + die("bad version end: '%s'", xstrndup(pkt_buf, size)); > + > + packet_write_fmt(1, "%s-server", name); > + packet_write_fmt(1, "version=%d", version); > + packet_flush(1); > + strbuf_release(&sb); > +} > + > +static char *rot13_usage = "test-tool rot13-filter [--always-delay] <log path> <capabilities>"; > + > +int cmd__rot13_filter(int argc, const char **argv) > +{ > + int i = 1; > + struct string_list remote_caps = STRING_LIST_INIT_DUP, > + supported_caps = STRING_LIST_INIT_NODUP; > + > + string_list_append(&supported_caps, "clean"); > + string_list_append(&supported_caps, "smudge"); > + string_list_append(&supported_caps, "delay"); > + > + if (argc > 1 && !strcmp(argv[i], "--always-delay")) { > + always_delay = 1; > + i++; > + } > + if (argc - i < 2) > + usage(rot13_usage); > + > + logfile = fopen(argv[i++], "a"); > + if (!logfile) > + die_errno("failed to open log file"); > + > + for ( ; i < argc; i++) > + string_list_append(&requested_caps, argv[i]); > + > + add_delay_entry("test-delay10.a", 1); > + add_delay_entry("test-delay11.a", 1); > + add_delay_entry("test-delay20.a", 2); > + add_delay_entry("test-delay10.b", 1); > + add_delay_entry("missing-delay.a", 1); > + add_delay_entry("invalid-delay.a", 1); > + > + fprintf(logfile, "START\n"); > + > + packet_initialize("git-filter", 2); > + > + packet_read_and_check_capabilities(&remote_caps, &supported_caps); > + packet_check_and_write_capabilities(&remote_caps, &requested_caps); > + fprintf(logfile, "init handshake complete\n"); > + > + string_list_clear(&supported_caps, 0); > + string_list_clear(&remote_caps, 0); > + > + command_loop(); > + > + fclose(logfile); > + string_list_clear(&requested_caps, 0); > + free_delay_hash(); > + return 0; > +} Other than that, this looks great! Thank you, Dscho