Re: [PATCH v6 1/1] http: Add Accept-Language header if possible

Eric Sunshine <sunshine@xxxxxxxxxxxxxx> · Wed, 24 Dec 2014 15:35:09 -0500

On Mon, Dec 22, 2014 at 11:44 AM, Yi EungJun <semtlenori@xxxxxxxxx> wrote:
> From: Yi EungJun <eungjun.yi@xxxxxxxxxxxxx>
>
> Add an Accept-Language header which indicates the user's preferred
> languages defined by $LANGUAGE, $LC_ALL, $LC_MESSAGES and $LANG.
>
> Examples:
>   LANGUAGE= -> ""
>   LANGUAGE=ko:en -> "Accept-Language: ko, en;q=0.9, *;q=0.1"
>   LANGUAGE=ko LANG=en_US.UTF-8 -> "Accept-Language: ko, *;q=0.1"
>   LANGUAGE= LANG=en_US.UTF-8 -> "Accept-Language: en-US, *;q=0.1"
>
> This gives git servers a chance to display remote error messages in
> the user's preferred language.
>
> Limit the number of languages to 1,000 because q-value must not be
> smaller than 0.001, and limit the length of Accept-Language header to
> 4,000 bytes for some HTTP servers which cannot accept such long header.

Just a few comments and observations below. Alone, they are not
necessarily worth a re-roll, but if you happen to re-roll for some
other reason, perhaps take them into consideration.

> Signed-off-by: Yi EungJun <eungjun.yi@xxxxxxxxxxxxx>
> ---
> diff --git a/http.c b/http.c
> index 040f362..7a77708 100644
> --- a/http.c
> +++ b/http.c
> @@ -986,6 +993,166 @@ static void extract_content_type(struct strbuf *raw, struct strbuf *type,
>                 strbuf_addstr(charset, "ISO-8859-1");
>  }
>
> +static void write_accept_language(struct strbuf *buf)
> +{
> + [...]
> +       /*
> +        * MAX_LANGS must not be larger than 1,000. If it is larger than that,
> +        * q-value will be smaller than 0.001, the minimum q-value the HTTP
> +        * specification allows [1].
> +        *
> +        * [1]: http://tools.ietf.org/html/rfc7231#section-5.3.1
> +        */
> +       const int MAX_LANGS = 1000;
> +       const int MAX_SIZE_OF_HEADER = 4000;
> +       const int MAX_SIZE_OF_ASTERISK_ELEMENT = 11; /* for ", *;q=0.001" */

These two MAX_SIZE_* constants are never used individually, but rather
only as (MAX_SIZE_OF_HEADER - MAX_SIZE_OF_ASTERISK_ELEMENT). It might
be a bit more readable to compute the final value here, with a
suitable comment, rather than at point-of-use. Perhaps something like:

    /* limit of some HTTP servers is 4000 - strlen(", *;q=0.001") */
    const int MAX_HEADER_SIZE = 4000 - 11;

More below.

> + [...]
> +       /*
> +        * Convert a list of colon-separated locale values [1][2] to a list of
> +        * comma-separated language tags [3] which can be used as a value of
> +        * Accept-Language header.
> + [...]
> +        */
> +       for (pos = lang_begin; ; pos++) {
> +               if (!*pos || *pos == ':') {
> +                       if (is_q_factor_required) {
> +                               /* Put a q-factor only if it is less than 1.0. */
> +                               if (q < max_q)
> +                                       strbuf_addf(buf, q_format, q);
> +
> +                               if (q > 1)
> +                                       q--;
> +
> +                               last_size = buf->len;
> +
> +                               is_q_factor_required = 0;
> +                       }
> +                       parse_state = SEPARATOR;
> +               } else if (parse_state == CODESET_OR_MODIFIER)
> +                       continue;
> +               else if (*pos == ' ') /* Ignore whitespace character */
> +                       continue;
> +               else if (*pos == '.' || *pos == '@') /* Remove .codeset and @modifier. */
> +                       parse_state = CODESET_OR_MODIFIER;
> +               else {
> +                       if (parse_state != LANGUAGE_TAG && q < max_q)
> +                               strbuf_addstr(buf, ", ");
> +                       strbuf_addch(buf, *pos == '_' ? '-' : *pos);
> +                       is_q_factor_required = 1;
> +                       parse_state = LANGUAGE_TAG;
> +               }
> +
> +               if (buf->len > MAX_SIZE_OF_HEADER - MAX_SIZE_OF_ASTERISK_ELEMENT) {
> +                       strbuf_remove(buf, last_size, buf->len - last_size);
> +                       break;
> +               }
> +
> +               if (!*pos)
> +                       break;
> +       }

Although often suitable when parsing complex inputs, state machines
demand high cognitive load. The input you're parsing, on the other
hand, is straightforward and can easily be processed with a simple
sequential parser, which is easier to reason about and review for
correctness. For instance, something like this:

    while (*s) {
        /* collect language tag */
        for (; *s && *s != '.' && *s != '@' && *s != ':'; s++)
            strbuf_addch(buf, *s == '_' ? '-' : *s);

        /* skip .codeset and @modifier */
        while (*s && *s != ':')
            s++;

        strbuf_addf(buf, q_format, q);
        ... other bookkeeping ...

        if (*s == ':')
            s++;
    }

This example is intentionally simplified but illustrates the general
idea. It lacks comma insertion (left as an exercise for the reader)
and empty language tag handling (":en", "en::ko"); and doesn't take
whitespace into consideration since it wasn't clear why your v6 parser
pays attention to embedded spaces, whereas your earlier versions did
not.

> +       /* Don't add Accept-Language header if no language is preferred. */
> +       if (q >= max_q) {
> +               strbuf_reset(buf);
> +               return;
> +       }
> +
> +       /* Add '*' with minimum q-factor greater than 0.0. */
> +       strbuf_addstr(buf, ", *");
> +       strbuf_addf(buf, q_format, 1);
> +}
> +
> +/*
> + * Get an Accept-Language header which indicates user's preferred languages.
> + *
> + * This function always return non-NULL string as strbuf_detach() does.

A couple comments:

It's not necessary to explain the public API in terms of an
implementation detail. Callers of this function don't care and don't
need to know that the value was constructed via strbuf, nor that it is
somehow dependent upon the behavior of the underlying implementation
of strbuf_detach().

This is a somewhat unusual contract. It's much more common and
idiomatic in C to return NULL as an indication of "no preference" (or
"failure") than to return an empty string.

> + *
> + * Examples:
> + *   LANGUAGE= -> ""
> + *   LANGUAGE=ko:en -> "Accept-Language: ko, en; q=0.9, *; q=0.1"
> + *   LANGUAGE=ko_KR.UTF-8:sr@latin -> "Accept-Language: ko-KR, sr; q=0.9, *; q=0.1"
> + *   LANGUAGE=ko LANG=en_US.UTF-8 -> "Accept-Language: ko, *; q=0.1"
> + *   LANGUAGE= LANG=en_US.UTF-8 -> "Accept-Language: en-US, *; q=0.1"
> + *   LANGUAGE= LANG=C -> ""
> + */
> +static const char *get_accept_language(void)
> +{
> +       if (!cached_accept_language) {
> +               struct strbuf buf = STRBUF_INIT;
> +               write_accept_language(&buf);
> +               cached_accept_language = strbuf_detach(&buf, NULL);
> +               strbuf_release(&buf);

Junio already mentioned that strbuf_release() is unnecessary following
strbuf_detach().

> +       }
> +
> +       return cached_accept_language;
> +}
> +
>  /* http_request() targets */
>  #define HTTP_REQUEST_STRBUF    0
>  #define HTTP_REQUEST_FILE      1
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html