Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names

Jakub Narębski <jnareb@xxxxxxxxx> · Fri, 16 May 2014 09:54:58 +0200

On Fri, May 16, 2014 at 3:26 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Jakub Narębski <jnareb@xxxxxxxxx> writes:
>> On Thu, May 15, 2014 at 9:38 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>> Jakub Narębski <jnareb@xxxxxxxxx> writes:
>>>
>>>> Writing test for this would not be easy, and require some HTML
>>>> parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery,
>>>> ... or low level HTML::TreeBuilder, or other low level parser).
>>>
>>> Hmph.  Is it more than just looking for a specific run of %xx we
>>> would expect to see in the output of the tree view for a repository
>>> in which there is one tree with non-ASCII name?
>>
>> There is if we want to check (in non-fragile way) that said
>> specific run is in 'href' *attribute* of 'a' element (link target).
>
> Correct, but is "where does it appear" the question we are
> primarily interested in, wrt this breakage and its fix?

That of course depends on how we want to test gitweb output.
The simplest solution, comparing against known output with the
fragile / variable elements masked out, could be done quickly...
but any change in output (even one that doesn't change functionality
or the visible output) requires regenerating the expected output
to test against, which might itself be a source of errors in the
test suite.
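
To make the masking idea concrete, here is a rough sketch (in Python only
for brevity, not how the test suite would do it). The timestamp format,
the placeholder tokens and the function names are all assumptions made up
for illustration; they are not taken from gitweb's real output.

```python
import re

# Assumed timestamp layout; real gitweb pages may format dates differently.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def mask_volatile(html, trash_dir):
    """Replace parts of the page that vary between runs with fixed tokens."""
    html = html.replace(trash_dir, "@TRASH@")   # path to the trash directory
    return TIMESTAMP.sub("@TIMESTAMP@", html)   # date/time of the test run

def matches_expected(actual_html, expected_masked, trash_dir):
    """Compare a freshly generated page against stored, already masked output."""
    return mask_volatile(actual_html, trash_dir) == expected_masked
```

The stored expected output would still have to be regenerated whenever
the page layout changes, which is the weak point of this approach.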

Another simple solution, grepping for expected strings, is also
easy to create, but has the disadvantage of being only a positive
test: you cannot [easily] test that there is no *wrong* output,
only that the right string exists somewhere.
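
For comparison, a minimal sketch of what a parser-based check could look
like, again in Python just for illustration. It looks only at 'href'
attributes of 'a' elements, and it can also assert the absence of a wrong
form. The "wrong" form used here (UTF-8 bytes misread as Latin-1 and
encoded again) is my assumption about one plausible failure mode, not
necessarily the breakage in question; the names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import quote

class HrefCollector(HTMLParser):
    """Collect the value of every href attribute found on <a> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)

def link_encoding_report(html, name):
    """Sort link targets by whether they carry the right or a wrong encoding of name."""
    collector = HrefCollector()
    collector.feed(html)
    raw = name.encode("utf-8")
    good = quote(raw)                                     # percent-encoded UTF-8
    double = quote(raw.decode("latin-1").encode("utf-8"))  # assumed failure mode
    return {
        "right": [h for h in collector.hrefs if good in h],
        "wrong": [h for h in collector.hrefs if double in h],
    }
```

A test would then require "right" to be non-empty and "wrong" to be
empty, and a matching string outside of any link target would not count.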

> If gitweb output has some volatile parts that do not depend on the
> contents of the Git test repository (e.g. showing contents of
> /etc/motd, date/time of when the test was run, or the full pathname
> leading to the trash directory), then preparing a tree whose name is
> äéìõû and making sure that the properly encoded version of äéìõû
> appears anywhere in the output may not be sufficient to validate
> that we got the encoding right, as that string may appear in the
> parts that are totally unrelated to the contents being shown and not
> under our control.  But is that really the case?

Well, I guess that any test is better than no test (though OTOH
the Heartbleed and "goto fail" bugs show the importance of negative
tests).

> Also we may introduce a bug and misspell the attr name and produce
> an anchor element with hpef attribute with the properly encoded URL
> in it, and your "parse HTML properly" approach would catch it, but
> is that the kind of breakage under discussion?  You hinted at new
> tests for UTF-8 encoding in the other message in the thread earlier,
> and I've been assuming that we were talking about the encoding test,
> not a test to catch s/href/hpef/ kind of breakage.

One of the tests possible with an HTML parser (e.g. WWW::Mechanize::CGI)
is to check that all [internal] links lead to 200-OK pages, which
incidentally would also be a test against this breakage.
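
Roughly, and only as a sketch: the code below is Python rather than
WWW::Mechanize::CGI, and 'fetch_status' is a hypothetical callback that
the test harness would have to provide (e.g. by running gitweb as a CGI
script and returning the HTTP status for a given link).

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkParser(HTMLParser):
    """Gather link targets from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

def is_internal(href):
    """Treat links without a scheme or host as pointing back into gitweb."""
    parts = urlsplit(href)
    return not parts.scheme and not parts.netloc

def broken_internal_links(html, fetch_status):
    """Return the internal links for which fetch_status(href) is not 200."""
    parser = LinkParser()
    parser.feed(html)
    return [h for h in parser.hrefs if is_internal(h) and fetch_status(h) != 200]
```

An empty result for the tree view of a repository with a non-ASCII path
would then be the passing condition.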

-- 
Jakub Narebski