Re: [Last-Call] Last Call: <draft-koster-rep-06.txt> (Robots Exclusion Protocol) to Informational RFC

"John Levine" <johnl@xxxxxxxxx> · 28 Feb 2022 17:29:30 -0500

It appears that Michael Richardson  <mcr+ietf@xxxxxxxxxxxx> said:
>It's good to see robots.txt coming to the IETF.

Agreed. I also share Mnot's question about whether we have reports from
other search engines that they follow this spec. Judging from my own web
server logs, from tweaking my robots.txt files, and from the pages where
the crawlers explain their crawling practices, I think they do, but
surely we know people at a few other search engines we could ask. I'd be
particularly interested to hear who interprets the * and $ pattern
metacharacters.
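
As a rough illustration of what "interpreting" the two metacharacters
means, here is a minimal Python sketch. It assumes that * matches any run
of characters and that a trailing $ anchors the pattern to the end of the
path; the function name rule_matches and the sample paths are made up for
this example, and no particular crawler is claimed to work this way.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path.

    Assumed semantics (hypothetical helper, not taken from any crawler):
    '*' matches any run of characters, and a trailing '$' anchors the
    pattern to the end of the path. Otherwise the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Anchored patterns must cover the whole path; others only the start.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None

print(rule_matches("/fish*.php", "/fish/salmon.php"))
print(rule_matches("/*.php$", "/index.php"))
print(rule_matches("/*.php$", "/index.php?x=1"))
```

A crawler that ignores the metacharacters would instead treat * and $ as
literal characters in the path, which is the difference worth asking the
other search engines about.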

Section 2.2.2 has this example of a path with a Unicode character:

   | /foo/bar/U+E38384 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |

There is no U+E38384 character, but the UTF-8 encoding of the Japanese
katakana character U+30C4 is hex E3 83 84, so I'm guessing that's what
they meant.
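
The arithmetic is easy to check. The following Python sketch, which uses
only the standard library, encodes U+30C4 as UTF-8, percent-encodes the
example path, and shows that E38384 is not a valid code point. The
variable names are arbitrary.

```python
from urllib.parse import quote

ch = "\u30c4"  # KATAKANA LETTER TU, presumably the character intended
utf8 = ch.encode("utf-8")
print(utf8.hex(" ").upper())     # the three UTF-8 bytes, in hex
print(quote("/foo/bar/" + ch))   # the percent-encoded path

# Unicode code points stop at U+10FFFF, so E38384 cannot be one.
try:
    chr(0xE38384)
except ValueError:
    print("U+E38384 is outside the Unicode range")
```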

The "Crawl-Delay" line is ignored by Google but followed by many other
search engines such as Bing and Yandex. I would describe it in the draft,
with a note that only some spiders honor it.
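
For a sense of how a consumer can read the field, here is a small sketch
using Python's standard urllib.robotparser module, which exposes a
crawl_delay() method. The robots.txt content and the "ExampleBot" agent
name are made up for this example; it says nothing about how any search
engine's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline instead of fetched.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# crawl_delay() returns None when no Crawl-delay applies to the agent.
delay = parser.crawl_delay("ExampleBot")
print(delay)
print(parser.can_fetch("ExampleBot", "/private/page.html"))
```

A spider that ignores the line would simply never ask for the delay, so
its presence does no harm to those that skip it.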

Most importantly, the copyright license is broken. At the top it has
the "no derivatives" license, which is fine, but it also has code
sections in <CODE BEGINS>. The TLP specifically says that the code
license only applies to RFCs that use the regular license, not any other
license. In this case the "code" sections are short snippets of sample
robots files with made-up names and paths, so I would take out the code flags.

R's,
John
