Re: Should the IETF be condoning, even promoting, BOM pollution?

Matthew Kerwin <matthew@xxxxxxxxxxxxx> · Wed, 20 Sep 2017 07:06:42 +1000

On 20 Sep. 2017 03:29, "Julian Reschke" <julian.reschke@xxxxxx> wrote:
On 2017-09-19 19:17, John C Klensin wrote:

--On Tuesday, September 19, 2017 7:05 PM +0200 Julian Reschke <julian.reschke@xxxxxx> wrote:

Not *defaulting* to UTF-8 is not a bug. It may not be what our preference is nowadays, but that's it.

See above. Slightly different discussion. But I note that it isn't hard to distinguish between Latin-1 and UTF-8 without relying on BOM -- the hard problem there involves distinguishing between the various species of 8859 and assorted code pages.

...

I agree that Notepad *could* be (heuristically) sniffing for UTF-8, and it would be interesting to hear why Microsoft doesn't do that.
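
For illustration, a BOM-free check of the kind discussed above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `sniff_utf8` and its return labels are hypothetical, and it says nothing about what Notepad actually does. It relies on the fact that Latin-1 text containing non-ASCII bytes is very unlikely to also be well-formed UTF-8.

```python
def sniff_utf8(data: bytes) -> str:
    """Guess an encoding label for raw bytes without requiring a BOM.

    Hypothetical helper: returns 'ascii', 'utf-8', or 'latin-1'.
    """
    # A leading UTF-8 BOM is accepted as a hint, but never required.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    try:
        # Strict decoding rejects stray lead bytes and bad continuations,
        # which is what typical Latin-1 text with accents looks like.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    # Pure 7-bit input is valid under both encodings.
    return "ascii" if data.isascii() else "utf-8"
```

Note that the fallback label is a guess: as the quoted text says, telling the 8859 variants and code pages apart from each other is the hard part, and this sketch does not attempt it.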

Historically, because Windows uses/d UTF-16. See this decade-old blog post, and particularly note the `dir > results.txt` snippet [1].

By the way, when it comes to Notepad's heuristics: create a text file that says "Bill fed the goats" (without the quotes), save it, then open it again. Unless IsTextUnicode has been updated recently, this should break the sniffer.
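
To see why that particular sentence is ambiguous, here is a small Python sketch. It does not call or reproduce IsTextUnicode; it only shows that the 18 ASCII bytes also form valid UTF-16LE, and that every resulting code point lands in the CJK Unified Ideographs range (U+4E00 to U+9FFF). That a statistical sniffer would therefore prefer the UTF-16 reading is my assumption, not something the code proves.

```python
text = "Bill fed the goats"

# What actually ends up on disk when the file is saved as ANSI/ASCII.
raw = text.encode("ascii")

# Reinterpret the same bytes as little-endian UTF-16, two bytes per unit.
misread = raw.decode("utf-16-le")

# Check whether every resulting character is a CJK Unified Ideograph.
in_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread)

print(len(raw), len(misread), in_cjk)  # prints: 18 9 True
```

An even byte count and no byte pairs that decode to surrogates are what make the reinterpretation succeed cleanly.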

[1] https://blogs.msdn.microsoft.com/oldnewthing/20070417-00/?p=27223

Cheers
--
Matthew Kerwin