jshell> HexFormat.of().formatHex(new String(new char[] { 0xDEAD }).getBytes(UTF_8))
$4 ==> "3f"
It seems like the original Java 1.4 and older APIs have this issue.
thanks,
Rob
I think my original comment still stands: it seems infelicitous to call out Java's behavior when the use of U+003F as a replacement character depends on the situation, when Java's encoders/decoders behave similarly or identically to many others, and when the decoders that produce a String pretty much always use U+FFFD.
Rob, note that your link is to `CharsetEncoder`, which goes from a (UTF-16) String or `CharSequence` into a specific character encoding. That is the opposite direction from decoding a byte stream, which is what this section appears to be about. The actual handling text in `CharsetEncoder` says (emphasis added):
> The replacement is initially set to the encoder's default replacement, which often (but not always) has the initial value { (byte)'?' }; its value may be changed via the replaceWith method.
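To make the direction distinction concrete, here is a minimal sketch, assuming Java 17 or later (for `HexFormat`). The class and method names are hypothetical and only for illustration. It encodes an unpaired surrogate to UTF-8, decodes an ill-formed byte from UTF-8, and reads the UTF-8 encoder's default replacement bytes.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical demo class; not taken from the draft or from the JDK.
class ReplacementDemo {
    // Encoding direction (String -> bytes): an unpaired surrogate cannot be
    // expressed in UTF-8, so String.getBytes substitutes the charset's
    // default replacement bytes.
    public static String encodeLoneSurrogate() {
        String lone = new String(new char[] { 0xDEAD });
        return HexFormat.of().formatHex(lone.getBytes(StandardCharsets.UTF_8));
    }

    // Decoding direction (bytes -> String): 0xFF is never valid in UTF-8,
    // so the String constructor substitutes the decoder's replacement string.
    public static String decodeIllFormedByte() {
        byte[] bad = { (byte) 0xFF };
        return new String(bad, StandardCharsets.UTF_8);
    }

    // The replacement bytes a fresh UTF-8 encoder starts with.
    public static String defaultEncoderReplacement() {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        return HexFormat.of().formatHex(encoder.replacement());
    }

    public static void main(String[] args) {
        System.out.println(encodeLoneSurrogate());        // prints 3f
        String decoded = decodeIllFormedByte();
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // prints fffd
        System.out.println(defaultEncoderReplacement());  // prints 3f
    }
}
```

The same platform thus yields "?" in one direction and U+FFFD in the other, which is why the direction matters when describing what Java "uses".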
In any case, your text currently says:
> Responding to that risk, [UNICODE] section 3.2 recommends dealing with ill-formed byte sequences by signaling an error, or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER), although some popular software platforms, notably Java, use "?".
I would suggest instead:
> Responding to that risk, [UNICODE] section 3.2 recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER). Some implementations that decode byte streams into characters and some encoders that produce character encodings (including UTF-8) sometimes use other characters, notably "?" (U+003F QUESTION MARK), to replace malformed or illegal character sequences. Often the replacement character depends on the specific character encoding. A few implementations use U+001A (SUB), which is a C0 control character.
You might also call out encoding.spec.whatwg.org, since many Web coders follow it.
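For illustration, the two options named in the suggested text (signaling an error, or replacing) can be sketched with Java's `CharsetDecoder` and `CodingErrorAction`. This is a sketch under stated assumptions, not a recommendation: the class and method names are hypothetical, and other platforms expose the same choice differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical demo class; not taken from the draft or from the JDK.
class StrictDecodeDemo {
    // Option 1: signal an error. REPORT makes decode() throw on ill-formed
    // input; the exception is mapped to an empty Optional here.
    public static Optional<String> decodeStrict(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return Optional.of(decoder.decode(ByteBuffer.wrap(input)).toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    // Option 2: replace. REPLACE substitutes the decoder's replacement
    // string (U+FFFD by default) for each ill-formed sequence.
    public static String decodeReplacing(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = { 0x61, (byte) 0xFF, 0x62 }; // 'a', ill-formed byte, 'b'
        System.out.println(decodeStrict(input).isPresent());       // prints false
        System.out.println(decodeReplacing(input).equals("a\uFFFDb")); // prints true
    }
}
```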
thanks,
Addison
On 2/12/2025 9:19 AM, Rob Sayre wrote:
Yeah, here's the link:
That's from Java 1.4, and I doubt they can change it. I'm sure the newer APIs use �.
Whether that needs to be covered is not important to me.
thanks,
Rob
On Wed, Feb 12, 2025 at 9:13 AM Tim Bray <tbray@xxxxxxxxxxxxxx> wrote:
Eh, I guess that language should be changed to say Java in some cases uses “?”. Or maybe just lose the whole “although some popular software platforms…” phrase. That � exists and is designed for this purpose is just a fact and worth stating. -T
On Feb 12, 2025 at 9:06:59 AM, Rob Sayre <sayrer@xxxxxxxxx> wrote:
On Mon, Feb 10, 2025 at 1:09 PM Addison Phillips <addisoni18n@xxxxxxxxx> wrote:
#1983: Replacement character examples
> replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER), although some popular software platforms, notably Java, use "?".
This is probably incorrect. Java replaces with U+FFFD in most Unicode processing (including decoding from legacy encodings). (Encoding to legacy encodings in Java uses "?".) There are other places, such as certain browsers, where "?" is used in a Unicode context.
Apologies if I am misinterpreting, but I was surprised to learn that Java does use "?" sometimes. See this example:
thanks,
Rob
--
last-call mailing list -- last-call@xxxxxxxx
To unsubscribe send an email to last-call-leave@xxxxxxxx
--
Addison Phillips
Chair (W3C Internationalization WG)
Internationalization is not a feature. It is an architecture.