[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets): W3C I18N Review

Rob Sayre <sayrer@xxxxxxxxx> · Wed, 12 Feb 2025 10:49:19 -0800

Oh, I don't care about preserving the current text. I went back to look at why it got in there in the first place, to make sure I understood.

jshell> HexFormat.of().formatHex(new String(new char[] { 0xDEAD }).getBytes(UTF_8))
$4 ==> "3f"

It seems like the original APIs from Java 1.4 and earlier have this issue.
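
For comparison, here is a minimal sketch of both directions. It assumes Java 17 or later (for HexFormat); the class and method names are made up for illustration and are not taken from the draft or from any Java API. It encodes an unpaired surrogate to UTF-8, decodes an ill-formed UTF-8 byte, and prints the UTF-8 encoder's default replacement bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical demo class; the names here are illustrative only.
public class ReplacementDirections {
    // Encoding direction: String -> UTF-8 bytes, returned as lowercase hex.
    static String encodeToHex(String s) {
        return HexFormat.of().formatHex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Decoding direction: UTF-8 bytes -> String.
    // This String constructor replaces malformed input rather than throwing.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // The UTF-8 encoder's default replacement bytes, as lowercase hex.
    static String defaultEncoderReplacementHex() {
        return HexFormat.of().formatHex(
                StandardCharsets.UTF_8.newEncoder().replacement());
    }

    public static void main(String[] args) {
        // U+DEAD is an unpaired low surrogate, so it cannot be encoded.
        String loneSurrogate = String.valueOf((char) 0xDEAD);
        System.out.println(encodeToHex(loneSurrogate)); // 3f, i.e. "?"

        // The byte 0xFF never appears in well-formed UTF-8.
        String decoded = decode(new byte[] { (byte) 0xFF });
        System.out.println(Integer.toHexString(decoded.charAt(0))); // fffd

        System.out.println(defaultEncoderReplacementHex()); // 3f
    }
}
```

So the "?" shows up when going from a String to bytes, while going from bytes to a String gives U+FFFD.
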
thanks,
Rob

On Wed, Feb 12, 2025 at 10:37 AM Addison Phillips <addisoni18n@xxxxxxxxx> wrote:

    I think my original comment still stands: it seems infelicitous
      to call out Java's behavior, when the use of U+003F as a
      replacement character depends on the situation and Java's
      encoders/decoders behave similarly or identically to many
      others, and since the decoders that produce a String pretty much
      always use U+FFFD.

    Rob, note that your link is to `CharsetEncoder`, which goes
      from a (UTF-16) String or `CharSequence` into a
      specific character encoding (which is the opposite direction from
      decoding a byte stream, which is what this section appears to be
      about). The actual handling text in `CharsetEncoder` says
      (emphasis added):

      > The replacement is initially set to the encoder's
            default replacement, which often (but not always)
        has the initial value { (byte)'?' };
        its value may be changed via the replaceWith
        method. 

    In any case, your text currently says:

      > Responding to that risk, [UNICODE] section 3.2 recommends
        dealing with ill-formed byte sequences by signaling an error, or
        replacing problematic code points, ideally with "�" (U+FFFD,
        REPLACEMENT CHARACTER), although some popular software
        platforms, notably Java, use "?".

    I would suggest instead:

      > Responding to that risk, [UNICODE] section 3.2 recommends
        dealing with ill-formed byte sequences by signaling an error or
        replacing problematic code points, ideally with "�" (U+FFFD,
        REPLACEMENT CHARACTER). Some implementations that decode byte
        streams into characters and some encoders that produce character
        encodings (including UTF-8) sometimes use other characters,
        notably "?" (U+003F QUESTION MARK), to replace malformed or
        illegal character sequences. Often the replacement character
        depends on the specific character encoding. A few
        implementations use U+001A (SUB), which is a C0 control
        character.

    You might also call out encoding.spec.whatwg.org, since many Web
      coders follow it.

    thanks,
    Addison

    On 2/12/2025 9:19 AM, Rob Sayre wrote:

      Yeah, here's the link:

        <https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/charset/CharsetEncoder.html#:~:text=How%20an%20encoding%20error%20is%20handled%20depends>

        That's from Java 1.4, and I doubt they can change it. I'm
          sure the newer APIs use �.

        Whether that needs to be covered is not important to me.

        thanks,
        Rob

        On Wed, Feb 12, 2025 at 9:13 AM Tim Bray <tbray@xxxxxxxxxxxxxx> wrote:

             Eh, I guess that language should be changed
              to say Java in some cases uses “?”.  Or maybe just lose
              the whole “although some popular software platforms…”
              phrase. That � exists and is
                designed for this purpose is just a fact and worth
                stating. -T

              On Feb 12, 2025 at 9:06:59 AM, Rob Sayre <sayrer@xxxxxxxxx> wrote:

                    On Mon, Feb 10, 2025 at 1:09 PM Addison Phillips <addisoni18n@xxxxxxxxx> wrote:

                        #1983: Replacement character
                          examples

                          replacing problematic code
                            points, ideally with "�" (U+FFFD,
                            REPLACEMENT CHARACTER), although some
                            popular software platforms,
                            notably Java, use "?".

                        This is probably incorrect. Java
                          replaces with U+FFFD in most Unicode
                          processing (including decoding from legacy
                          encodings). (Encoding to legacy
                          encodings in Java uses "?"). There are other
                          places, such as certain browsers, where "?" is
                          used in a Unicode context.

                    Apologies if I am misinterpreting, but I was
                      surprised to learn that Java does use "?"
                      sometimes. See this example:

                    https://mailarchive.ietf.org/arch/msg/art/ct0kFnolvi6WHJoTCQC6g1FgxHE/ 

                    thanks,
                    Rob

                   -- 

                    last-call mailing list -- last-call@xxxxxxxx

                    To unsubscribe send an email to last-call-leave@xxxxxxxx

    -- 
Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.
