Re: WORD to text problem

Alan Mead <amead@xxxxxxxxxxxx> · Tue, 5 May 2020 11:54:07 -0500

    On 5/5/2020 10:08 AM, Tedd Sperling wrote:

        On May 3, 2020, at 3:29 PM, Ashickur Rahman <ashickur.noor@xxxxxxxxx> wrote:

Did you try <textarea> for input? I believe it will eliminate all text formatting.

      Hmmm?  Why didn’t I think of that?

Why? Because I have had clients in the past who have used Word and it caused problems. But now, as per your suggestion, I’ve tested it and the problem seems to have disappeared. Interesting...

Thanks 

Tedd

PS: Who said you can’t teach an old dog new tricks? Moof!

    Tedd,

    I'm working on this problem myself (but my project is in Perl). I
    think the OS actually picks what kind of data to paste based on the
    target, but things like italics and bold get removed because they're
    not characters. If the paste target advertised itself as having rich
    text, then I think Windows (for example) would paste content with
    bold, italics, font styles and sizes, etc. 

    Those effects don't get pasted because they're not characters,
    but the copied characters themselves are pasted, and that remains
    an issue. The bane of my existence is "smart quotes," but bullets,
    ellipses, dashes, etc. are also an issue, as are the characters
    required for non-English alphabets, which is particularly a
    problem for certain words (resumé) and for names. These are all
    extended "Latin1" characters (i.e., not ASCII); their byte
    representation differs between encodings, and this is what causes
    the problem.

    And you can use a regex to fix these, but you need to guess the
    encoding. A regex that works for Latin1 will fail for UTF8 (in my
    experience).
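To illustrate why a byte-level regex written for a single-byte encoding can misfire on UTF-8, here is a minimal sketch. The helper name fix_bytes and the sample string are made up for illustration; only two of the substitutions are applied.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: applies two single-byte substitutions to raw bytes.
sub fix_bytes {
    my ($bytes) = @_;
    $bytes =~ s/\x85/.../g;    # Windows-1252 ellipsis
    $bytes =~ s/\xa9/(c)/g;    # copyright sign
    return $bytes;
}

# The same text ("cafe" with an acute e, then an ellipsis) in two encodings.
my $text   = "caf\x{e9}\x{2026}";
my $cp1252 = encode('cp1252', $text);   # bytes: 63 61 66 E9 85
my $utf8   = encode('UTF-8',  $text);   # bytes: 63 61 66 C3 A9 E2 80 A6

# In Windows-1252 the ellipsis byte is replaced as intended.
my $fixed_cp1252 = fix_bytes($cp1252);

# In UTF-8 the byte A9 is the second half of the two-byte "e acute",
# so the copyright rule fires in the middle of a character and corrupts it,
# while the ellipsis (E2 80 A6) is not matched at all.
my $fixed_utf8 = fix_bytes($utf8);

printf "%s\n", unpack('H*', $fixed_cp1252);
printf "%s\n", unpack('H*', $fixed_utf8);
```

The hex dump makes the damage visible: the UTF-8 result contains a stray C3 byte followed by "(c)".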

    I've been following your thread to see what people suggest about
    this. I guess the basic "Joel Spolsky" advice for programmers is to
    always know the encoding and use the correct one, which makes
    perfect sense if you're the one controlling the data. But you and I
    are in a situation where the user gets to pick, the encoding will
    vary, and I don't think users can tell us which one they used (even
    if we asked, they wouldn't know). I haven't solved this problem, but
    I assume that we have to detect the encoding.
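One common heuristic for detecting the encoding, sketched here under assumptions: try a strict UTF-8 decode first and, if the bytes are not valid UTF-8, fall back to Windows-1252. The helper name decode_guess is hypothetical, and the fallback encoding is a guess that suits Word-on-Windows input; text in another single-byte encoding would be decoded incorrectly.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn bytes of unknown encoding into a character string.
sub decode_guess {
    my ($bytes) = @_;

    # decode() with a CHECK argument may modify its input, so work on a copy.
    my $copy = $bytes;

    # Strict UTF-8 decode: dies on any malformed sequence.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;

    # Assumption: anything that is not valid UTF-8 is Windows-1252.
    return decode('cp1252', $bytes);
}
```

Once the input is a decoded character string, replacements can be written against Unicode code points (for example \x{201c} for a left double quote) instead of encoding-specific bytes. The heuristic is not foolproof: a short Windows-1252 string can occasionally also be valid UTF-8.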

    Anyway, in case it helps you, I have included a long list of regex
    replacements I developed (in Perl) to replace problematic
    characters. These regexes look for hex codes (for example, 0x85 is
    Word's ellipsis character) and change them into an approximate
    (sometimes very approximate) string that can be represented in
    ASCII. The list covers all the characters that I thought I might
    encounter from Word. These might be too specific to Word for
    Windows using English, and they fail for some files with an
    unexpected encoding. Let me know if the regex is unclear (I think
    Perl and PHP share a regex syntax, but the syntax to perform a
    regex differs?).

    I was actually hoping someone would chime in and describe an easier
    way to deal with this issue. I hope this is not the best way to fix
    this problem.
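One possibly shorter route, offered only as a sketch: work on an already-decoded character string, map the Word punctuation through a small lookup table, and let Unicode decomposition strip the accents. The table %ascii_for and the helper to_ascii are hypothetical names, and the table is deliberately incomplete. Unicode::Normalize ships with Perl.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical replacement table, keyed by Unicode character (not by byte).
my %ascii_for = (
    "\x{2018}" => "'",   "\x{2019}" => "'",    # single smart quotes
    "\x{201c}" => '"',   "\x{201d}" => '"',    # double smart quotes
    "\x{2013}" => '--',  "\x{2014}" => '--',   # en and em dashes
    "\x{2026}" => '...',                       # ellipsis
    "\x{2022}" => '*',                         # bullet
    "\x{a0}"   => ' ',                         # non-breaking space
);

# Hypothetical helper: expects a decoded character string, not raw bytes.
sub to_ascii {
    my ($text) = @_;

    # Replace each non-ASCII character found in the table; keep the rest.
    $text =~ s{([^\x00-\x7f])}{ $ascii_for{$1} // $1 }ge;

    # Split accented letters into base letter + combining mark,
    # then drop the combining marks.
    $text = NFD($text);
    $text =~ s/\p{Mn}//g;

    return $text;
}

print to_ascii("\x{201c}r\x{e9}sum\x{e9}\x{201d}\x{2026}"), "\n";
```

Letters with no canonical decomposition (such as the AE ligature, eth, thorn, or ess-zed) pass through unchanged in this sketch and would still need explicit table entries.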

    -Alan

    # fix HTML entities

    $string =~ s/&#821[67];/'/gsm; # single quote

    $string =~ s/&#821[12];/--/gsm; # en and em dashes

    $string =~ s/&#821[89];/'/gsm; # single quote

    $string =~ s/&#822[0123];/"/gsm; # double quote

    $string =~ s/&#8226;/*/gsm; # bullet

    $string =~ s/&#8230;/.../gsm; # ellipsis

    # fix extended Latin 1/ISO-8859-1 characters
    # (\x85 and \x91-\x97 are Windows-1252 code points, not ISO-8859-1)

    $string =~ s/\x85/.../gsm; # ellipsis

    $string =~ s/[\x91\x92]/'/gsm; # single quotes

    $string =~ s/[\x93\x94]/"/gsm; # double quotes

    $string =~ s/\x95/*/gsm;  # bullet

    $string =~ s/[\x96\x97]/--/gsm; # dashes

    $string =~ s/\xa0/ /gsm;  # non-breaking space

    $string =~ s/\xa1/\!/g;  # inverted exclamation mark

    $string =~ s/\xa2//g;  # cent sign

    $string =~ s/\xa3//g;  # pound sign

    $string =~ s/\xa4//g;  # currency sign

    $string =~ s/\xa5//g;  # yen/yuan sign

    $string =~ s/\xa6/\|/g;  # broken bar (rendered as a pipe)

    $string =~ s/\xa7//g;  # section sign

    $string =~ s/\xa8//g;  # diaeresis

    $string =~ s/\xa9/(c)/g;  # copyright sign

    $string =~ s/\xaa//g;  # feminine ordinal indicator

    $string =~ s/\xab/"/g;  # left-pointing double angle quotation mark

    $string =~ s/\xac/-/g;  # not sign

    $string =~ s/\xad//g;  # soft hyphen = discretionary hyphen

    $string =~ s/\xae/(R)/g;  # registered sign = registered trade mark sign

    $string =~ s/\xaf/-/g;  # macron = spacing macron = overline = APL overbar

    $string =~ s/\xb0//g;  # degree sign

    $string =~ s/\xb1/+-/g;  # plus or minus

    $string =~ s/\xb2/^2/g;  # squared

    $string =~ s/\xb3/^3/g;  # cubed

    $string =~ s/\xb4/'/g;  # acute accent

    $string =~ s/\xb5/u/g;  # micro sign

    $string =~ s/\xb6//g;  # paragraph sign

    $string =~ s/\xb7/./g;  # middle dot

    $string =~ s/\xb8/,/g;  # cedilla

    $string =~ s/\xb9/^1/g;  # superscript 1

    $string =~ s/\xba//g;  # masculine ordinal indicator

    $string =~ s/\xbb/"/g;  # right-pointing double angle quotation mark

    $string =~ s/\xbc/1\/4/g;  # fraction one quarter

    $string =~ s/\xbd/1\/2/g;  # fraction one half

    $string =~ s/\xbe/3\/4/g;  # fraction three quarters

    $string =~ s/\xbf/?/g;  # inverted question mark

    $string =~ s/[\xc0\xc1\xc2\xc3\xc4\xc5]/A/g;  # various A's

    $string =~ s/\xc6/AE/g;  # capital ligature AE

    $string =~ s/\xc7/C/g;  # capital letter C with cedilla

    $string =~ s/[\xc8\xc9\xca\xcb]/E/g;  # various E's

    $string =~ s/[\xcc\xcd\xce\xcf]/I/g;  # various I's

    $string =~ s/\xd0/D/g;  # capital eth

    $string =~ s/\xd1/N/g;  # N with tilde

    $string =~ s/[\xd2\xd3\xd4\xd5\xd6\xd8]/O/g;  # various O's

    $string =~ s/\xd7/x/g;  # multiplication sign

    $string =~ s/[\xd9\xda\xdb\xdc]/U/g;  # various U's

    $string =~ s/\xdd/Y/g;  # Y with acute

    $string =~ s/\xde/Th/g;  # capital THORN

    $string =~ s/\xdf/ss/g;  # (German) ess-zed

    $string =~ s/[\xe0\xe1\xe2\xe3\xe4\xe5]/a/g;  # a with various decorations

    $string =~ s/\xe6/ae/g;  # small ligature ae

    $string =~ s/\xe7/c/g;  # small letter c with cedilla

    $string =~ s/[\xe8\xe9\xea\xeb]/e/g;  # e with various decorations

    $string =~ s/[\xec\xed\xee\xef]/i/g;  # i with various decorations

    $string =~ s/\xf0/./g;  # small letter eth (not in Roman alphabet)

    $string =~ s/\xf1/n/g;  # small letter n with tilde

    $string =~ s/[\xf2\xf3\xf4\xf5\xf6\xf8]/o/g;  # o with various decorations

    $string =~ s/\xf7/\//g;  # division symbol

    $string =~ s/[\xf9\xfa\xfb\xfc]/u/g;  # u with various decorations

    $string =~ s/[\xfd\xff]/y/g;  # y with various decorations

    $string =~ s/\xfe/./g;  # small letter thorn (not in Roman alphabet)

    -- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

http://www.alanmead.org

Keep away from people who try to belittle your ambitions. Small
people always do that, but the really great make you feel that
you, too, can become great.

-- Mark Twain