Re: WORD to text problem


 



On 5/5/2020 10:08 AM, Tedd Sperling wrote:

      
On May 3, 2020, at 3:29 PM, Ashickur Rahman <ashickur.noor@xxxxxxxxx> wrote:

Did you try <textarea> for input? I believe it will eliminate all text formatting.

Hmmm?  Why didn’t I think of that?

Why? Because I have had clients in the past who have used Word and it caused problems. But now, as per your suggestion, I’ve tested it and the problem seems to have disappeared. Interesting...

Thanks 

Tedd

PS: Who said you can’t teach an old dog new tricks? Moof!

Tedd,

I'm working on this problem myself (but my project is in Perl). I think the OS actually picks what kind of data to paste based on the target, but things like italics and bold get removed because they're not characters. If the paste target advertised itself as having rich text, then I think Windows (for example) would paste content with bold, italics, font styles and sizes, etc.

Those effects don't get pasted because they aren't characters, but the characters themselves are pasted, and that remains an issue. The bane of my existence is "smart quotes", but bullets, ellipses, dashes, etc. are also an issue, as are the characters required for non-English alphabets, which is a particular problem for certain words (resumé) and for names. None of these are ASCII. The accented letters are extended "Latin1" characters, and Word's quotes, dashes, bullets and ellipses sit in the 0x80-0x9F range that Windows-1252 adds on top of Latin1. Their byte values differ between encodings, and that is what causes the problem.

You can use a regex to fix these, but you need to guess the encoding. A regex that works for Latin1 will fail for UTF8 (in my experience), because UTF8 stores each of these characters as two or three bytes rather than one.
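To make the encoding point concrete, here is a small sketch (mine, not part of the list below) that encodes one left smart quote both ways and tries the same byte-level pattern on each. It only assumes the core Encode module; hex_bytes is just a display helper I made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+201C is the left "smart" double quote that Word inserts.
my $quote = "\x{201C}";

# The same character becomes different bytes in different encodings.
my $cp1252 = encode('cp1252', $quote);   # one byte
my $utf8   = encode('UTF-8',  $quote);   # three bytes

# Display helper (hypothetical name): render a byte string as hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

printf "cp1252: %s\n", hex_bytes($cp1252);
printf "UTF-8:  %s\n", hex_bytes($utf8);

# A byte-level pattern written for the single-byte form misses the UTF-8 form.
my $hits_cp1252 = ($cp1252 =~ /\x93/) ? 1 : 0;
my $hits_utf8   = ($utf8   =~ /\x93/) ? 1 : 0;
printf "matches /\\x93/: cp1252=%d UTF-8=%d\n", $hits_cp1252, $hits_utf8;
```

The single-byte form is 93 and the UTF-8 form is E2 80 9C, so a pattern looking for \x93 only ever sees the first one.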

I've been following your thread to see what people suggest about this. I guess the basic "Joel Spolsky" advice for programmers is to always know the encoding and use the correct one, which makes perfect sense if you're the one controlling the data. But you and I are in a situation where the user gets to pick, it will vary, and I don't think they can tell us which one they used (they don't know). I haven't solved this problem, but I assume that we have to detect the encoding.
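One way to detect it might look like the sketch below. This is a heuristic I have not battle-tested, not a solved problem: try a strict UTF-8 decode first and, if the bytes are malformed, assume Windows-1252 because the text probably came from Word. The function name decode_pasted_text is made up for the example, and the fallback encoding is an assumption about where the text came from.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn raw pasted bytes into Perl characters.
# Heuristic, not a guarantee: try strict UTF-8, else assume Windows-1252.
sub decode_pasted_text {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed UTF-8 instead of patching it.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;

    # Fallback assumption: the text came from Word on Windows.
    return (decode('cp1252', $bytes), 'cp1252');
}

# The same quoted word, once as UTF-8 bytes and once as Windows-1252 bytes.
for my $sample ("\xE2\x80\x9CHi\xE2\x80\x9D", "\x93Hi\x94") {
    my ($text, $guess) = decode_pasted_text($sample);
    printf "%-6s -> %d characters\n", $guess, length $text;
}
```

Plain ASCII is valid UTF-8, so it takes the first branch. Short Latin1 text can occasionally happen to be valid UTF-8 as well, which is why I would treat the result as a guess.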

Anyway, in case it helps you, I have included a long list of regex replacements I developed (in Perl) to replace problematic characters. These regexes look for hex codes (for example, 0x85 is Word's ellipsis character) and change them into an approximate (sometimes very approximate) string that can be represented in ASCII. The list covers all the characters that I thought I might encounter from Word. It might be too specific to Word for Windows using English, and it fails for some files with an unexpected encoding. Let me know if the regex is unclear (I think Perl and PHP share a regex syntax, but the syntax to perform a regex differs?).

I was actually hoping someone would chime in and describe an easier way to deal with this issue. I hope this is not the best way to fix this problem.

-Alan

# fix HTML entities

$string =~ s/&#821[67];/'/gsm; # single quote
$string =~ s/&#821[12];/--/gsm; # en and em dashes
$string =~ s/&#821[89];/'/gsm; # low and reversed single quotes
$string =~ s/&#822[0123];/"/gsm; # double quotes
$string =~ s/&#8226;/*/gsm; # bullet
$string =~ s/&#8230;/.../gsm; # ellipsis

# fix Windows-1252 punctuation (0x80-0x9F; these bytes are control codes in true ISO-8859-1)

$string =~ s/\x85/.../gsm; # ellipsis
$string =~ s/[\x91\x92]/'/gsm; # single quotes
$string =~ s/[\x93\x94]/"/gsm; # double quotes
$string =~ s/\x95/*/gsm;  # bullet
$string =~ s/[\x96\x97]/--/gsm; # dashes

# fix extended Latin 1/ISO-8859-1 characters (0xA0-0xFF)

$string =~ s/\xa0/ /gsm;  # non-breaking space
$string =~ s/\xa1/!/g;  # inverted exclamation mark
$string =~ s/\xa2//g;  # cent sign
$string =~ s/\xa3//g;  # pound sign
$string =~ s/\xa4//g;  # currency sign
$string =~ s/\xa5//g;  # yen/yuan sign
$string =~ s/\xa6/\|/g;  # broken bar (becomes a pipe)
$string =~ s/\xa7//g;  # section sign
$string =~ s/\xa8//g;  # diaeresis
$string =~ s/\xa9/(c)/g;  # copyright sign
$string =~ s/\xaa//g;  # feminine ordinal indicator
$string =~ s/\xab/"/g;  # left-pointing double angle quotation mark
$string =~ s/\xac/-/g;  # not sign
$string =~ s/\xad//g;  # soft hyphen = discretionary hyphen
$string =~ s/\xae/(R)/g;  # registered sign = registered trade mark sign
$string =~ s/\xaf/-/g;  # macron = spacing macron = overline = APL overbar

$string =~ s/\xb0//g;  # degree sign
$string =~ s/\xb1/+-/g;  # plus or minus
$string =~ s/\xb2/^2/g;  # squared
$string =~ s/\xb3/^3/g;  # cubed
$string =~ s/\xb4/'/g;  # acute accent
$string =~ s/\xb5/u/g;  # micro sign
$string =~ s/\xb6//g;  # paragraph sign
$string =~ s/\xb7/./g;  # middle dot
$string =~ s/\xb8/,/g;  # cedilla
$string =~ s/\xb9/^1/g;  # superscript 1
$string =~ s/\xba//g;  # masculine ordinal indicator
$string =~ s/\xbb/"/g;  # right-pointing double angle quotation mark
$string =~ s/\xbc/1\/4/g;  # fraction one quarter
$string =~ s/\xbd/1\/2/g;  # fraction one half
$string =~ s/\xbe/3\/4/g;  # fraction three quarters
$string =~ s/\xbf/?/g;  # inverted question mark

$string =~ s/[\xc0\xc1\xc2\xc3\xc4\xc5]/A/g;  # various A's
$string =~ s/\xc6/AE/g;  # capital ligature AE
$string =~ s/\xc7/C/g;  # capital letter C with cedilla
$string =~ s/[\xc8\xc9\xca\xcb]/E/g;  # various E's
$string =~ s/[\xcc\xcd\xce\xcf]/I/g;  # various I's

$string =~ s/\xd0/D/g;  # capital eth
$string =~ s/\xd1/N/g;  # N with tilde
$string =~ s/[\xd2\xd3\xd4\xd5\xd6\xd8]/O/g;  # various O's
$string =~ s/\xd7/x/g;  # multiplication sign
$string =~ s/[\xd9\xda\xdb\xdc]/U/g;  # various U's
$string =~ s/\xdd/Y/g;  # Y with acute
$string =~ s/\xde/Th/g;  # capital THORN
$string =~ s/\xdf/ss/g;  # (German) eszett / sharp s

$string =~ s/[\xe0\xe1\xe2\xe3\xe4\xe5]/a/g;  # a with various decorations
$string =~ s/\xe6/ae/g;  # small ligature ae
$string =~ s/\xe7/c/g;  # small letter c with cedilla
$string =~ s/[\xe8\xe9\xea\xeb]/e/g;  # e with various decorations
$string =~ s/[\xec\xed\xee\xef]/i/g;  # i with various decorations

$string =~ s/\xf0/./g;  # small letter eth (not in Roman alphabet)
$string =~ s/\xf1/n/g;  # small letter n with tilde
$string =~ s/[\xf2\xf3\xf4\xf5\xf6\xf8]/o/g;  # o with various decorations
$string =~ s/\xf7/\//g;  # division symbol
$string =~ s/[\xf9\xfa\xfb\xfc]/u/g;  # u with various decorations
$string =~ s/[\xfd\xff]/y/g;  # y with various decorations
$string =~ s/\xfe/./g;  # small letter thorn (not in Roman alphabet)
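A possibly shorter route, if the text has already been decoded into Perl characters rather than raw bytes, is sketched below. It is only a sketch: it uses the core Unicode::Normalize module to split accented letters into a base letter plus a combining mark, then drops the marks. The %ascii_for table and the to_ascii name are made up for the example, and the table covers only a handful of punctuation characters. Letters that do not decompose (eth, thorn, sharp s, the AE ligature) would need their own table entries or they are simply dropped.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Punctuation with no decomposition, mapped to an ASCII approximation.
my %ascii_for = (
    "\x{2018}" => "'",  "\x{2019}" => "'",    # single smart quotes
    "\x{201C}" => '"',  "\x{201D}" => '"',    # double smart quotes
    "\x{2013}" => '--', "\x{2014}" => '--',   # en and em dashes
    "\x{2022}" => '*',                        # bullet
    "\x{2026}" => '...',                      # ellipsis
    "\x{00A0}" => ' ',                        # non-breaking space
);
my $punct = join '', keys %ascii_for;

# Hypothetical helper: expects decoded characters, not raw bytes.
sub to_ascii {
    my ($text) = @_;
    $text =~ s/([$punct])/$ascii_for{$1}/g;   # table-driven punctuation
    $text = NFD($text);                       # e-acute becomes "e" + combining accent
    $text =~ s/\p{Mn}//g;                     # drop the combining marks
    $text =~ s/[^\x00-\x7F]//g;               # drop anything still outside ASCII
    return $text;
}

print to_ascii("\x{201C}resum\x{E9}\x{201D}\x{2026}"), "\n";
```

This does not remove the need to know the encoding, since the bytes still have to be decoded correctly before the function sees them.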


-- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

http://www.alanmead.org


Keep away from people who try to belittle your ambitions. Small
people always do that, but the really great make you feel that
you, too, can become great.

-- Mark Twain


