----- Original Message -----
From: "Zoltán Németh" <znemeth@xxxxxxxxxxxxxx>
In formal English, it's not allowed to use 've, 'm, etc.; I'm should be
written as I am. So that's not going to work, I think.
But words like "and" are really English, I think :)
Keep in mind that this is quite a hard approach, I think, but I don't have
a better solution.
Just for example, Dutch and Afrikaans are not very different, so it's
really hard to see which of the two the text is written in.
Tijnema
PS. If you can't tell the difference between Dutch and Afrikaans, guess
Dutch :) It's much more widely used than Afrikaans.
Yeah, looking for very frequently used words seems a better idea.
greets
Zoltán Németh
In Spanish, as happens with many languages that use diacritical marks, you
often skip them in informal chatting. This has a long tradition on the
internet, since years ago support for those extra characters was
non-existent, and today it is still somewhat patchy. I used to have two
modes of writing in Spanish: formal writing, with all the proper accents,
tildes and umlauts, and email mode, without any of those. Nowadays, with
support for languages using the Roman alphabet widely available, there is
no need to omit diacritical marks, but you will often find them missing,
particularly in blog comments and other informal writing, just because of
laziness, carelessness or simply lack of formal education, and in that I
include foreigners who more or less handle the language but not the minor
details.
If English had accents, I would probably skip them.
So, using a spelling dictionary is not a good idea unless you can count on
your input being properly written. A text in Spanish with its accents
missing will give you lots of errors, and we use just one sort of accent
(acute) plus tilde and umlaut. The French use three sorts of accents, so
there is a far higher chance of getting misspellings. I don't know how
abundant accents are in Magyar; for me Zoltan Nemeth is the same as
Zoltán Németh, but the first is a misspelling.
This problem also affects the frequency of individual letters. Should you
first convert accented vowels to their plain versions? If you find
accented letters, it is a sure sign that the text is not English, but if
there are none, it doesn't mean it is English; it might be some non-English
text without the correct accents. Should you count 'a' and 'á' separately,
or add them together because people often omit the accent?
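
To make the question concrete, here is a rough sketch in Python (not PHP,
and not anything from the thread) that counts letters both ways. The names
strip_accents and letter_counts are made up for this example, and it
assumes that dropping Unicode combining marks is an acceptable way of
"converting to plain"; note that this also turns ñ into n and ü into u.

```python
import unicodedata
from collections import Counter

def strip_accents(text):
    # NFD splits 'á' into 'a' plus a combining acute mark (category Mn);
    # dropping the marks leaves the plain base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def letter_counts(text, fold=False):
    # fold=False keeps 'a' and 'á' apart; fold=True adds them together.
    if fold:
        text = strip_accents(text)
    else:
        # Compose first so an accented letter is one character, not two.
        text = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in text.lower() if ch.isalpha())

sample = "Está más allá"
separate = letter_counts(sample)           # 'á' and 'a' counted apart
merged = letter_counts(sample, fold=True)  # every 'á' counted as 'a'
```

With folding turned on, a carefully accented text and a lazily typed one
give the same letter counts, which is the point of adding them together.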
So, I also vote for the frequently used words approach and against picking
the language with the lowest number of misspellings. And I would first
convert everything to plain, with no accents, both for the needle and the
haystack.
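
A minimal sketch of what I mean, again in Python rather than PHP. The word
lists are tiny and hypothetical, picked only for illustration; a real list
would need many more words per language. COMMON_WORDS, normalise and
guess_language are names invented for this example.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def normalise(text):
    return strip_accents(text).lower()

# Hypothetical, very short lists of frequent words per language.
COMMON_WORDS = {
    "english": ["the", "and", "of", "to", "is", "in", "that", "it"],
    "spanish": ["el", "la", "de", "que", "y", "en", "los", "está", "más"],
    "dutch": ["de", "het", "een", "en", "van", "is", "dat", "niet"],
}

# The needles get the same treatment as the haystack.
NEEDLES = {
    lang: {normalise(word) for word in words}
    for lang, words in COMMON_WORDS.items()
}

def guess_language(text):
    # Split the accent-free, lowercased haystack into plain a-z words.
    words = re.findall(r"[a-z]+", normalise(text))
    scores = {
        lang: sum(word in needles for word in words)
        for lang, needles in NEEDLES.items()
    }
    # A tie, or a text with no hits at all, still returns some language,
    # so check the scores before trusting the guess.
    best = max(scores, key=scores.get)
    return best, scores

guess, scores = guess_language("Esta mas alla de lo que pensaba, y no es la primera vez")
```

The Spanish sample is typed without any accents on purpose, and it still
matches "está" and "más" from the list because both sides were converted
to plain first.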
Satyam
PS: Also, it is accepted practice to omit accents on uppercase letters, such
as in headings. It is not grammatically correct, but it is a typographical
convention which the printing industry has been using for ages: the accents
simply don't fit nicely.
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php