On Fri, 21 Jan 2005, Martijn van Oosterhout wrote:
On Fri, Jan 21, 2005 at 12:02:09PM +0100, Marco Colombo wrote:
> On Fri, 21 Jan 2005, Greg Stark wrote:
> > I don't think it's reasonable for pg_dump to think about converting data from one language to another. It's important for pg_dump to restore an identical database. Having it start with special-case data conversion from one flavour to another seems too dangerous.
Makes no sense. pg_dump already makes a lot of conversions: from internal representation (which may be platform dependent) to some common format, say text. It's just multi-line text which is hard to deal with, because there is _no_ single format for it. pg_dump may just choose one format and stick with it. Every dump/restore will work. You may have trouble editing a text dump, but that's another matter. BTW, what does pg_dump do on Windows? I mean with -F p. Does it produce a text file with CRLF line separators? What happens if you feed that file to psql on a Unix box?
Ah, but you see, looking at it from your point of view, pg_dump doesn't interpret text strings. For example, the python script in a function is an opaque string. Not multiline, nothing. All PostgreSQL does is pass that block of opaque data to the interpreter for that language. pg_dump dumps that opaque data into the output, and the CREATE FUNCTION dumps that opaque data back into the system tables. PostgreSQL doesn't understand python any more or less than perl, tcl, R or any other language.
I was referring to psql output in general. E.g. (comments stripped):

CREATE TABLE t2 (
    f1 text
);

COPY t2 (f1) FROM stdin;
test1
test2
test3
\.
This dump, produced on Unix, will have lines separated by \n. What does the same dump produced on Windows look like? If it's \n separated, it's not (natively) editable on Windows. Which is fine by me; we'd just be defining pg_dump's textual output to be \n terminated, always. Or it's \r\n terminated; if so, what happens when you restore it on a Unix box (with psql -f)? Now, if the data itself contains a \r, I believe it shows up escaped. Whether intended or not, that's the only thing that saves us (note that on Unix there's no need to escape a bare \r).
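For illustration, COPY's text format escapes backslashes and control characters in the data, which is what keeps a literal carriage return unambiguous in the dump regardless of the reader's line-ending convention. A minimal sketch of that escaping rule in Python (`copy_escape` is a made-up helper name, not part of PostgreSQL):

```python
def copy_escape(field: str) -> str:
    """Sketch of COPY text-format escaping for the characters that
    would otherwise collide with row/field delimiters.

    Backslash must be handled first, or it would double-escape the
    sequences produced by the later replacements.
    """
    return (field.replace('\\', '\\\\')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r')
                 .replace('\t', '\\t'))

# A bare carriage return in the data is dumped as backslash-r,
# so the dump file itself never contains a raw \r inside a field:
print(copy_escape('line1\rline2'))
```

This is why a \r embedded in table data survives a dump/restore even between platforms: it never appears as a raw line-ending byte in the dump.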
The argument here is that this opaque data basically has different meanings for Python on Windows and Python on Unix. You can't make any special cases, because if I rename plperl.so to plpython.so (or vice versa), the opaque data won't be passed to the interpreter you'd expect from looking at the definition.
I'm for defining a format used by PostgreSQL, and force the python parser into accepting it on all platforms. That is, let's set the rule that python programs to be embedded into PostgreSQL use \n as line termination.
Wouldn't that disadvantage non-Unix pl/python users, whose python functions would have to be converted at run-time to conform to the local text format? With the extra bummer that the resulting string may not be the same size either. Remember, PostgreSQL uses the standard shared library for the language on the platform; it doesn't build its own. But sure, preprocessing the source at run-time seems to be the only realistic solution without a change to the interpreter.
Yeah. My favourite solution is to convert the string to the platform format before passing it to the parser. See the martian example.
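That conversion step can be sketched in a few lines, assuming we standardise on \n in the dump and translate to the platform convention just before handing the source to the interpreter (`to_platform_text` is a hypothetical name, and using `os.linesep` as the default target is my assumption):

```python
import os

def to_platform_text(source: str, linesep: str = os.linesep) -> str:
    """Normalise any mix of \r\n / \r / \n to bare \n first,
    then expand to the target platform's line separator."""
    normalised = source.replace('\r\n', '\n').replace('\r', '\n')
    return normalised.replace('\n', linesep)

# e.g. before passing a pl/python function body to the interpreter
# on a CRLF platform:
body = 'def f():\n    return 1\n'
print(repr(to_platform_text(body, linesep='\r\n')))
```

Note the two-pass approach: collapsing \r\n before bare \r avoids turning one CRLF pair into two separators.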
Think of this: tomorrow we meet people from Mars. One of them really likes PostgreSQL, and ports it to their platform. Being a martian platform, it uses a different text file format. Line separator there is the first 1000
<snip>
Spurious argument. You're assuming Martians would use ASCII to write programs without using one of the two defined line-ending characters. If they were smart they'd simply use a character set which doesn't have the ambiguity. If they even use 8-bit bytes. An ASCII C compiler won't compile EBCDIC source code either, but nobody thinks that's unreasonable, probably because nobody uses EBCDIC anymore :).
You missed the point. Charset has nothing to do with the issue. While you can handle both at the same time, they are unrelated. The line separator is not dictated by the charset, only by the platform. \r\n or \n or \r for line termination is _not_ defined by ASCII. The _same_ ASCII text file looks different when viewed in binary mode on various platforms. The point was: what if someone introduces another platform with yet another line-termination standard? It's unlikely, just like martians. But it makes you realize that conversion is the job of the software that handles inter-platform communication (much like FTP).
No-one is complaining about the use of line-ending characters; they could have said that you need a semi-colon to separate "lines". The problem is that it's *not consistent* across platforms.
Have a nice day,
What about C? How about fopen("afile", "r") in C? Is it "portable"? Or should you use fopen("afile", "rb")? Define "consistent across platforms" here. If you use "rb", your program will be consistent in that, given the same _binary_ input, it produces the same _binary_ output. But if it's supposed to handle text files, it will fail. That is, it's consistent if it's supposed to handle binary data, and inconsistent if it's supposed to handle text files. If you use "r", it's the opposite. No matter what, your program will never be completely consistent! You have to decide whether it handles text files _or_ binary data (unless you do runtime detection, of course - but that's another matter. Under Windows you can assume a .txt file is "text"; under Unix things are not that simple).
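The same distinction exists outside C. Python 3, for instance, makes it explicit in the open mode: text mode translates line endings (universal newlines), binary mode gives you the bytes on disk. A small self-contained demo using a temp file:

```python
import os
import tempfile

# Write raw bytes containing a Windows-style line ending.
raw = b'line one\r\nline two\n'
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

# "rb": exactly the bytes on disk, \r included -- consistent for
# binary data, wrong if you wanted "lines".
with open(path, 'rb') as f:
    binary = f.read()

# "r": universal newlines translate \r\n (and bare \r) to \n --
# consistent for text, wrong if the file was really binary data.
with open(path, 'r') as f:
    text = f.read()

print(binary)   # raw bytes, \r\n preserved
print(repr(text))
os.remove(path)
```

Exactly as with fopen's "r" vs "rb", neither mode is "more consistent" in the abstract; each is consistent only with respect to one kind of data.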
Think of the meaning of '$' in a regular expression. What (binary) character(s) does it match? I expect it to match \n under Unix and the sequence \r\n under Windows. What is the usage scope of '$'? Multiline text. If you look at the data you're using it on as _binary_ data, its behaviour is inconsistent.
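Python's re module shows this concretely: '$' knows about \n but not \r, so the same pattern behaves differently on CRLF data unless you handle the \r yourself (a small demo, not tied to any platform):

```python
import re

unix_line = 'hello\n'
dos_line = 'hello\r\n'

# '$' matches just before a trailing \n, so this works on Unix data...
print(bool(re.search(r'hello$', unix_line)))    # True
# ...but on CRLF data the \r sits between 'hello' and the \n:
print(bool(re.search(r'hello$', dos_line)))     # False
# Treating the data as text means handling the separator explicitly:
print(bool(re.search(r'hello\r?$', dos_line)))  # True
```

The pattern is only "consistent" once you've decided which line-ending convention the input uses - which is exactly the point.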
Face it: _every_ time you're handling multiline text data, you should know in advance what separator it uses. If handling includes moving it across platforms, you should take care of conversion _before_ you pass it to an external program that expects textual input.
Try and read the binmode() entry in the Perl manual. In particular:

    "For the sake of portability it is a good idea to always
     use it when appropriate, and to never use it when it isn't
     appropriate."
That is, you should be well aware of the type of data you're handling, and handle it correctly. Burying your head in the sand and saying "well, I treat it as binary opaque data, so I'm fine" is calling for problems, especially when you're moving it across platforms. Otherwise, you _define_ it to be binary data (and users may have problems reading it as text).
.TM.
-- 
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@xxxxxx