Re: Multiline plpython procedure

On Fri, 21 Jan 2005, Martijn van Oosterhout wrote:

On Fri, Jan 21, 2005 at 12:02:09PM +0100, Marco Colombo wrote:
On Fri, 21 Jan 2005, Greg Stark wrote:
I don't think it's reasonable for pg_dump to think about converting
data from one language to another. It's important for pg_dump to
restore an identical database. Having it start with special-case
data conversion from one flavour to another seems too dangerous.

Makes no sense. pg_dump already makes a lot of conversions: from internal representation (which may be platform dependent) to some common format, say text. It's just multi-line text which is hard to deal with, because there's _no_ single format for it. pg_dump may just choose one format and stick with it. Every dump/restore will work. You may have trouble editing a text dump, but that's another matter. BTW, what does pg_dump do on Windows? I mean with -F p. Does it produce a text file with CRLF line separators? What happens if you feed that file to psql on a Unix box?

Ah, but you see, looking at it from your point of view, pg_dump doesn't interpret text strings. For example, the python script in a function is an opaque string. Not multiline, nothing. All postgresql does is pass that block of opaque data to the interpreter for that language. pg_dump dumps that opaque data into the output, and the CREATE FUNCTION dumps that opaque data back into the system tables. Postgresql doesn't understand python any more or less than perl, tcl, R or any other language.

I was referring to psql output in general. E.g. (comments stripped):

CREATE TABLE t2 (
    f1 text
);

COPY t2 (f1) FROM stdin;
test1
test2
test3
\.

This dump, produced on Unix, will have lines separated by \n. What does the
same dump produced on Windows look like? If it's \n separated, it's not
editable (natively) on Windows. Which is fine by me; we've just defined
pg_dump's textual output to be \n terminated, always. Or it's \r\n
terminated; if so, how does restoring it on a Unix box (with psql -f) work?
Now, if the data contains a \r, I think it shows up escaped. Whether
intended or not, that's the only thing that saves us (note that there's no
need to escape a bare \r on Unix).
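For what it's worth, this escaping is what makes the dump round-trip safely: COPY's text format turns control characters into backslash sequences, so a \r in the data occupies one physical line no matter the platform. A minimal sketch of those escaping rules (copy_escape is just an illustrative name, not a real API):

```python
def copy_escape(value: str) -> str:
    """Escape a text value roughly the way COPY ... TO stdout does:
    backslash first, then newline, carriage return and tab."""
    return (value.replace('\\', '\\\\')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r')
                 .replace('\t', '\\t'))

# A value spanning two "lines" becomes one physical dump line:
print(copy_escape('line1\r\nline2'))  # line1\r\nline2 (with literal backslashes)
```

On restore, COPY applies the inverse mapping, so the byte content of the column is preserved exactly even if the dump file's own line endings were converted in transit.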

The argument here is that basically this opaque data has different
meanings for Python on Windows and Python on Unix. You can't make any
special cases, because I can rename plperl.so to plpython.so (or
vice-versa) and the opaque data won't be passed to the interpreter
you'd expect from looking at the definition.

I'm for defining a format used by PostgreSQL, and forcing the python parser
into accepting it on all platforms. That is, let's set the rule that
python programs to be embedded into PostgreSQL use \n as line termination.

Wouldn't that disadvantage non-unix pl/python users, whose python functions would have to be converted at run-time to conform to the local text format? With the extra bummer that the resulting string may not be the same size either. Remember, postgresql uses the standard shared library for the language on the platform; it doesn't build its own. But sure, preprocessing the source at run-time seems to be the only realistic solution without a change to the interpreter.

Yeah. My fav. solution is to convert the string to platform format before passing it to the parser. See the martian example.
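A sketch of the conversion I mean, assuming we normalize every line ending to \n before handing the source to the Python parser (normalize_newlines is a hypothetical helper, not anything PostgreSQL ships):

```python
def normalize_newlines(source: str) -> str:
    """Fold CRLF and bare CR into LF, so the parser only ever sees \\n."""
    return source.replace('\r\n', '\n').replace('\r', '\n')

# A function body saved with DOS line endings:
dos_src = "def f():\r\n    return 42\r\n"

ns = {}
exec(compile(normalize_newlines(dos_src), '<plpython>', 'exec'), ns)
print(ns['f']())  # 42
```

The conversion is cheap, idempotent, and platform-independent, which is the point: the stored procedure source stays byte-identical in the catalog, and only the copy handed to the interpreter is adjusted.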

Think of this: tomorrow we meet people from Mars. One of them really likes
PostgreSQL, and ports it to their platform. Being a martian platform, it
uses a different text file format. Line separator there is the first 1000

<snip>

Spurious argument. You're assuming Martians would use ASCII to write
programs without using one of the two defined line-ending characters.
If they were smart they'd simply use a character set which doesn't have
the ambiguity. If they even use 8-bit bytes. An ASCII C compiler won't
compile EBCDIC source code either, but nobody thinks that's
unreasonable, probably because nobody uses EBCDIC anymore :).

You missed the point. Charset has nothing to do with the issue: while you may have to handle both at the same time, they are unrelated. The line separator is not dictated by the charset, only by the platform. \r\n or \n or \r line termination is _not_ defined by ASCII. The _same_ ASCII text file looks different when viewed in binary mode on various platforms. The point was: what if someone introduces another platform with yet another line-termination standard? It's unlikely, just like martians. But it makes you realize that conversion is the job of the software that handles inter-platform communication (much like FTP does).

No-one is complaining about the use of line-ending characters, they
could have said that you need a semi-colon to separate "lines". The
problem is that it's *not consistent* across platforms.

Have a nice day,

What about C? How about fopen("afile", "r") in C? Is it "portable"? Or should you use fopen("afile", "rb")? Define "consistent across platforms" here. If you use "rb", your program is consistent in that, given the same _binary_ input, it produces the same _binary_ output. But if it's supposed to handle text files, it will fail. That is, it is consistent if it is supposed to handle binary data, and inconsistent if it is supposed to handle text files. If you use "r", it's the opposite. No matter what, your program will never be completely consistent: you have to decide whether it handles text files _or_ binary data (unless you do run-time detection, of course, but that's another matter; under Windows you can assume a .txt file is "text", while under Unix things are not that simple).
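The same split is easy to see from Python, whose text mode actually goes one step further than C's "r": it normalizes line endings on every platform, not just Windows. A small demo, nothing PostgreSQL-specific:

```python
import os
import tempfile

# Write raw bytes containing a CRLF and an LF line ending.
with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
    f.write(b'one\r\ntwo\n')
    path = f.name

with open(path, 'rb') as f:   # binary mode: bytes come back verbatim
    raw = f.read()
with open(path, 'r') as f:    # text mode: universal newlines -> '\n'
    text = f.read()
os.remove(path)

print(raw)    # b'one\r\ntwo\n'
print(text)   # 'one\ntwo\n', on every platform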

Think of the meaning of '$' in a regular expression. What (binary)
character(s) does it match? I expect it to match \n under Unix and
the sequence \r\n under Windows. What is the usage scope of '$'?
A multiline text. If you look at the data you're using it on as
_binary_ data, its behaviour is inconsistent.
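A quick demonstration with Python's re module (the exact treatment of '$' varies between regex engines; this just shows that its behaviour depends on which line-ending convention the data follows):

```python
import re

unix_text = 'foo\nbar\n'
dos_text = 'foo\r\nbar\r\n'

# In Python's re, '$' with MULTILINE asserts a position just before '\n'.
# With \r\n data, 'foo' is followed by '\r', so the anchor never fires.
print(bool(re.search(r'^foo$', unix_text, re.MULTILINE)))  # True
print(bool(re.search(r'^foo$', dos_text, re.MULTILINE)))   # False
```

The same pattern, the same "text", two different answers, purely because the engine was written with one line-separator convention in mind.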

Face it, _every_ time you're handling multiline text data, you
should know in advance what separator it uses. If handling includes
moving across platforms, you should take care of conversion, _before_
you pass it to an external program that expects textual input.

Try and read the binmode() entry in the Perl manual. In particular:
    "For the sake of portability it is a good idea to always
     use it when appropriate, and to never use it when it isn't
     appropriate."
That is, you should be well aware of the type of data you're
handling, and handle it correctly. Burying your head in the sand and
saying "well, I treat it as binary opaque data, so I'm fine" is
calling for problems, especially when you're moving it across platforms.


Otherwise, you _define_ it to be binary data (and users may have
problems reading it as text).

.TM.
--
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@xxxxxx

