These seem to work. I'm posting them now before I'm done with them before I start messing them up with specifics to my database.
Any comment is welcome. However, I've no intention of publishing production quality code or documentation, these are my notes and I've used published them here in the event that somebody finds them useful.
Postgresql developers might note the hackery required to accomodate COPY's insertion of escape sequences (and the escape sequences in the syntax of literal strings, be nice to turn off \'s per-database.)
This uses the XBase perl module to convert from .dbf to delimited text files.
-------------------------- file:INSTALL
Postgresql can import delmited text fields (COPY). This means that the field separator and record separator delimters chosen cannot appear in the data. I choose tab as my field separator and newline as my record separator. This may not work with all data.
Use the perl program dbfdump to convert individual files.
If your distribution does not contain dbfdump you can hack an installation of the program in a temporary location by following this procedure:
Obtain the code for the XBase module (DBD-XBase-0.241.tar.gz) from http://www.cpan.org. Place the tarfile in some temporary directory and cd into that directory.
Install the XBase module (see http://www.cpan.org/misc/cpan-faq.html#How_install_private) with: (do not do this as root)
tar -xzf DBD-XBase-0.241.tar.gz (# Install XBase module cd DBD-XBase-0.241 perl Makefile.PL LIB=$(dirname $(pwd)) PREFIX=$(dirname $(pwd)) make make test make install ) (# Customize dbfdump to account for our strange install cd bin foo=foo.$$ head -n 1 dbfdump | sed "s#\$# -I $(dirname $(pwd))#" > $foo tail -n +2 dbfdump >> $foo chmod u+w dbfdump cat $foo > dbfdump chmod u-w dbfdump rm -f $foo )
The XBase version number will vary with the version you have installed.
XBase requires on the perl DBD module. If you don't have that I imagine you can so something similar to install it.
You should now be able to run the dbfdump program by explictly supplying a full pathname. Test this by typing:
yourtempdir/bin/dbfdump --help
When you're done with dbfdump you can uninstall by deleting your temporary directory.
The remainder of this document does not fully qualify the path to the dbfdump program. If you would like to simply cut and paste the code, you can (temporarly) alter your $PATH as follows:
export PATH=$PATH:yourtempdir/bin
----------------- file:dbf2txt
#!/bin/sh # # Convert foxpro files to tab delimited text files # # # Put all your dbf related files into a directory, in this case dbfdir. # Make a directory for converted output, in this case txtdir.
export PATH=$PATH:./xbase_bin
for suffix in DBF dbf ; do for f in $(ls dbfdir/*.$suffix) ; do # Choose a character that doesn't appear in the data as the # field separator. I choose tab. dbfdump -fs $'\t' $f > txtdir/$(basename $f $suffix)txt done done
------------------------------------------------------------- file: loader
#!/bin/sh # load the data from text files
# Needs directories (in the pwd): # dbfdir all the foxpro .dbf files # txtdir creates tab delimited text files here # rejects tab delimited text files of rows that won't convert
# We deal with bad data as follows: # The raw data row is inserted into a temporary table. # Then we start a transaction, insert the data into the # real table and delete the raw data from the temporary table. # If the real data does not insert, then the whole tranaction # is rolled back and the raw data is left in the temporary # table. After completing the load, the bad data is left # in the temporary table and we dump that back out to the # filesystem.
function regular_load () { export table=$1 #total hack, sed command to change old col names to new export colfilter="$2"
if [ -e dbfdir/${table}.dbf ] ; then export t=$table.dbf else export t=$table.DBF fi
# Get the columns and the order in which they appear. export cols="$( ./xbase_bin/dbfdump --info dbfdir/$t \ | awk '/[1-9][0-9]*\./ {printf("%s%s", sep, $2); sep=", "}' )"
if [ -n "$colfilter" ] ; then # Convert the old column names to new column names. cols="$(echo $cols | $colfilter)"; fi
# Get the data types that go with the columns. export types="$( ./xbase_bin/dbfdump --info dbfdir/$t \ | awk '/[1-9][0-9]*\./ {printf("%s%s", sep, $3); sep=" "}' )"
awk -v cols="$cols" -v types="$types" -v table="$table" -v pwd="$(pwd)" '
function fix_escapes(s) {
# Double all backslashes and single quotes.
# See the GNU awk users guide
# http://www.gnu.org/software/gawk/manual/html_node/Gory-Details.html#Gory%20Details
return gensub("\\\\", "\\\\\\\\", "g", gensub("\x27", "\x27\x27", "g", s));
}
BEGIN {split(types, t);
FS = "\t";
# Create the temporary table.
print "CREATE TEMP TABLE rejects(id INT, rawdata TEXT);";
id = 1;
}
{# First insert the raw data into the temp table.
printf("INSERT INTO rejects (id, rawdata) values(%s,\x27%s\x27);\n" \
, id \
, fix_escapes($0));
# Then update the table in a transaction that also removes # the raw row. Either it all succeeds or it does not. print "BEGIN TRANSACTION;"; printf("insert into " table "(" cols ") values ("); sep = ""; for (i = 1; i <= NF; i++) { printf("%s", sep); if ($i == "") { # No data, try inserting a NULL value printf("NULL"); } else { # There is data. if (t[i] != "N") { # Data is not numeric, needs quotes. printf("\x27"); # single quote char }; # Ta Da! The data. # Double all backslashes and single quotes. printf("%s", fix_escapes($i)); if (t[i] != "N") { # Data is not numeric, needs closing quotes. printf("\x27"); # single quote char }; } sep = ", "; }; print ");"; # Delete the raw data from the temp table. printf("DELETE FROM rejects WHERE id = %s;\n", id); id++; print "END TRANSACTION;"; } END {# Dump the rejects back to a file. print "\\copy rejects (rawdata) TO \x27" pwd "/rejects/" \ table ".txt\x27"; }' \ txtdir/$table.txt \ | psql -d example_db -h server.example.com -f - -U example_user
# Convert rejected data from sql COPY TO format back to tab separated
# data so we can fix it and feed it right back to this program.
# (Welcome to escape hell.)
# Note that this assumes that there are no vertical tabs, newlines,
# or other such characters that COPY would mung.
tfile=$table.txt.$$
mv rejects/$table.txt rejects/$tfile
sed 's/\\\\/\\/g' rejects/$tfile | sed 's/\\t/'$'\t''/g' > rejects/$table.txt
rm rejects/$tfile
}
regular_load acts 'sed s/OLD/RETIRED/'
for t in STATUSES BSTATS MSTATS dcauses rnktypes ; do regular_load $t done
Karl <kop@meme.com> Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly