Re: Transitioning to a SQL db

"Karl O. Pinc" <kop@xxxxxxxx> · Fri, 8 Apr 2016 10:04:33 -0500

Hi Tim,

As arranged I am cc-ing the pgsql-general list in the hope
they will assist.  Your posts to the list may be delayed for
moderation, I can't say.

It could be helpful if you subscribed to the list, but it
is relatively high traffic and I know you have extremely limited
and expensive bandwidth.  For this reason I'm not sure I can
recommend subscribing.

On Thu, 7 Apr 2016 12:22:26 +0200
Tim Vink <timvink@xxxxxxxxx> wrote:

> Many thanks for your kind offer to give us some advice. I am Tim Vink,
> Research Techician at the Kalahari Research Trust and currently
> Database and Networks manager of the project. Our Project leaders,
> Chris and Laura at the Meerkat project mainly involved with the main
> Meerkat database that is currently in access, where Chris is helping
> me develop the additional databases (that are currently loose db
> and/or files (read up to 80000 csv or mapsource files) and make a
> coherent structure for these.
> 
> We are in the starting position and started off with MySQL using the
> Percona Server variant for more advanced / easy replication and
> databackups, would you recommend to move to PostGreSQL, what would be
> our main advantages over MySQL?

I no longer keep up with the MySQL feature set, and can't be
considered an expert in the overall status of databases
in the FOSS world.  But I am not entirely disconnected either
so should be of some help.

I can't give you a point-by-point comparison of MySQL and Postgres.
Overall, the difference is in approach.  The goal of Postgres
is to be SQL standards compliant, to be well designed,
to be ACID compliant, and to be reliable.  

The attention to design is most significant.  MySQL was written
as glue, layering an SQL interface on top of a number of underlying
database engines.  And they were looking for speed.  ACID compliance
was something that came later.  As a result, in MySQL there are
myriad corner cases and rough edges.  Postgres does not have these.
This really starts to matter for people who don't spend all their
time living cuddled up to their database and getting to know
all it's quirks.

Postgres takes the time to ensure each new feature is "clean".
Although the emphasis is not on performance, this also includes
performance.  Again reliability is paramount.  Postgres goes to
great lengths to ensure that the data in the database is
never corrupted.  I already mentioned that it won't restore
a database dump unless the resulting referential integrity is intact.

A few minor examples:  PG distinguishes
between an empty string and NULL, the indeterminate data value.
Most other dbs represent the empty string as a string containing
a single space, and there is no distinction between an empty
string and a single space.  You can, with care, construct views in 
PG that act in every way like regular tables, they can be updated,
inserted into, deleted from, etc.  This is a great boon when
users, such as in your case, work directly with the db.  The
PG SQL variant is a "clean", "typical" variant -- generally tracking
Oracle.  (The SQL standard is huge and awful and allows
just about anything that any major vendor wanted to cram in.)
There are no wierdo ways to write SQL that supply hints as to
how to optimize the query, the PG query optimizer does the
right thing for you.  PG has nothing like the variation
in SQL case sensitivity depending on underlying OS platform
like you find in MySQL.  In PG transactions apply everywhere,
even to meta-data like table creation.  This is unusual,
and very useful when making test alterations to a test
database as it eliminates error-prone cleanup on failure
and consequential lack of synchronization between test and
production databases.

The PG "window functions" (see the PG docs) are also incredibly
powerful for data analysis.  As is the ability to embed
R (r-project.org) into Postgres.  (Although embedding R
is dangerous from a security standpoint since it's then
reasonably impossible to prevent a PG user from writing
arbitrary files to disk.  This breaks the barrier between
db access and OS access.)

You may also be interested in PostGIS for geospatial
integration.  (postgis.org)
It is "way cool".

Regards Percona Server, it's FAQ says it tracks Oracle's MySQL.
I see the open source community moving away from Oracle's
MySQL to MariaDB and this could be a long-term concern.

> I take great example in your babase and ambaselli baboon project work.
> Where we have started (still very very much under development) a
> meerkat wiki. (meerkat.kalahariresearch.org)

You may also want to look at gombemi.ccas.gwu.edu for some
work I did for some of the Jane Goodall folk.  It has
some more advanced, from the standpoint of program
internals, web-based tools.

We use 2 idioms extensively.  The first is batch upload
from csv files.  This makes converting data to electronic
form, usually via MS Excel, a low-skill task.
We upload with custom programs, and with
generic uploads to tables and views.  We have a generic upload
program that ensures you get error messages for each line uploaded.
We also have a wwwdiff program for paranoid data checking.
(Have 2 people enter the data, independently, and then compare the
result before upload to find typos.)  Since our users interact with the
database directly our data integrity is checked via an
extensive set of triggers, ensuring that common data errors
never get in the db.   (Of course there can always be
typos in entered data.)

Our users use a data maintenance idiom for SQL manipulation
that looks like: begin transaction, make changes, run some queries
to test changes, rollback transaction.   The
SQL is submitted in a single "hunk" to the db. This is repeated until 
the data looks right and then the final rollback is changed
to a commit.  This idiom presents problems with some GUI front-ends.
We had to modify phpPgAdmin to support it.  You might want
to look at adminer (www.adminer.org), which does support the
idiom.  (Coming from MS Access you may also find the adminer
graphical GUI query builder attractive.  As a rule I find
these sorts of things more dangerous than not -- users
tend to write queries they don't understand -- but there's
surely good potential.)

In general, there's no comprehensive web based solution for
PG that "does everything".  E.g.  Most tools will want to stop
after the first error on data upload.  Some will refuse
to upload into views.  Etc.  You can try using the Babase 
tools if you like.  I have aspirations to build something
better for generic data upload/download but they remain in
the planning stage for the forseeable future.  (See also:
https://wiki.postgresql.org/wiki/Community_Guide_to_PostgreSQL_GUI_Tools
)

You are of course welcome to our hacked phpPgAdmin.

If you get a chance (and use PG) you should probably utilize
the Gombe-MI/Babase warning system.  (I should really
package this separately, somehow.)  It's a batch oriented
data integrity checking tool based on user-supplied
SQL queries that look for trouble.  See the on-line docs.

As far as a web stack goes I gravitate toward the following:
Linux, Nginx, PostgreSQL, Python, Pyramid, SQLAlchemy, Mako, Bootstrap,
Bootswatch.  I used to work in PHP, but that was more because
there used to be few alternatives regards scripting for the
web.

As an FYI, I find Debian to be more stable than Ubuntu.  It
also integrates very nicely with PG.  The official PG repos
are one of the very very few non-Debian repos which are
recommended for inclusion in Debian.

Since you're Internet-limited you may also find rsync
interesting.  (Hardlinked backups rock.)
I have an rsync based backup script you
are welcome to, but it serves my specific needs.  It
is often better to go with a more generic even if less
serviceable tool like rsnapshot.

> I would be happy to have a skype conversation or continue our
> conversation over email / skype chat. 

FYI to the list, we will take our chat to #postgresql
if and when we chat.

Tim, if you do decide to try PG go to #postgresql for help
with the initial configuration.  You'll want to configure
PG so that 40% of RAM (up to 4G?) is shared memory
and used by PG.  And adjust work_mem accordingly.
(And if the box does nothing but PG it's sometimes
useful to configure as much ram as possible as shared
memory.)  It can also be confusing to setup permissions
for database access the first time you do it.

(We give each user his own login, and his own schema
for personal work.  Some users get 2 logins, one
for regular work and one "superuser" login for
special purposes like creating new users.  Each
login is in one of 2 groups.  There's a read-only
group and a read-write group.  These groups
apply to the master, production, schema.
The read-write group does not get to alter
db structure, just alter table content.
People can do whatever they want in their
own schemas.)

To end, since you're coming off of MS Access, you may
not have a lot of experience in the FLOSS world.  One
key to success is picking the right components. Unlike
in the proprietary world, and excepting distros, there are few
one-stop shopping projects.  Individual projects are typically
tailored to doing one thing and doing it well.  You add
more projects into your installed mix to add additional
functionality.  Ask the community to find components.

Regards,

Karl <kop@xxxxxxxx>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general