On Sat, 14 Jun 2014, Damir Dezeljin wrote:
This is more a theoretical or better to say, conceptual question; still, I
hope to get some feed backs from you folks. Additionally this is going to
be a very long post :) off-topic: I asked a similar question on the MySQL
forum as I'm still undecided if going with PostgreSQL or MySQL << I'm
tempted at PostGIS.
Postgres. You can link it to spatial attributes with postgis if you want.
But, how to structure your database and tables is independent of the
analytical methods you use.
I am designing a database for storing various biological and ecological
data. Although there is no clear dividing line, it is possible to group
the data into two groups, namely the measured (physical) and quantitative
(mostly biological) data; I uploaded both a data sample and an initial
draft of a DB model to this link
<https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.
Of course there's a clear division between biological, chemical, physical,
geomorphical, and other data.
I may be wrong, but this looks like a question from a student for a
project or master's degree.
*Biological / quantitative data*
You have many choices: EPA's STORET or EMAP projects, ITIS (International
Taxonomic Identification System), or your own. Each type of biological data
should have its own table; e.g., vegegation, mammals, birds, fish, benthic
macroinvertebrates, microbes.
Each table should have a compound key: site_id, sample_date, species. This
makes each row unique, when possible. Otherwise. use a sequential key.
The table I use for fish data has this schema:
Table "public.fish"
Column | Type | Modifiers
-------------+-----------------------+-----------
site | character varying(12) | not null
sampdate | date | not null
tclass | character(13) | not null
torder | character varying(16) | not null
tfamily | character varying(12) | not null
tgenus | character varying(12) | not null
tspecies | character varying(12) | not null
tsubspecies | character varying(12) |
common_name | character varying(32) |
quant | integer | not null
comments | text |
basin | character varying(10) |
stream | character varying(20) |
Indexes:
"fish_pkey" PRIMARY KEY, btree (site, sampdate, tspecies)
The table I use for benthos has this schema:
Table "public.benthos"
Column | Type | Modifiers
---------------+-----------------------+-----------------------------------
------------------------
sampid | integer | not null default nextval('macroinv
_sampid_seq'::regclass)
site | character varying(12) | not null
sampdate | date | not null
tclass | character varying(20) | not null
torder | character varying(32) | not null
tfamily | character varying(50) |
tgenus | character varying(32) |
tspecies | character varying(20) |
func_feed_grp | character varying(32) |
quant | integer | not null
comments | text |
stream | character varying(20) |
basin | character varying(10) |
Indexes:
"macroinv_pkey" PRIMARY KEY, btree (sampid)
*Measured / physical data*
- Physical quantities may be outside the detection range of the measured
instrument; in such a case, this needs to be recorded. I still do not have
a clear idea how to do it. NULL’s do not seem to be a good choice to mark
such data.
Here, too, you have choices. You can incorrectly record and analyze
chemical data below detection limits like EPA, states, and most regulatory
permit holders do, or you can do it correctly. This is a schema I use for
water quality data:
Table "public.surfchem"
Column | Type | Modifiers
----------+-----------------------+-----------
site | character varying(20) | not null
sampdate | date | not null
param | character varying(32) | not null
quant | real |
ceneq1 | integer |
floor | real |
ceiling | real |
stream | character varying(20) |
basin | character varying(10) |
Indexes:
"surfchem_pkey" PRIMARY KEY, btree (site, sampdate, param)
The key is the set (site, sampdate, param) so you can have multiple
samples, each of a different chemical constituent, at the same location and
date. (This works for air, soil, and other media chemistry, too). The
concentrations must be in the same units (e.g., mg/L). The column named
'ceneq1' is a flag: when the quant value is below the analytical laboratory's
reporting limit, set the flag to '1'; otherwise 0. Now you have a consistent
way to identify rows with less-than/non-detect/censored values. Since
reporting levels change as instruments become more sensitive and analysts
become better trained, you can have multiple reporting limits in your data.
That's where the 'floor' and 'ceiling' columns come in.
This should get you started.
Rich
--
Richard B. Shepard, Ph.D.
Applied Ecosystem Services, Inc. | Troutdale, OR 97060 USA
www.appl-ecosys.com Voice: 503-667-4517 Fax: 503-667-8863