I have to deal with the following situation. I get a daily feed of data for several tables. This data is retrieved from the supplier by FTP in the form of CSV text files. I know I can load the files using LOAD DATA. The relational structure between the files and tables is simple: one file contains data for a lookup table. But the data in the lookup table is dynamic, potentially changing from one day to the next (though in practice, the rate of change is likely to be very slow).

I will also need to maintain a history in such a way that any and all changes in the data can be fully audited. This means that, for auditing purposes, I will need to be able to extract the valid time for each item in the lookup table. The only likely change in the data is that occasionally items will be added to or removed from the lookup table. None of the data already present in the database will be editable (data validation happens before I get the feed). The only analysis of which I have been told is done daily, as the feed is received. At present, no one looks at earlier data (although I expect that will change as
I am trying to create a flexible design so that
refactoring will be simple if and when the assumptions or practices need to be
changed.
I know how to handle all this using a brute-force approach, but I expect that approach will be painfully slow. So here is what I am planning:
1) create a suite of tables corresponding to the files in the feed, with the addition of a date/time stamp recording when the data being processed was received.
2) create a suite of temporary tables corresponding to the tables created in step 1 (but without the date/time stamp).
3) load the data into the temporary tables.
4) analyse the data while it is in the temporary tables, storing the results of the analysis in new tables.
5) copy the data into the permanent tables, adding the date/time stamp for the batch (this stamp is not present in the files retrieved; see the sketch after this list).
6) drop the temporary tables.
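To make that concrete, here is a rough sketch for one of the files; the table, column, and file names are placeholders of my own, and I am writing MySQL syntax since that is what LOAD DATA implies:

    -- Step 1: permanent table, with the received-at stamp added
    CREATE TABLE product_history (
        product_id   INTEGER      NOT NULL,
        product_name VARCHAR(100) NOT NULL,
        received_at  DATETIME     NOT NULL
    );

    -- Step 2: staging table, same shape minus the stamp
    CREATE TEMPORARY TABLE stg_product (
        product_id   INTEGER      NOT NULL,
        product_name VARCHAR(100) NOT NULL
    );

    -- Step 3: bulk-load today's file
    LOAD DATA INFILE '/path/to/feed/product.csv'
    INTO TABLE stg_product
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

    -- Step 5: copy into the permanent table, stamping the batch
    INSERT INTO product_history (product_id, product_name, received_at)
    SELECT product_id, product_name, NOW()
    FROM stg_product;

    -- Step 6: explicit cleanup (temporary tables also vanish
    -- when the session ends)
    DROP TEMPORARY TABLE stg_product;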
Now, the second-last step (step 5) is brute force, adequate for all but one of the tables: the lookup table. If I stick with the brute-force approach, the lookup table will waste a significant amount of space. This won't be much initially, but it is guaranteed to get worse as time passes, and I'd expect lookup performance to degrade as the amount of data in the lookup table increases.
Each record in the lookup table represents a product, and both the name and the product itself have valid-time intervals that may not be related. The name may change because the product has been renamed for whatever reason (and we don't care why), or the name may disappear altogether because the product has been discontinued. We can distinguish the two cases because each product has an ID that remains valid while the product exists, and the ID won't be in the data at all if the product is discontinued.
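For example, during step 4, before today's batch has been copied into the permanent table (so the latest received_at in product_history is still yesterday's), I imagine detecting discontinued products with an anti-join along these lines:

    -- IDs in yesterday's batch that are absent from today's feed
    SELECT h.product_id
    FROM product_history h
    LEFT JOIN stg_product s ON s.product_id = h.product_id
    WHERE h.received_at = (SELECT MAX(received_at) FROM product_history)
      AND s.product_id IS NULL;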
I am considering creating an additional table just for looking up product names, but with two date/time stamps. The first would represent the first time the product name appears in the data, and the second would represent the last time the product name is present in the data. The first of these would be written only once, on the first day for which we have data for that name. I am torn between updating the second every day until the name disappears, or leaving it NULL until the name disappears. Leaving it NULL would save space, but updating it with the current date and time should save time, since it would not be necessary to execute a complex conditional on every product ever included in the database. If we update it only for those items in today's data, those that were terminated before today will not have their date/time stamp updated, so the two fields will always represent the time interval for which the name is valid. A similar logic applies to the product ID.
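Concretely, I picture something like the following; the names are again placeholders, and the daily refresh assumes the staging table from the earlier sketch:

    -- Name lookup with a valid-time interval per (product, name) pair
    CREATE TABLE product_name (
        product_id INTEGER      NOT NULL,
        name       VARCHAR(100) NOT NULL,
        first_seen DATETIME     NOT NULL,
        last_seen  DATETIME     NOT NULL,
        PRIMARY KEY (product_id, name)
    );

    -- Daily refresh: touch last_seen for every pair in today's feed
    UPDATE product_name pn
    JOIN stg_product s
      ON s.product_id = pn.product_id AND s.product_name = pn.name
    SET pn.last_seen = NOW();

    -- ...and open an interval for any pair seen for the first time
    INSERT INTO product_name (product_id, name, first_seen, last_seen)
    SELECT s.product_id, s.product_name, NOW(), NOW()
    FROM stg_product s
    LEFT JOIN product_name pn
      ON pn.product_id = s.product_id AND pn.name = s.product_name
    WHERE pn.product_id IS NULL;

Note that the multi-table UPDATE above is a MySQL extension rather than ANSI SQL (the standard form would use a correlated subquery), which is part of what prompts my question about stored procedures below.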
Have I overlooked anything that is obvious to you?
Any gotchas I should be aware of?
What opportunities for improving performance do you
see?
The ultimate objective is to have this fully automated from a shell script that is configured by our sysop to execute at a specific time every day, Monday through Friday. Can all of the steps I describe above be implemented using ANSI standard SQL, and thus put into a stored procedure, or should I look at doing some of it in a Perl script or Java application? I suppose I will need at least a basic Java application or Perl script, if only to connect to the database and invoke any functions I have created to do this.
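For what it's worth, the driver I picture is little more than a SQL script piped into the command-line client by the cron job; the procedure names below are hypothetical, and I believe MySQL does not allow LOAD DATA inside a stored procedure, which is part of why I raise the question:

    -- daily_feed.sql, run as:  mysql mydb < daily_feed.sql
    CREATE TEMPORARY TABLE stg_product (
        product_id   INTEGER      NOT NULL,
        product_name VARCHAR(100) NOT NULL
    );

    -- issued here because LOAD DATA cannot live in a procedure
    LOAD DATA INFILE '/path/to/feed/product.csv'
    INTO TABLE stg_product
    FIELDS TERMINATED BY ',';

    -- steps 4-6 are plain DML, so they could be wrapped in
    -- stored procedures created once, ahead of time
    CALL daily_analysis();
    CALL archive_and_stamp();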
Any suggestions would be appreciated.
Thanks,
Ted
R.E. (Ted) Byers, Ph.D., Ed.D.
R & D Decision Support Solutions
http://www.randddecisionsupportsolutions.com/