I have to deal with the following situation. I get a daily feed of data for several tables. This data is retrieved from the supplier by FTP in the form of CSV text files. I know I can load the files using LOAD DATA. The relational structure between the files and tables is simple: one file contains data for a lookup table. But the data in the lookup table is dynamic, potentially changing from one day to the next (though in practice, the rate of change is likely to be very slow).

I will also need to maintain a history in such a way that any and all changes in the data can be fully audited. This means that, for auditing purposes, I will need to be able to extract the valid time for each item in the lookup table. The only likely change in the data is that occasionally items will be added to or removed from the lookup table. None of the data already present in the database will be editable (data validation happens before I get the feed). The only analysis of which I have been told is done daily, as the feed is received. At present, no one looks at earlier data (although I expect that will change as
I am trying to create a flexible design so that
refactoring will be simple if and when the assumptions or practices need to be
changed.
I know how to handle all this using a brute-force approach, but I expect that approach will be painfully slow. So here is what I am planning:
1) create a suite of tables corresponding to the files in the feed, with the addition of a date/time stamp recording when the data being processed was received.
2) create a suite of temporary tables corresponding to the tables created in step 1 (but without the date/time stamp).
3) load the data into the temporary tables.
4) analyse the data while it is in the temporary tables, storing the results of the analysis in new tables.
5) copy the data into the permanent tables, adding the date/time stamp for the batch (this stamp is not present in the files retrieved; see the sketch after this list).
6) drop the temporary tables.
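To make that concrete, here is a rough sketch for one of the files; the table, column, and file names are placeholders of my own, and I am writing MySQL syntax since that is what LOAD DATA implies:

    -- Step 1: permanent table, with the received-at stamp added
    CREATE TABLE product_history (
        product_id   INTEGER      NOT NULL,
        product_name VARCHAR(100) NOT NULL,
        received_at  DATETIME     NOT NULL
    );

    -- Step 2: staging table, same shape minus the stamp
    CREATE TEMPORARY TABLE stg_product (
        product_id   INTEGER      NOT NULL,
        product_name VARCHAR(100) NOT NULL
    );

    -- Step 3: bulk-load today's file
    LOAD DATA INFILE '/path/to/feed/product.csv'
    INTO TABLE stg_product
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

    -- Step 5: copy into the permanent table, stamping the batch
    INSERT INTO product_history (product_id, product_name, received_at)
    SELECT product_id, product_name, NOW()
    FROM stg_product;

    -- Step 6: explicit cleanup (temporary tables also vanish
    -- when the session ends)
    DROP TEMPORARY TABLE stg_product;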
Now, the second-last step (step 5) is brute force, adequate for all but one of the tables: the lookup table. If I stick with the brute-force approach, the lookup table will waste a significant amount of space. This won't be much initially, but it is guaranteed to get worse as time passes, and I'd expect lookup performance to degrade as the amount of data in the lookup table increases.
Each record in the lookup table represents a product, and both the name and the product itself have valid-time intervals that may not be related. The name may change because the product has been renamed for whatever reason (and we don't care why), or the name may disappear altogether because the product has been discontinued. We can distinguish the two cases because each product has an ID that remains valid while the product exists, and the ID won't be in the data at all if the product is discontinued.
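For example, during step 4, before today's batch has been copied into the permanent table (so the latest received_at in product_history is still yesterday's), I imagine detecting discontinued products with an anti-join along these lines:

    -- IDs in yesterday's batch that are absent from today's feed
    SELECT h.product_id
    FROM product_history h
    LEFT JOIN stg_product s ON s.product_id = h.product_id
    WHERE h.received_at = (SELECT MAX(received_at) FROM product_history)
      AND s.product_id IS NULL;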
I am considering creating an additional table just for looking up product names, but with two date/time stamps. The first would represent the first time the product name appears in the data, and the second would represent the last time the product name is present in the data. The first of these would be written only once, on the first day for which we have data for that name. I am torn between updating the second every day until the name disappears, or leaving it NULL until the name disappears. Leaving it NULL would save space, but updating it with the current date and time should save time, since it would not be necessary to execute a complex conditional on every product ever included in the database. If we update it only for those items in today's data, those that were terminated before today will not have their date/time stamp updated, so the two fields will always represent the time interval for which the name is valid. A similar logic applies to the product ID.
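Concretely, I picture something like the following; the names are again placeholders, and the daily refresh assumes the staging table from the earlier sketch:

    -- Name lookup with a valid-time interval per (product, name) pair
    CREATE TABLE product_name (
        product_id INTEGER      NOT NULL,
        name       VARCHAR(100) NOT NULL,
        first_seen DATETIME     NOT NULL,
        last_seen  DATETIME     NOT NULL,
        PRIMARY KEY (product_id, name)
    );

    -- Daily refresh: touch last_seen for every pair in today's feed
    UPDATE product_name pn
    JOIN stg_product s
      ON s.product_id = pn.product_id AND s.product_name = pn.name
    SET pn.last_seen = NOW();

    -- ...and open an interval for any pair seen for the first time
    INSERT INTO product_name (product_id, name, first_seen, last_seen)
    SELECT s.product_id, s.product_name, NOW(), NOW()
    FROM stg_product s
    LEFT JOIN product_name pn
      ON pn.product_id = s.product_id AND pn.name = s.product_name
    WHERE pn.product_id IS NULL;

Note that the multi-table UPDATE above is a MySQL extension rather than ANSI SQL (the standard form would use a correlated subquery), which is part of what prompts my question about stored procedures below.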
Have I overlooked anything that is obvious to you?
Any gotchas I should be aware of?
What opportunities for improving performance do you
see?
The ultimate objective is to have this fully automated from a shell script that is configured by our sysop to execute at a specific time every day, Monday through Friday. Can all of the steps I describe above be implemented using ANSI standard SQL, and thus put into a stored procedure, or should I look at doing some of it in a Perl script or Java application? I suppose I will need at least a basic Java application or Perl script, if only to connect to the database and invoke any functions I have created to do this.
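For what it's worth, the driver I picture is little more than a SQL script piped into the command-line client by the cron job; the procedure names below are hypothetical, and I believe MySQL does not allow LOAD DATA inside a stored procedure, which is part of why I raise the question:

    -- daily_feed.sql, run as:  mysql mydb < daily_feed.sql
    CREATE TEMPORARY TABLE stg_product (
        product_id   INTEGER      NOT NULL,
        product_name VARCHAR(100) NOT NULL
    );

    -- issued here because LOAD DATA cannot live in a procedure
    LOAD DATA INFILE '/path/to/feed/product.csv'
    INTO TABLE stg_product
    FIELDS TERMINATED BY ',';

    -- steps 4-6 are plain DML, so they could be wrapped in
    -- stored procedures created once, ahead of time
    CALL daily_analysis();
    CALL archive_and_stamp();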
Any suggestions would be appreciated.
Thanks,
Ted
R.E. (Ted) Byers, Ph.D., Ed.D.
R & D Decision Support Solutions
http://www.randddecisionsupportsolutions.com/