On Monday 20 March 2006 08:25 am, Ted Byers wrote:

> I have to deal with the following situation.
>
> I get a daily feed of data for several tables. This data is retrieved from
> the supplier by ftp in the form of CSV text files. I know I can load the
> files using "load data". The relational structure between the files and
> tables is simple: one file contains data for a lookup table. But the data
> in the lookup table is dynamic, potentially changing from one day to the
> next (though in practice, the rate of change is likely to be very slow). I
> will also need to maintain a history in such a way that any and all changes
> in the data can be fully audited. This means that, for auditing purposes,
> I will need to be able to extract the valid time for each item in the
> lookup table. The only likely change in the data will be that occasionally
> items will be added to or removed from the lookup table. None of the data
> already present in the database will be editable (data validation happens
> before I get the feed). The only analysis of which I have been told is
> done daily, as the feed is received. At present, no one looks at earlier
> data (although I expect that will change as I create a better structure
> for the database to support audits).
>
> I am trying to create a flexible design so that refactoring will be simple
> if and when the assumptions or practices need to be changed.
>
> I know how to handle all this using a brute-force approach, but I expect
> that approach will be painfully slow. So here is what I am planning.
>
> 1) Create a suite of tables corresponding to the files in the feed, with
> the addition of a date/time stamp containing the date and time on which
> the data being processed was received.
> 2) Create a suite of temporary tables corresponding to the tables created
> in step 1 (but without the date/time stamp).
> 3) Load the data into the temporary tables.
> 4) Analyse the data while it is in the temporary tables, storing the
> result of the analysis in new tables.
> 5) Copy the data into the permanent tables, and add the date/time stamp
> for the data (this date/time stamp is not present in the files retrieved).
> 6) Free the temporary tables.
>
> Now, this second-last step is brute force, adequate for all but one of the
> tables: the lookup table. If I stick with the brute-force approach, the
> lookup table will waste a significant amount of space. This won't be much
> initially, but it is guaranteed to get worse as time passes, and I'd
> expect the lookup performance to degrade as the amount of data in the
> lookup table increases.
>
> Each record in the lookup table represents a product, and both the name
> and the product will have valid time intervals that may not be related.
> The name may change because the product has been renamed for whatever
> reason (and we don't care why), or the name may disappear altogether
> because the product has been discontinued. We can distinguish the two
> cases because each product has an ID that remains valid while the product
> exists, and the ID won't be in the data at all if the product is
> discontinued.
>
> I am considering creating an additional table just to look up product
> names, but with two date/time stamps. The first would represent the first
> time the product name appears in the data, and the last would represent
> the last time the product name is present in the data. The first of these
> would be edited only once, and that is on the first day for which we have
> data. I am torn between updating the last of these every day, until the
> name disappears, or leaving it null until the name disappears.
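For concreteness, the two-timestamp name table described above might be sketched as follows. This is a minimal sketch using SQLite via Python purely for illustration (a real deployment would use MySQL's LOAD DATA and a stored procedure or Perl/Java glue); all table and column names are hypothetical, and this version takes the "update last_seen on every feed" option:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical name-lookup table: first_seen is written once, on the first
# feed in which a (product_id, name) pair appears; last_seen is refreshed
# on every feed in which it appears, so the pair of stamps always brackets
# the interval over which the name was valid.
cur.execute("""
    CREATE TABLE product_name (
        product_id INTEGER,
        name       TEXT,
        first_seen TEXT,   -- date the name first appeared in a feed
        last_seen  TEXT,   -- date the name last appeared in a feed
        PRIMARY KEY (product_id, name)
    )
""")

def record_feed(rows, feed_date):
    """Register the (product_id, name) pairs from one day's feed."""
    for product_id, name in rows:
        # Insert on first sighting; on later sightings just bump last_seen.
        # Items absent from today's feed are simply not touched, so a
        # discontinued name keeps the last date on which it was seen.
        cur.execute("""
            INSERT INTO product_name (product_id, name, first_seen, last_seen)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (product_id, name)
            DO UPDATE SET last_seen = excluded.last_seen
        """, (product_id, name, feed_date, feed_date))
    conn.commit()

record_feed([(1, "Widget"), (2, "Gadget")], "2006-03-20")
record_feed([(1, "Widget")], "2006-03-21")  # Gadget absent: discontinued

cur.execute("SELECT name, first_seen, last_seen FROM product_name ORDER BY name")
print(cur.fetchall())
# → [('Gadget', '2006-03-20', '2006-03-20'), ('Widget', '2006-03-20', '2006-03-21')]
```

The upsert means no conditional scan over every product ever recorded; only today's rows are touched.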
> Leaving it null would save on space, but updating it with the current
> date and time should save time, since it would not be necessary to
> execute a complex conditional on every product ever included in the
> database. If we update it only for those items in today's data, those
> that were terminated before today will not have their date/time stamp
> updated, so the two fields will always represent the time interval for
> which the name is valid. A similar logic applies to the product ID.
>
> Have I overlooked anything that is obvious to you? Any gotchas I should
> be aware of?
>
> What opportunities for improving performance do you see?
>
> The ultimate objective is to have this fully automated from a shell
> script that is configured by our sysop to execute at a specific time
> every day, Monday through Friday. Can all of the steps I describe above
> be implemented using ANSI-standard SQL, and thus put into a stored
> procedure, or should I look at doing some of it in a Perl script or Java
> application? I suppose I will have to have at least a basic Java
> application, or Perl script, if only to connect to the database and
> invoke any functions I have created to do this.
>
> Any suggestions would be appreciated.
>
> Thanks,
>
> Ted
>
> R.E. (Ted) Byers, Ph.D., Ed.D.
> R & D Decision Support Solutions
> http://www.randddecisionsupportsolutions.com/

I faced a similar problem where I was trying to keep track of changes to a
FoxPro database that I only had indirect access to. My solution followed
your proposal to a degree. I imported the new data on a daily basis to
holding tables. I then ran a series of functions to compare the data in
the holding tables to the data in my 'real' tables. The differences
(added, deleted, changed) were written to audit tables with a timestamp.
The 'real' tables were TRUNCATEd and the new data transferred into them,
and then the holding tables were cleaned out. This way my 'real' tables
only contained the minimum data necessary.
The audit tables grew, but were not queried as much as the 'real' tables,
so the apparent speed of the lookup process stayed relatively stable.

-- 
Adrian Klaver
aklaver@xxxxxxxxxxx
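The compare-then-swap approach described in the reply can be sketched along these lines. Again this is only an illustration, using SQLite via Python; the schema, table names, and column names are all hypothetical stand-ins for the actual feed tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: the 'real' table, a holding table for the new feed,
# and an audit table recording differences with a timestamp.
cur.executescript("""
    CREATE TABLE products         (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products_holding (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products_audit   (id INTEGER, name TEXT,
                                   change TEXT, changed_at TEXT);
""")

def apply_feed(feed_rows, stamp):
    cur.execute("DELETE FROM products_holding")
    cur.executemany("INSERT INTO products_holding VALUES (?, ?)", feed_rows)

    # Rows in the feed but not in the real table: additions.
    cur.execute("""
        INSERT INTO products_audit
        SELECT h.id, h.name, 'added', ? FROM products_holding h
        WHERE NOT EXISTS (SELECT 1 FROM products p WHERE p.id = h.id)
    """, (stamp,))
    # Rows in the real table but not in the feed: deletions.
    cur.execute("""
        INSERT INTO products_audit
        SELECT p.id, p.name, 'deleted', ? FROM products p
        WHERE NOT EXISTS (SELECT 1 FROM products_holding h WHERE h.id = p.id)
    """, (stamp,))
    # Rows in both, but with a different name: changes.
    cur.execute("""
        INSERT INTO products_audit
        SELECT h.id, h.name, 'changed', ? FROM products_holding h
        JOIN products p ON p.id = h.id AND p.name <> h.name
    """, (stamp,))

    # Swap: the real table only ever holds the current day's data,
    # so its size and lookup speed stay stable.
    cur.execute("DELETE FROM products")
    cur.execute("INSERT INTO products SELECT * FROM products_holding")
    conn.commit()

apply_feed([(1, "Widget"), (2, "Gadget")], "2006-03-20 08:00")
apply_feed([(1, "Widget Mk II"), (3, "Sprocket")], "2006-03-21 08:00")

cur.execute("SELECT id, change FROM products_audit ORDER BY changed_at, change, id")
print(cur.fetchall())
# → [(1, 'added'), (2, 'added'), (3, 'added'), (1, 'changed'), (2, 'deleted')]
```

The full history lives only in the audit table, which is queried rarely, while the 'real' table stays at the minimum size needed for the daily analysis.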