Re: Pulling from Multiple Databases

Richard Quadling <rquadling@xxxxxxxxx> · Tue, 1 Feb 2011 16:46:42 +0000

On 1 February 2011 16:39, Jon Hood <squinky86@xxxxxxxxx> wrote:
> (comments in-line)
>
> On Tue, Feb 1, 2011 at 10:34 AM, Richard Quadling <rquadling@xxxxxxxxx>
> wrote:
>>
>> I use a data warehouse (a semi denormalized db) to hold data from
>> around 200 different data sources (DB, Excel spreadsheets, Web, etc.)
>>
>> I use multiple scripts to update the DB, each one tuned to a
>> particular frequency.
>>
>
> A different script for each database is a possibility. It's a little extra
> load on the app server, but it should be able to handle it. Maybe with
> pcntl_fork? I haven't explored this option much.
>
>>
>> My main app's queries are always against the data warehouse.
>>
>> That way, the live front end isn't worried about getting the source data.
>>
>> If the data is needed live, then I'd be looking to see if I can get a
>> live data feed from the source system. Essentially a push of the data
>> to my data warehouse - or a staging structure to allow locally defined
>> triggers to clean/process the data upon arrival.
>>
>> Automation is the key for me here. Rather than trying to do everything
>> for the request, respond to the changes in the data or live with the
>> fact that the data may be stale.
>>
>> Can you give us any clues as to the sort of app you are building? The
>> sort of data you are working on? Are you running your own servers?
>
> Data are needed live. 3 of the databases are MySQL. 14 are XML files that
> change frequently. 3 are JSON. 1 is Microsoft SQL Server 2005. The main app
> is running on linux (distribution doesn't matter - currently Debian, but I
> can change it to whatever if there's a reason). Most is financial data that
> needs ordered.
>
> I'm going to explore the pcntl_fork option some more...
>
> Thanks!
> Jon
>

If you are in control of the data, there are some things that may be useful.

1 - For tables that require syncing, I've added a timestamp column
that is updated automatically whenever a row changes. In the code
handling the sync, I know that I don't need to retrieve any data if
the most recent timestamp is the same as the one I last got.
2 - For physical files, assuming the last-modified timestamp is
maintained, you again have an indicator that tells you whether you
need to process any data at all.
3 - For JSON ... if it is coming to you over the web, check the
response headers (Last-Modified or ETag, for example) to see whether
the server is handing you the same version you already have. If
nothing has changed, you can skip the processing.
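
A rough sketch of the three checks above, written in Python only to
keep it short. Everything named here is an assumption for
illustration: the `trades` table, its integer `updated_at` column, and
the helper names are hypothetical, and the HTTP helper assumes the
remote server sends an ETag header and honours If-None-Match.

```python
import os
import sqlite3
import urllib.error
import urllib.request


def sync_table(conn, last_seen):
    """Return (changed_rows, new_mark); skips retrieval if nothing moved."""
    newest = conn.execute("SELECT MAX(updated_at) FROM trades").fetchone()[0]
    if newest is None or newest == last_seen:
        return [], last_seen  # most recent timestamp unchanged: nothing to do
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM trades "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    return rows, newest


def file_changed(path, last_mtime):
    """Return (changed, mtime) by comparing the file's last-modified time."""
    mtime = os.stat(path).st_mtime_ns
    return mtime != last_mtime, mtime


def fetch_if_changed(url, etag=None):
    """Conditional GET: returns (body, etag); body is None on 304."""
    headers = {"If-None-Match": etag} if etag else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # server says our copy is still current
            return None, etag
        raise
```

Start `sync_table` with a mark of 0 and `file_changed` with None, then
feed back whatever mark each call returns. Only the rows, files, or
responses that actually changed need the expensive parsing step.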

In the best-case scenario, you poll all the sources, find that none
of them have any new data, and supply the data you already have
(caching the data is pretty much essential).
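
The best case can be sketched like this, again in Python and again
only as an illustration. `CachedSource`, `collect`, and the two
callables passed in are hypothetical names: `get_marker` stands for
whatever cheap check a source offers (latest timestamp, file
modification time, ETag) and `load` for the expensive fetch-and-parse.

```python
class CachedSource:
    """Wrap one source and reload it only when its change marker moves."""

    def __init__(self, get_marker, load):
        self.get_marker = get_marker  # cheap "has anything changed?" probe
        self.load = load              # expensive fetch-and-parse step
        self.marker = None
        self.data = None
        self.loaded = False

    def current(self):
        marker = self.get_marker()
        if not self.loaded or marker != self.marker:
            # First call, or the source reports a new marker: refresh cache.
            self.data = self.load()
            self.marker = marker
            self.loaded = True
        return self.data


def collect(sources):
    """Poll every source; unchanged ones are answered from the cache."""
    return {name: source.current() for name, source in sources.items()}
```

When no marker has moved, `collect` costs one cheap probe per source
and no reloads, which is the "supply the data you already have" path.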

In the worst-case scenario, you have to wait until all the data is
polled and stored. That is probably no worse than where you are now.

Richard.

-- 
Richard Quadling
Twitter : EE : Zend
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY

-- 
PHP General Mailing List (http://www.php.net/)