> -----Original Message-----
> From: Jon Westcot [mailto:jon@xxxxxxxxxxx]
> Sent: Saturday, November 24, 2007 4:32 AM
> To: PHP General
> Subject: Performance question for table updating
>
> Hi all:
>
>     For those who've been following the saga, I'm working on an
> application that needs to load a data file of approximately 29,000 to
> 35,000 records (and not short ones, either) into several tables. I'm
> using MySQL as the database.
>
>     I've noticed a really horrible performance difference between
> INSERTing rows into the table and UPDATEing rows with new data when
> they already exist in the table. For example, when I first start with
> an empty table, the application inserts around 29,600 records in
> something less than 6 minutes. But when I use the second file,
> updating that same table takes over 90 minutes.
>
>     Here's my question: I had assumed -- probably wrongly -- that it
> would be far more expedient to update only the rows where data had
> actually changed and, moreover, to update only the changed fields in
> those rows. This involves a large number of if statements, i.e.,
>
>     if ($old_row["field_a"] !== $new_row["field_66"]) {
>         $update_query .= "field_a = '" .
>             mysql_real_escape_string($new_row["field_66"]) . "',";
>     }
>
>     Eventually, I wind up with a query similar to:
>
>     UPDATE table_01 SET field_a = 'New value here',
>         updated = CURDATE() WHERE primary_key = 12345
>
>     I thought that, to keep the table updating to a minimum, this
> approach made the most sense. However, seeing the two hugely different
> run times has made me question whether it would be faster to simply
> update every field in the table and eliminate all of these test
> conditions.
>
>     And before someone comments that indexes on the table can cause
> performance hits: I DROP nearly all of the indexes at the start of the
> processing, keeping only those needed for the original INSERT or the
> subsequent UPDATE, and then add all of the extra "steroid" indexes
> (you know -- the performance-enhancing ones <g>) after all of the
> INSERTs and UPDATEs have finished.
>
>     So, long story short (oops -- too late!), what's the consensus
> among the learned assembly here? Is it faster to just UPDATE the
> record if it already exists, even though maybe only one or two out of
> 75 or more fields changed, versus testing each of those 75 fields to
> figure out which ones actually changed and then updating only those?
>
>     I look forward to reading all of your thoughts.
>
>     Sincerely,
>
>         Jon

I don't know about consensus over here, because I'm kind of a newgie
(stands for "new geek", as opposed to newbie, which stands for "new
ball breaker" :D :D ). I don't know your previous messages, but I can
tell you one story...

Some time ago I got involved in a project that required geo-distance
calculation (you know, the distance between two points given their
latitude and longitude). Basically I had to take a set of points and
calculate the distance from each of those points to a given
(reference) one. The math was something like "the square root of the
sum of a constant times the squared sine of..." -- well, I can't
remember it, but the point is that it was a complicated formula, which
I thought would allow for some optimizations in PHP. Accustomed to
regular (compiled) programming languages, I developed a set of
routines to optimize the task, then went ahead and queried the
database for the (say, 1,000-record) set of points.
I applied the math to the points and the reference point and got the
result... in about 5 minutes, to my unpleasant surprise. So I grabbed
the MySQL manual, built a "non-optimized" version of the formula to
put directly into the SQL query, and let MySQL calculate the "shortest
distance" (which was my goal in the end) itself. I thought, "OK, I'll
go make a cup of coffee while MySQL finishes the calculation." To my
surprise, the query returned the expected result in less than 2
seconds.

My logic had (wrongly) been the following: PHP is a programming
language, SQL is a data access language; I'll get the data using MySQL
and do the math using PHP. But I forgot that PHP is an interpreted
language, and that a number is more than a number to PHP -- it's a
ZVAL_<whatever> structure behind the scenes. I forgot about the memory
and the time required to build those values when you pull data out of
a database server. I forgot about parsing time, and about the "support
logic and safety checks" in the language that swamp any attempt to
build TDCPL (Too Damn Complex Programming Logic) in PHP. So now, when
I have to apply some logic to retrieved data, I first check how much
of it I can push into the query itself, so that little or no
programming logic is left in PHP after retrieving (or before storing)
the data. There's a rough sketch of what I mean at the end of this
message.

All that said, I'd give the MySQL REPLACE statement a shot (I wouldn't
even branch the code between INSERT and UPDATE depending on whether
the record already exists; if you have a primary key, all you need is
REPLACE). But PLEASE LOOK AT THE GOTCHAS (like SET col_name =
col_name + 1). There's a sketch of that at the end, too.

Furthermore, if those data files were to be uploaded by me (I mean me,
the coder, not the end user), I'd build (or use) a program to convert
them to SQL statements on my desktop PC, where I can use faster
programming languages and can afford five minutes of heavy processing
(instead of hammering the server for five minutes and slowing down
every other service on it). The last sketch below shows that idea.

In the end it depends on your requirements, on where you get the data
from, and on whether and how you want to automate the task. (I didn't
get your previous messages -- I only subscribed recently -- so if you
can send me a link to them, great!)

Rob
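
P.S. Here's roughly what I mean by pushing the math into the query.
This is a sketch from memory, not the code from that project: the
connection details, the points table with its lat/lng columns, and the
reference coordinates are all invented, and the formula here is the
haversine formula, which may or may not be the one that project
actually used. The point is that MySQL does the trig for every row, so
PHP never loops over the result set doing math:

<?php
// Sketch only: hypothetical connection details, table and column names.
$link = mysql_connect('localhost', 'user', 'secret');
mysql_select_db('geo', $link);

$refLat = 40.7128;     // made-up reference point
$refLng = -74.0060;

// Haversine distance in kilometres, computed entirely by MySQL.
$sql = sprintf(
    "SELECT id,
            2 * 6371 * ASIN(SQRT(
                POW(SIN(RADIANS(lat - %f) / 2), 2) +
                COS(RADIANS(%f)) * COS(RADIANS(lat)) *
                POW(SIN(RADIANS(lng - %f) / 2), 2)
            )) AS distance_km
       FROM points
   ORDER BY distance_km
      LIMIT 1",
    $refLat, $refLat, $refLng
);

$row = mysql_fetch_assoc(mysql_query($sql, $link));
echo "Nearest point: {$row['id']}, about {$row['distance_km']} km away\n";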
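
And here's the kind of thing I mean by REPLACE, using the table and
column names from your own example (table_01, primary_key, field_a,
updated); the connection details and the sample record are made up:

<?php
// Sketch only: hypothetical connection details and sample record.
// REPLACE needs the PRIMARY KEY (or a UNIQUE index) to notice that a row
// already exists; it then deletes the old row and inserts the new one,
// so list every column whose value you want to keep.
mysql_connect('localhost', 'user', 'secret');
mysql_select_db('mydb');

$new_row = array('primary_key' => 12345, 'field_66' => 'New value here');

$sql = sprintf(
    "REPLACE INTO table_01 (primary_key, field_a, updated)
     VALUES (%d, '%s', CURDATE())",
    (int) $new_row['primary_key'],
    mysql_real_escape_string($new_row['field_66'])
);
mysql_query($sql) or die(mysql_error());

The gotcha I was thinking of: because REPLACE really is a DELETE
followed by an INSERT, you can't refer to the row's old values the way
SET col_name = col_name + 1 does -- that idiom belongs to
INSERT ... ON DUPLICATE KEY UPDATE -- and any column you leave out of
the REPLACE reverts to its default instead of keeping what was there.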
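
Finally, the "convert the file on my own PC" part, again just a
sketch: the file name, the CSV layout, and the column order are all
assumptions about data I haven't seen:

<?php
// Sketch only: turn the data file into a load.sql full of REPLACE
// statements, to be fed to the server in one go later.
$in  = fopen('datafile.csv', 'r');
$out = fopen('load.sql', 'w');

while (($fields = fgetcsv($in)) !== false) {
    $fields = array_map('addslashes', $fields);   // offline, so no server round-trip for escaping
    fwrite($out, sprintf(
        "REPLACE INTO table_01 (primary_key, field_a, updated) VALUES (%d, '%s', CURDATE());\n",
        (int) $fields[0],
        $fields[1]
    ));
}

fclose($in);
fclose($out);

Then, on the server, something like: mysql your_database < load.sql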