shiplu wrote:
On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso <dante@xxxxxxxxxxx> wrote:
All,
I'm loading millions of records into a backend PHP cli script that I
need to build a hash index from to optimize key lookups for data that
I'm importing into a MySQL database. The problem is that storing this
data in a PHP array is not very memory efficient and my millions of
records are consuming about 4-6 GB of ram.
What are you storing? An array of row objects??
In that case, storing only the row ids will reduce the memory.
I am querying a MySQL database which contains 40 million records and
mapping string columns to numeric ids. You could think of it as
normalizing the data.
Then, I am importing a new 40 million records and comparing the new
values to the old values. Where the value matches, I update records,
but where they do not match, I insert new records, and finally I go back
and delete old records. So, the net result is that I have a database
with 40 million records that I need to "sync" on a daily basis.
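The compare step of that sync can be expressed as set operations on key => value maps. Here is a self-contained sketch of the idea with small in-memory arrays standing in for the old and new datasets (the real data would come from MySQL, of course):

```php
<?php
// Sketch of the daily sync as set operations on key => value maps.
// $old stands in for the existing 40M rows, $new for the fresh import.
$old = array('a' => 1, 'b' => 2,  'c' => 3);
$new = array('a' => 1, 'b' => 20, 'd' => 4);

$toInsert = array_diff_key($new, $old);   // keys only in the new data
$toDelete = array_diff_key($old, $new);   // keys only in the old data

$toUpdate = array();
foreach (array_intersect_key($new, $old) as $k => $v) {
    if ($old[$k] !== $v) {
        $toUpdate[$k] = $v;               // key exists, value changed
    }
}
```

After this, `$toInsert` holds the new records, `$toDelete` the stale ones, and `$toUpdate` the changed ones, which maps directly onto the INSERT/DELETE/UPDATE phases described above.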
If you are loading full row objects, it will take a lot of memory.
But if you just load the row id values, it will use far less memory.
For what I am trying to do, I just need to map a string value (32 bytes)
to a bigint value (8 bytes) in a fast-lookup hash.
Besides, you can load row ids chunk by chunk. If you have 10 million
rows to process, load 10,000 rows as a chunk, process them, then load
the next chunk. This will significantly reduce memory usage.
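The chunked loading shiplu describes could look roughly like this. To keep the example self-contained it uses an in-memory SQLite database via PDO, and the table and column names are made up; the same pattern applies to MySQL:

```php
<?php
// Self-contained sketch of chunk-by-chunk loading (hypothetical schema).
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE records (id INTEGER PRIMARY KEY, str_key TEXT)');
for ($i = 1; $i <= 25; $i++) {
    $pdo->exec("INSERT INTO records (str_key) VALUES ('key$i')");
}

$chunkSize = 10;
$lastId = 0;
$count = 0;
while (true) {
    // Keyset pagination (WHERE id > :last) stays fast on big tables,
    // unlike LIMIT ... OFFSET, which rescans skipped rows.
    $stmt = $pdo->prepare(
        'SELECT id, str_key FROM records WHERE id > :last ORDER BY id LIMIT :lim'
    );
    $stmt->bindValue(':last', $lastId, PDO::PARAM_INT);
    $stmt->bindValue(':lim', $chunkSize, PDO::PARAM_INT);
    $stmt->execute();
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break;                       // no more chunks
    }
    foreach ($rows as $row) {
        $count++;                    // process each row here
        $lastId = (int) $row['id'];
    }
}
```

Only one chunk is ever resident in PHP memory at a time, which is the whole point.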
When importing the fresh 40 million records, I need to compare each
record with 4 different indexes that will map the record to existing
other records, or into a "group_id" that the record also belongs to. My
current solution uses a trigger in MySQL that will do the lookups inside
MySQL, but this is extremely slow. Pre-loading the MySQL indexes into
PHP ram and processing that way is thousands of times faster.
I just need an efficient way to hold my hash tables in PHP ram. PHP
arrays are very fast, but like my original post says, they consume way
too much ram.
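For anyone who wants to see the overhead for themselves, it is easy to measure (the exact bytes-per-entry figure varies a lot by PHP version and platform, so I have not hard-coded one):

```php
<?php
// Rough measurement of PHP array overhead for a 32-char-key => int map.
$n = 100000;
$before = memory_get_usage();
$map = array();
for ($i = 0; $i < $n; $i++) {
    $map[md5((string) $i)] = $i;   // 32-character key => int value
}
$after = memory_get_usage();
$perEntry = ($after - $before) / $n;
```

The payload is only 40 bytes per entry (32-byte key + 8-byte int), but `$perEntry` comes out far higher, which is exactly the problem with 40 million entries.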
A good algorithm can solve your problem anytime. ;-)
It currently takes about 5-10 minutes to build my hash indexes in PHP
ram, which is more than paid for by the 10,000x speedup on key lookups
that I get later on. I just want to not use the whole 6 GB of ram to do this.
I need an efficient hashing API that supports something like:
$value = (int) fasthash_get((string) $key);
$exists = (bool) fasthash_exists((string) $key);
fasthash_set((string) $key, (int) $value);
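Lacking such an extension, here is one way I could imagine faking it in userland (this is my own sketch, not an existing API): pack every entry into one big binary string of fixed 40-byte records (32-byte key + 8-byte big-endian value) and binary-search it. That trades the array's per-element overhead for a flat 40 bytes per entry, at the cost of O(log n) lookups and a one-time sort:

```php
<?php
// Sketch of a packed string => int index: one contiguous binary string of
// 40-byte records (32-byte key + 8-byte value), looked up via binary search.
class PackedHash {
    private $data  = '';
    private $count = 0;

    public function set($key, $value) {
        $k = str_pad(substr($key, 0, 32), 32, "\0");
        $this->data .= $k . pack('J', $value);  // 'J': unsigned 64-bit big-endian
        $this->count++;
    }

    // Call once after all set() calls, before any lookups.
    public function finalize() {
        $recs = str_split($this->data, 40);
        sort($recs, SORT_STRING);               // records sort by their key prefix
        $this->data = implode('', $recs);
    }

    public function get($key) {
        $k  = str_pad(substr($key, 0, 32), 32, "\0");
        $lo = 0;
        $hi = $this->count - 1;
        while ($lo <= $hi) {
            $mid = ($lo + $hi) >> 1;
            $cmp = strcmp($k, substr($this->data, $mid * 40, 32));
            if ($cmp === 0) {
                return unpack('J', substr($this->data, $mid * 40 + 32, 8))[1];
            }
            if ($cmp > 0) { $lo = $mid + 1; } else { $hi = $mid - 1; }
        }
        return null;                            // key not found
    }

    public function exists($key) {
        return $this->get($key) !== null;
    }
}
```

One caveat: `finalize()` as written temporarily explodes the string into a PHP array to sort it, so for 40 million entries you would want to insert in already-sorted key order, or sort on disk, instead.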
Or ... it feels like a "memcached" API, but where the data is stored
locally instead of accessed over a network. That is how my search led
me to what appears to be a dead "lchash" extension.
-- Dante
----------
D. Dante Lorenso
dante@xxxxxxxxxxx
972-333-4139
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php