Re: memory efficient hash table extension? like lchash ...

J Ravi Menon wrote:
PHP does expose sys V shared-memory apis (shm_* functions):
http://us2.php.net/manual/en/book.sem.php


I will look into this. I really need a key/value map, though, and would rather not have to write my own on top of SHM.
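If I did go that route, a hand-rolled map over the shm_* calls would probably look something like the sketch below (the shm_map_* helper names are made up, and collisions are not handled), which is part of why I'd rather avoid it:

        <?php
        // Rough sketch only -- shm_put_var() keys must be integers, so the
        // string key is hashed down to an int with crc32(); collisions are
        // not handled here.
        $shm = shm_attach(ftok(__FILE__, 'a'), 64 * 1024 * 1024); // 64 MB segment

        function shm_map_set($shm, $key, $value)
        {
            shm_put_var($shm, crc32($key), $value);
        }

        function shm_map_get($shm, $key)
        {
            $slot = crc32($key);
            return shm_has_var($shm, $slot) ? shm_get_var($shm, $slot) : null;
        }

        shm_map_set($shm, 'abc123', 42);
        var_dump(shm_map_get($shm, 'abc123')); // int(42)
        shm_detach($shm);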


If you already have apc installed, you could also try:
http://us2.php.net/manual/en/book.apc.php
APC also allows you to store user-specific data (it will be in shared memory).


I've looked into the apc_store and apc_fetch routines:
http://php.net/manual/en/function.apc-store.php
http://www.php.net/manual/en/function.apc-fetch.php
... but quickly ran out of memory for APC, and though I figured out how to configure it to use more (by adjusting the shared memory allotment), there were other problems. I ran into issues with logs complaining about "cache slamming" and other known bugs in APC version 3.1.3p1. Also, after several million values were stored, the APC storage began to slow down *dramatically*. I wasn't certain whether APC was using only RAM or was possibly also writing to disk. Performance tanked so quickly that I set it aside as an option and moved on.
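For reference, this is roughly the pattern I was testing, with the shared-memory allotment raised in php.ini (exact ini syntax varies by APC version):

        ; php.ini -- raise the shared memory allotment
        ; apc.shm_size = 1024M

        <?php
        // apc_store() returns false when it fails (e.g. the cache is full).
        $key   = 'abc123...';        // 32-byte hash key
        $value = 123456789;          // bigint id

        if (!apc_store($key, $value)) {
            fwrite(STDERR, "apc_store failed for $key\n");
        }

        $id = apc_fetch($key, $success);   // $success === false on a miss
        if ($success) {
            echo "found: $id\n";
        }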


I haven't tried these myself, so I would do some quick tests to ensure
they meet your performance requirements. In theory, it should be
faster than Berkeley DB-like solutions (which is another option,
but it seems something similar such as MongoDB was not good enough?).


I will run more tests against MongoDB. Initially I tried to use it to store everything. If I only store my indexes, it might fare better. Certainly, though, running queries and updates against a remote server will always be slower than doing the lookups locally in RAM.
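If I do retry it, the plan would be roughly this, using the Mongo PHP driver against a local mongod (sketch only; collection and field names are placeholders):

        <?php
        // Store just the string->id index in MongoDB, not the full records.
        $m   = new Mongo('mongodb://localhost:27017');
        $col = $m->selectDB('sync')->selectCollection('key_index');

        $col->ensureIndex(array('k' => 1), array('unique' => true));

        // while building the index:
        $hashKey  = 'abc123...';   // 32-byte key
        $bigintId = 987654321;
        $col->insert(array('k' => $hashKey, 'id' => $bigintId));

        // during the import, per record:
        $doc = $col->findOne(array('k' => $hashKey), array('id' => true));
        $id  = $doc ? $doc['id'] : null;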


I am curious to know if someone here has run these tests. Note that
with memcached installed locally (on the same box running PHP), it can
be surprisingly efficient - using pconnect(), caching the handler in
a static var for a given request cycle, etc.

memcached gives no guarantee about data persistence. I need to have a hash table that will contain all the values I set. They don't need to survive a server shutdown (they don't need to be written to disk), but I cannot afford for the server to throw away values that don't fit into memory. If there is a way to configure memcached to guarantee storage, that might work.
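If I understand the docs correctly, starting memcached with -M makes it return an error when memory is exhausted instead of evicting items, so a dropped value would at least be detectable. Untested sketch, local daemon assumed:

        # start a local daemon with 4 GB and eviction disabled
        memcached -m 4096 -M

        <?php
        $mc = new Memcached();
        $mc->addServer('127.0.0.1', 11211);

        if (!$mc->set('abc123...', 123456789)) {
            // with -M this fails instead of silently evicting something else
            fwrite(STDERR, "memcached set failed: " . $mc->getResultMessage() . "\n");
        }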

-- Dante


On Sun, Jan 24, 2010 at 9:39 AM, D. Dante Lorenso <dante@xxxxxxxxxxx> wrote:
shiplu wrote:
On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso <dante@xxxxxxxxxxx>
wrote:
All,

I'm loading millions of records into a backend PHP cli script that I
need to build a hash index from to optimize key lookups for data that
I'm importing into a MySQL database.  The problem is that storing this
data in a PHP array is not very memory efficient and my millions of
records are consuming about 4-6 GB of ram.

What are you storing? An array of row objects?
In that case, storing only the row id will reduce the memory.
I am querying a MySQL database which contains 40 million records and mapping
string columns to numeric ids.  You could think of it as normalizing the data.

Then, I am importing a new 40 million records and comparing the new values
to the old values.  Where the value matches, I update records, but where
they do not match, I insert new records, and finally I go back and delete
old records.  So, the net result is that I have a database with 40 million
records that I need to "sync" on a daily basis.

If you are loading full row objects, it will take a lot of memory.
But if you just load the row id values, it will significantly decrease
the memory amount.
For what I am trying to do, I just need to map a string value (32 bytes) to
a bigint value (8 bytes) in a fast-lookup hash.

Besides, you can load row ids on a chunk-by-chunk basis. If you have
10 million rows to process, load 10,000 rows as a chunk, process
them, then load the next chunk.  This will significantly reduce memory
usage.
When importing the fresh 40 million records, I need to compare each record
against 4 different indexes that will map the record to other existing records,
or into a "group_id" that the record also belongs to.  My current solution
uses a trigger in MySQL that will do the lookups inside MySQL, but this is
extremely slow.  Pre-loading the MySQL indexes into PHP RAM and processing
that way is thousands of times faster.
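The pre-loading step itself is nothing fancy - roughly this, with table and column names changed:

        <?php
        // Sketch of the pre-load: pull one string->id index into a PHP array.
        // Table and column names are placeholders.
        $db  = new mysqli('localhost', 'user', 'pass', 'mydb');
        $map = array();

        $res = $db->query('SELECT hash_key, id FROM key_index', MYSQLI_USE_RESULT);
        while ($row = $res->fetch_row()) {
            $map[$row[0]] = (int) $row[1];   // 32-byte string => bigint id
        }
        $res->free();

        // later, per imported record:
        // $id = isset($map[$hash]) ? $map[$hash] : null;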

I just need an efficient way to hold my hash tables in PHP RAM.  PHP arrays
are very fast, but as my original post says, they consume way too much
RAM.

A good algorithm can solve your problem anytime. ;-)
It takes about 5-10 minutes to build my hash indexes in PHP RAM currently,
which is made up for by the 10,000x speedup on key lookups that I get later on.
 I just don't want to use the whole 6 GB of RAM to do this.   I need an
efficient hashing API that supports something like:

       $value = (int) fasthash_get((string) $key);
       $exists = (bool) fasthash_exists((string) $key);
       fasthash_set((string) $key, (int) $value);

Or ... it feels like a "memcached" API, but where the data is stored locally
instead of accessed via a network.  That is how my search led me to what
appears to be a dead "lchash" extension.
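One stand-in for that API shape, just to illustrate it, would be an in-memory SQLite table through PDO (sketch only; the function names simply mirror the hypothetical ones above, and I have not measured this at 40 million keys):

        <?php
        // Sketch: approximate the fasthash_* idea with SQLite in RAM via PDO.
        $pdo = new PDO('sqlite::memory:');
        $pdo->exec('CREATE TABLE fasthash (k TEXT PRIMARY KEY, v INTEGER)');

        function fasthash_set(PDO $pdo, $key, $value)
        {
            $st = $pdo->prepare('INSERT OR REPLACE INTO fasthash (k, v) VALUES (?, ?)');
            $st->execute(array($key, $value));
        }

        function fasthash_get(PDO $pdo, $key)
        {
            $st = $pdo->prepare('SELECT v FROM fasthash WHERE k = ?');
            $st->execute(array($key));
            $v = $st->fetchColumn();
            return $v === false ? null : (int) $v;
        }

        fasthash_set($pdo, str_repeat('a', 32), 123456789);
        var_dump(fasthash_get($pdo, str_repeat('a', 32))); // int(123456789)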

-- Dante

----------
D. Dante Lorenso
dante@xxxxxxxxxxx
972-333-4139






--
----------
D. Dante Lorenso
dante@xxxxxxxxxxx
972-333-4139

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

