2009/3/11 דניאל דנון <danondaniel@xxxxxxxxx>

> *Handling (very) large files with PHP*
>
> Hello, I am planning a project in PHP, and I have a few unsolved issues
> that I'd like you to help me with...
>
> The project will start by loading a file of about 50GB.
> The file has many objects following a pattern, for example:
>
> Name: Joe
> Joe likes to eat
> -------------------
> Name: Daniel
> Daniel likes to ask questions on the PHP Mailing List
>
> Anyway, I am going to convert it into a database, and I insist on using
> PHP for this.
>
> So the questions are:
> How would I open the file? Will fopen and fread($file, 1024) work? If so,
> how would I find the separator, "-------------------", without taking too
> many resources?

I see no problem with this. I have nightly cron jobs written in PHP that
process tens of GBs of log files each night without any problems. I would
use fgets rather than fread to get each line one at a time.

> I'll have a dedicated server for this project so I could use exec, so I
> am wondering if I should use exec to split the file?

Why? PHP can easily handle a file this big so long as you don't try to
load it all in at once.

> How many hours or days do you think it will take me to insert all of the
> data, if I have about 8,000,000,000 (8 billion/milliard) entries
> (objects)?

The only way to know this is to try it. Assuming you're on a Linux machine,
use head to slice a chunk off the top of the file and process that. Then
multiply by however many of those chunks there are in 50GB. That, with a
reasonable margin of error, will be how long it will take to process the
full file.

> After I insert all the data, I'll have to start working with it as well -
> for example, having a list of all people and what comes after the word
> "likes" in their entry.

I would approach this by organising the data into lookup tables while you
read the file. It will make processing it a lot quicker. Doing something
like that is way beyond the scope of this list, but Google will be able to
point you at the data structures and algorithms you'll need.

> What do you suggest? I am concerned I might not be able to fully
> accomplish both high speed when working (example above) and high speed
> when watching the data and adding more "works" (as stated above) with
> PHP. What do you think?
> Since inserting to the database, after considering it, will probably be
> with C. But if I wish to work with it - will PHP be good?

PHP will be slower than C, but by how much is difficult to say. If you can
code in C and are happy using the C API for whatever DB you decide to use,
that would be far better, but you have to weigh that against the additional
development time. If you have a meaty server I'd say you can sacrifice
runtime speed for development speed.

> What database should I use for so much info?

It really doesn't matter, so long as you set up the right indexes before
you import the file. Without the right indexes Oracle is just as slow as
MySQL, which is just as slow as Postgres. If you want an opinion, I'd
probably use Postgres for a dataset that big, but MySQL shouldn't have a
problem with it.

-Stuart

--
http://stut.net/
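
P.S. For what it's worth, here is a rough (completely untested) sketch of
the fgets approach described above: read one line at a time, treat the
dashed line as a record separator, and insert each record as you go. The
connection string, table and column names are just placeholders, as is the
batch size - adjust to suit whatever schema and database you end up with.

<?php
// Untested sketch: stream the 50GB dump line by line and insert one row
// per record. Placeholder DSN, table and column names throughout.
$fh = fopen('dump.txt', 'r');
if ($fh === false) {
    die("Unable to open file\n");
}

$db = new PDO('pgsql:host=localhost;dbname=stuff', 'user', 'pass');
$stmt = $db->prepare('INSERT INTO people (name, details) VALUES (?, ?)');

$db->beginTransaction();
$record = array('name' => null, 'details' => '');
$count = 0;

while (($line = fgets($fh)) !== false) {
    $line = rtrim($line, "\r\n");

    if (strpos($line, '---') === 0) {
        // Separator line - flush the record we have been building up.
        if ($record['name'] !== null) {
            $stmt->execute(array($record['name'], $record['details']));
            if (++$count % 10000 == 0) {
                // Commit in batches rather than once per row.
                $db->commit();
                $db->beginTransaction();
            }
        }
        $record = array('name' => null, 'details' => '');
    } elseif (strpos($line, 'Name: ') === 0) {
        $record['name'] = substr($line, 6);
    } elseif ($line !== '') {
        $record['details'] .= ($record['details'] === '' ? '' : "\n") . $line;
    }
}

// The last record may not be followed by a separator.
if ($record['name'] !== null) {
    $stmt->execute(array($record['name'], $record['details']));
}

$db->commit();
fclose($fh);

Run the same script against a chunk you cut off with head to get your
timing estimate before letting it loose on the full file.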