2009/3/11 דניאל דנון <danondaniel@xxxxxxxxx>

> *Handling (very) large files with PHP*
>
> Hello, I am planning a project in PHP, and I have a few unsolved issues
> that I'd like you to help me with...
>
> The project will start by loading a file of about 50GB.
> The file has many objects following a pattern, for example:
>
> Name: Joe
> Joe likes to eat
> -------------------
> Name: Daniel
> Daniel likes to ask questions on the PHP Mailing List
>
> Anyway, I am going to convert it into a database, and I insist on using
> PHP for this.
>
> So the questions are:
> How would I open the file? Will fopen and fread($file, 1024) work? If so,
> how would I find the separator, "-------------------", without taking too
> many resources?

I see no problem with this. I have nightly cron jobs written in PHP that
process tens of GBs of log files each night without any problems. I would
use fgets rather than fread to get each line one at a time.

> I'll have a dedicated server for this project so I could use exec, so I
> am wondering if I should use exec to split the file?

Why? PHP can easily handle a file this big so long as you don't try to
load it all in at once.

> How many hours or days do you think it will take me to insert all of the
> data, if I have about 8,000,000,000 (8 billion/milliard) entries
> (objects)?

The only way to know this is to try it. Assuming you're on a Linux machine,
use head to slice a chunk off the top of the file and process that. Then
multiply by however many of those chunks there are in 50GB. That, with a
reasonable margin of error, will be how long it will take to process the
full file.

> After I insert all the data, I'll have to start working with it as well -
> for example, having a list of all people and what comes after the word
> "likes" in their entry.

I would approach this by organising the data into lookup tables while you
read the file. It will make processing it a lot quicker. Doing something
like that is way beyond the scope of this list, but Google will be able to
point you at the data structures and algorithms you'll need.

> What do you suggest? I am concerned I might not be able to fully
> accomplish both high speed when working (example above) and high speed
> when watching the data and adding more "works" (as stated above) with
> PHP. What do you think?
> Since inserting to the database, after considering it, will probably be
> with C. But if I wish to work with it - will PHP be good?

PHP will be slower than C, but by how much is difficult to say. If you can
code in C and are happy using the C API for whatever DB you decide to use,
that would be far better, but you have to weigh that against the additional
development time. If you have a meaty server I'd say you can sacrifice
runtime speed for development speed.

> What database should I use for so much info?

It really doesn't matter, so long as you set up the right indexes before
you import the file. Without the right indexes Oracle is just as slow as
MySQL, which is just as slow as Postgres. If you want an opinion, I'd
probably use Postgres for a dataset that big, but MySQL shouldn't have a
problem with it.

-Stuart

--
http://stut.net/
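
P.S. For what it's worth, here is a rough (completely untested) sketch of
the fgets approach described above: read one line at a time, treat the
dashed line as a record separator, and insert each record as you go. The
connection string, table and column names are just placeholders, as is the
batch size - adjust to suit whatever schema and database you end up with.

<?php
// Untested sketch: stream the 50GB dump line by line and insert one row
// per record. Placeholder DSN, table and column names throughout.
$fh = fopen('dump.txt', 'r');
if ($fh === false) {
    die("Unable to open file\n");
}

$db = new PDO('pgsql:host=localhost;dbname=stuff', 'user', 'pass');
$stmt = $db->prepare('INSERT INTO people (name, details) VALUES (?, ?)');

$db->beginTransaction();
$record = array('name' => null, 'details' => '');
$count = 0;

while (($line = fgets($fh)) !== false) {
    $line = rtrim($line, "\r\n");

    if (strpos($line, '---') === 0) {
        // Separator line - flush the record we have been building up.
        if ($record['name'] !== null) {
            $stmt->execute(array($record['name'], $record['details']));
            if (++$count % 10000 == 0) {
                // Commit in batches rather than once per row.
                $db->commit();
                $db->beginTransaction();
            }
        }
        $record = array('name' => null, 'details' => '');
    } elseif (strpos($line, 'Name: ') === 0) {
        $record['name'] = substr($line, 6);
    } elseif ($line !== '') {
        $record['details'] .= ($record['details'] === '' ? '' : "\n") . $line;
    }
}

// The last record may not be followed by a separator.
if ($record['name'] !== null) {
    $stmt->execute(array($record['name'], $record['details']));
}

$db->commit();
fclose($fh);

Run the same script against a chunk you cut off with head to get your
timing estimate before letting it loose on the full file.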