Re: Read Through PHP Files

 You cannot just open those files. What you see is not 'rubbish':
 those files are in a binary format. You need to understand the .doc
 format and the .pdf format. You can find this information by searching
 Google for 'binary Word format' and so on. Then you have to parse the
 bytes of the file (the hex codes) according to that format. This is
 pretty complex and I'm sure you don't want to do that :D. Maybe there is
 already a library for PHP that does it for you.
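
 To give an idea of what "binary format" means in practice, here is a
 minimal sketch that only looks at the first bytes of a file to guess
 what kind of container it is. The function names detect_format() and
 detect_file_format() are made up for this example, and the code assumes
 PHP 7 or later. It relies on two signatures: PDF files start with the
 ASCII marker "%PDF-", and legacy Word .doc files are OLE2 compound
 files that start with the bytes D0 CF 11 E0 A1 B1 1A E1. It does not
 extract any text; that still needs a real parser or library.

```php
<?php
// Hypothetical helper: guess the container format from the leading bytes.
function detect_format(string $bytes): string
{
    // PDF files begin with the ASCII marker "%PDF-".
    if (strncmp($bytes, "%PDF-", 5) === 0) {
        return 'pdf';
    }
    // Legacy .doc files are OLE2 compound files with this 8-byte signature.
    if (strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole2';
    }
    return 'unknown';
}

// Hypothetical wrapper: read the first 8 bytes of a file in binary mode.
function detect_file_format(string $path): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return 'unreadable';
    }
    $head = fread($handle, 8);
    fclose($handle);
    return detect_format($head === false ? '' : $head);
}
```

 Calling detect_file_format() on an uploaded file before running
 substr_count() on it would at least tell you whether the contents are
 plain text or one of these binary containers.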

 But in general, you have to think in a different way. If you don't
 understand what binary formats are and how to parse them, it's pretty
 hard, and it's better not to try :)

 On Friday 10 November 2006 11:55, Kevin wrote:
 > Hi,
 >
 > I am using the function fopen to open a word document, loading the
 > contents into a variable and then using a substr_count to count the
 > number of times a certain string is found, this is allowing me to search
 > through the file and say how many times the word appears, I can even use
 > str_replace to highlight certain words. However Microsoft word seems to
 > put a lot of rubbish in the header and footer, I am wondering is it
 > possible to filter this rubbish out to get the exact document.
 >
 > I also tried using fopen to open a PDF file, but as PDF is handled
 > differently it came up completely different with no words at all, just
 > full of rubbish. Is there anyway I can get this information using a
 > simple fopen?
 >
 > I am basically trying to create a search engine which can read within
 > files similar to google. The only problem I would have after I have done
 > all this is actually weighting the search results, however I would
 > probably have to create the results first and then finally go through
 > the results to try to weight them.
 >
 > Does anyone else have any experience in this or could help me out with
 > any of the problems I am having?
 >
 > Thanks
 >
 > Kevin

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

