On Thu, Dec 4, 2008 at 10:35 PM, Jim Lucas <lists@xxxxxxxxx> wrote:
> Shawn McKenzie wrote:
>> Jim Lucas wrote:
>>> Boyd, Todd M. wrote:
>>>>> -----Original Message-----
>>>>> From: Jagdeep Singh [mailto:jagsaini1982@xxxxxxxxx]
>>>>> Sent: Thursday, December 04, 2008 8:39 AM
>>>>> To: php-general@xxxxxxxxxxxxx
>>>>> Subject: How to fetch .DOC or .DOCX file in php
>>>>> Importance: Low
>>>>>
>>>>> Hi!
>>>>>
>>>>> I want to fetch text from a .doc / .docx file and save it into a database.
>>>>> But when I tried to fetch the text with fopen/fgets etc., it gave me
>>>>> special characters along with the text.
>>>>>
>>>>> (With .txt files everything is fine.)
>>>>> The only problem is with doc/docx files.
>>>>> I don't know how to remove the "SPECIAL CHARACTERS" from this text ...
>>>> A.) This has been handled on this list several times. Please search the
>>>> archives before posting a question.
>>>> B.) Did you even TRY to Google for this? In the first 5 matches for "php
>>>> open ms word" I found this:
>>>>
>>>> http://www.developertutorials.com/blog/php/extracting-text-from-word-documents-via-php-and-com-81/
>>>>
>>>> You will need an MS Windows machine for this solution to work. If you're
>>>> using *nix... well... good luck.
>>>>
>>>>
>>>> // Todd
>>>>
>>> Ah, not true about the MS requirement. If all you want is the clear/clean
>>> text (without any formatting), then I can do it with php on any platform.
>>>
>>> If this is what is needed, here is the code to do it.
>>>
>>> <?php
>>>
>>> $filename = './12345.doc';
>>> if ( file_exists($filename) ) {
>>>
>>>     # 'rb' keeps the read binary-safe on every platform
>>>     if ( ($fh = fopen($filename, 'rb')) !== false ) {
>>>
>>>         $headers = fread($fh, 0xA00);
>>>
>>>         # 1 = (ord(n)*1) ; Document has from 0 to 255 characters
>>>         $n1 = ( ord($headers[0x21C]) - 1 );
>>>
>>>         # 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
>>>         $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );
>>>
>>>         # 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
>>>         $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );
>>>
>>>         # (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
>>>         $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );
>>>
>>>         # Total length of text in the document
>>>         $textLength = ($n1 + $n2 + $n3 + $n4);
>>>
>>>         $extracted_plaintext = fread($fh, $textLength);
>>>         fclose($fh);
>>>
>>>         # if you want the plain text with no formatting, do this
>>>         echo $extracted_plaintext;
>>>
>>>         # if you want to see your paragraphs in a web page, do this
>>>         echo nl2br($extracted_plaintext);
>>>
>>>     }
>>>
>>> }
>>>
>>> ?>
>>>
>>> Hope this helps.
>>>
>>> I am working on a set of php classes that will be able to read the text
>>> with the formatting included and convert it to a standard document format.
>>> The standard format that it will end up in has yet
>>>
>> "has yet"... what?
>>
>> Are you O.K. Jim? Did you die while writing this?
>>
>
> Sorry, still kickin'
>
> I was going to say that I haven't yet decided on what the final output
> format is going to be. Probably either rtf or OpenXML.
>
> How about I ask for suggestions on what would be the best format to store
> the final copy.
>
> I figured that this tool would mainly be used for .doc to web conversion,
> but I guess it could also be used to convert to other document formats.
>
> But, I would like to have the ability to at least store the formatting
> inline with the text. So, some form of xml: be it (x)HTML, plain XML,
> or even OpenXML.
>
> A question to all then.
> How would you like to see the text, with formatting, stored?
>
> All suggestions welcome!
>
> --
> Jim Lucas
>
>    "Some men are born to greatness, some achieve greatness,
>        and some have greatness thrust upon them."
>
> Twelfth Night, Act II, Scene V
>    by William Shakespeare
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>

Is there a way to make it so that additional output renderers could be
created? I'd lean towards XML though, since that can be parsed fairly easily.
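
To make the renderer idea concrete, here is a rough sketch of how pluggable output renderers might look. Everything in it is hypothetical: the names Renderer, XmlRenderer, HtmlRenderer, splitParagraphs() and convert() are made up for illustration and are not taken from Jim's classes. It only handles plain paragraphs, and it assumes the extracted text is UTF-8 and uses \r, \n or \r\n as paragraph breaks.

```php
<?php
// Hypothetical sketch only: none of these names come from Jim's classes.

interface Renderer
{
    /** @param string[] $paragraphs plain-text paragraphs, one per element */
    public function render(array $paragraphs): string;
}

// Emits a minimal, made-up XML vocabulary: <document><p>...</p></document>
final class XmlRenderer implements Renderer
{
    public function render(array $paragraphs): string
    {
        $out = "<document>\n";
        foreach ($paragraphs as $p) {
            $out .= '  <p>' . htmlspecialchars($p, ENT_QUOTES | ENT_XML1, 'UTF-8') . "</p>\n";
        }
        return $out . "</document>\n";
    }
}

// Emits one HTML <p> element per paragraph
final class HtmlRenderer implements Renderer
{
    public function render(array $paragraphs): string
    {
        $out = '';
        foreach ($paragraphs as $p) {
            $out .= '<p>' . htmlspecialchars($p, ENT_QUOTES, 'UTF-8') . "</p>\n";
        }
        return $out;
    }
}

// Split extracted text into trimmed, non-empty paragraphs.
// Assumption: \r, \n and \r\n all count as paragraph breaks.
function splitParagraphs(string $text): array
{
    $parts = preg_split('/\r\n|\r|\n/', $text);
    return array_values(array_filter(array_map('trim', $parts), 'strlen'));
}

// The extraction step stays the same; only the renderer is swapped.
function convert(string $text, Renderer $renderer): string
{
    return $renderer->render(splitParagraphs($text));
}

echo convert("First paragraph.\rSecond & last.", new XmlRenderer());
```

Adding another output format would then mean writing one more class that implements Renderer, without touching the code that reads the .doc file. Inline formatting (bold, fonts, and so on) would need a richer structure than an array of strings, which this sketch does not attempt.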