
Re: Replacing Apache Solr with Postgre Full Text Search?


 



On 26.03.20 17:05, J2eeInside J2eeInside wrote:
>> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, are
>> there any constraints in Full Text Search regarding those file types?
>
> - Can you recommend those tools you mention above/any useful resource on how to do that?


For PDFs, I know of at least two tools that can extract text. Try Ghostscript:

    gs -sDEVICE=txtwrite -o output.txt input.pdf


or a tool called 'pdftotext':

    pdftotext [options] [PDF-file [text-file]]

The two give slightly different results, mainly in the indentation and layout of the generated plain text and in how they handle tabular layouts.
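
If you want to run both tools on the same file and compare the output, a small wrapper along these lines might help. This is only a sketch: the function names `build_command` and `extract_text` and the output file naming are my own assumptions, the `-q` flag for gs and the `-layout` flag for pdftotext are optional additions to the commands above, and both tools are assumed to be on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(tool: str, pdf: Path, txt: Path) -> list[str]:
    """Return the argument list for one of the two extractors (hypothetical helper)."""
    if tool == "gs":
        # Same invocation as above; -q only silences Ghostscript's startup banner.
        return ["gs", "-q", "-sDEVICE=txtwrite", "-o", str(txt), str(pdf)]
    if tool == "pdftotext":
        # -layout asks pdftotext to keep the physical page layout (helps with tables).
        return ["pdftotext", "-layout", str(pdf), str(txt)]
    raise ValueError(f"unknown tool: {tool}")


def extract_text(pdf: Path, tool: str = "pdftotext") -> str:
    """Run the chosen tool on one PDF and return the extracted plain text."""
    if shutil.which(tool) is None:
        raise RuntimeError(f"{tool} is not installed or not on PATH")
    # e.g. report.pdf -> report.gs.txt or report.pdftotext.txt
    txt = pdf.with_suffix(f".{tool}.txt")
    subprocess.run(build_command(tool, pdf, txt), check=True)
    return txt.read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text(Path("input.pdf"), "gs")` and `extract_text(Path("input.pdf"), "pdftotext")` leaves two text files next to the PDF that you can diff to see which layout suits your documents better.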

Note that PDF is a container format that can embed virtually anything: text, images, Flash videos, ... You'll get good results if the PDF contains an actual text layer. If you're dealing with embedded images such as scanned documents, you'll probably need an OCR pass with a tool like 'tesseract' to extract the recognized text.

You'll need similar tools to extract the text from DOC and HTML files, since you're only interested in their plain text representation, not the metadata and markup. Finding converters from HTML/DOC to plain text shouldn't be too hard. You could also look for a commercial document conversion vendor, or convert both HTML and DOC to PDF so that you only have to deal with PDF-to-text extraction in the end.
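
For HTML you may not even need an external tool. As a rough sketch, the Python standard library's `html.parser` can strip the markup and keep the visible text. The class name `TextExtractor` and the function `html_to_text` are names I made up for illustration, and a real-world converter would likely need more care with encodings and block-level spacing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML document with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so text from adjacent elements does not run
    # together, then collapse all runs of whitespace to single spaces.
    return " ".join(" ".join(parser.parts).split())
```

The result is a single whitespace-normalized string, which is usually all a full text index needs. The trade-off of joining with spaces is that a word split by inline markup (say, one bold letter) ends up as two tokens.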

Good luck!

Artjom


--
Artjom Simon




