Re: Replacing Apache Solr with PostgreSQL Full Text Search?

Artjom Simon <artjom.simon@xxxxxxxxx> · Thu, 26 Mar 2020 19:19:42 +0100

On 26.03.20 17:05, J2eeInside J2eeInside wrote:
>> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, are
>> there any constraints in Full Text Search regarding those file types?
>
> - Can you recommend those tools you mention above/any useful resource
> on how to do that?

For PDFs, I know of at least two tools that can extract text. Try 
Ghostscript:

    gs -sDEVICE=txtwrite -o output.txt input.pdf

or a tool called 'pdftotext':

    pdftotext [options] [PDF-file [text-file]]

Both give slightly different results, mainly in terms of indentation and 
layout of the generated plain text, and how they deal with tabular layouts.
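
If you want to drive either tool from a script, a minimal sketch in Python
could look like the following. The function names are made up for
illustration, and the '-layout' switch for pdftotext is my own choice here
(it asks pdftotext to keep the physical page layout), so drop it if plain
reading order works better for your documents.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(tool, pdf_path, txt_path):
    """Return the argument list for one of the two extractors above."""
    if tool == "gs":
        # Same invocation as the Ghostscript command line shown above.
        return ["gs", "-sDEVICE=txtwrite", "-o", txt_path, pdf_path]
    if tool == "pdftotext":
        # -layout is an assumption: it tries to preserve the page layout.
        return ["pdftotext", "-layout", pdf_path, txt_path]
    raise ValueError(f"unknown tool: {tool!r}")


def extract_pdf_text(pdf_path, txt_path, tool="pdftotext"):
    """Run the chosen extractor and return the generated plain text."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    subprocess.run(build_command(tool, pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Running both tools on a few representative files and comparing the output
is probably the quickest way to see which one suits your documents.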

Note that PDF is a container format that can embed virtually anything:
text, images, Flash videos, ...
You'll get good results if the PDF contains actual text. If you're
dealing with embedded images such as scanned documents, you'll probably
need an OCR pass with a tool like 'tesseract' to recover the text.
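
One rough way to decide when the OCR pass is needed is to look at how much
text the normal extraction produced. The sketch below is only a heuristic:
the names and the 20-character threshold are arbitrary assumptions. Note
that tesseract works on images, so a scanned PDF has to be rendered to page
images first (for example with 'pdftoppm') before ocr_image() can be used.

```python
import subprocess


def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: near-empty extraction output suggests a scanned document."""
    meaningful = sum(1 for ch in extracted_text if ch.isalnum())
    return meaningful < min_chars


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        # "stdout" as the output base makes tesseract print the text.
        ["tesseract", image_path, "stdout", "-l", lang],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```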

You'll need similar tools to extract the text from DOC and HTML files,
since you're only interested in their plain-text representation, not the
metadata and markup.
Finding converters from HTML/DOC to plain text shouldn't be too hard. 
You could also try to find a commercial document conversion vendor, or 
try to convert HTML and DOC both to PDF so you'll only have to deal with 
PDF-to-text extraction in the end.
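
For HTML and .docx you may not even need an external converter. Here is a
minimal sketch that uses only the Python standard library; the function
names are hypothetical, and it ignores headers, footnotes, tables and
similar details that a real converter would handle. The legacy binary .doc
format is not covered and still needs a dedicated tool.

```python
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# XML namespace used by WordprocessingML elements inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with spaces and whitespace is collapsed, which
        # is fine for indexing but loses the original layout.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML document as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(p for p in paragraphs if p)
```

The resulting strings are what you would then store and index on the
database side.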

Good luck!

Artjom

--
Artjom Simon