On 26.03.20 17:05, J2eeInside J2eeInside wrote:
>> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, are
>> there any constraints in Full Text search regarding those file types?
>
> - Can you recommend those tools you mention above/any useful resource
> on how to do that?

For PDFs, I know of at least two tools that can extract text. Try
Ghostscript:

    gs -sDEVICE=txtwrite -o output.txt input.pdf

or a tool called 'pdftotext':

    pdftotext [options] [PDF-file [text-file]]

The two give slightly different results, mainly in the indentation and
layout of the generated plain text and in how they handle tabular layouts.
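
If you want to drive these tools from a script, here is a minimal Python
sketch. The function names are my own invention, and I'm assuming both
binaries are on your PATH. The Ghostscript arguments are the ones shown
above; for pdftotext I'm using '-layout' and '-' (write to stdout), so
check your local man page before relying on it.

```python
import subprocess


def pdftotext_cmd(pdf_path, keep_layout=True):
    """Build a pdftotext command that writes the text to stdout ('-')."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    cmd += [str(pdf_path), "-"]
    return cmd


def ghostscript_cmd(pdf_path, txt_path):
    """Build the Ghostscript txtwrite command quoted in the message."""
    return ["gs", "-sDEVICE=txtwrite", "-o", str(txt_path), str(pdf_path)]


def extract_with_pdftotext(pdf_path):
    """Run pdftotext and return the extracted text; raises if the tool fails."""
    result = subprocess.run(pdftotext_cmd(pdf_path), capture_output=True, check=True)
    # Decode leniently: odd bytes should not abort an indexing run.
    return result.stdout.decode("utf-8", errors="replace")
```

Running both tools on a handful of your own documents and comparing the
output by eye is probably the quickest way to pick one.
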
Note that PDF is a container format that can embed virtually anything:
text, images, Flash videos, ...
You'll get good results if the PDF content is actual text. If you're
dealing with embedded images such as scanned documents, you'll probably
need an OCR pass with a tool like 'tesseract' to recover the text.
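
A rough sketch of such an OCR fallback in Python, under a few
assumptions of mine: 'pdftoppm' (from the same family of tools as
pdftotext) is available to render pages as PNG images, 'tesseract' is
installed with the language data you need, and "almost no extractable
text" is a good enough hint that a document is a scan. The threshold
and function names are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def looks_scanned(extracted_text, min_chars=20):
    """Heuristic only: very little non-whitespace text suggests scanned pages."""
    return len("".join(extracted_text.split())) < min_chars


def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """Render each page to PNG with pdftoppm, then OCR every image with tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), prefix],
            check=True,
        )
        # pdftoppm numbers its output files, so sorting restores page order.
        for image in sorted(Path(tmp).glob("page*.png")):
            result = subprocess.run(
                ["tesseract", str(image), "stdout", "-l", lang],
                capture_output=True,
                check=True,
            )
            pages.append(result.stdout.decode("utf-8", errors="replace"))
    return "\n".join(pages)
```

The idea would be to run the normal text extraction first and only call
ocr_pdf() when looks_scanned() says the result is empty, since OCR is
much slower and less accurate than reading real text.
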
You'll need similar tools to extract the text from DOC and HTML files,
since you're only interested in their plain text representation, not the
metadata and markup.
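
For HTML and .docx you may not even need an external tool. Below is a
minimal sketch using only the Python standard library; the helper names
are mine. It relies on the fact that a .docx file is a ZIP archive
whose main text lives in word/document.xml. It ignores headers,
footnotes and the like, and it does not handle the old binary .doc
format at all, so treat it as a starting point rather than a complete
converter.

```python
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# XML namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Joining with a space separates adjacent elements; fine for indexing.
    return " ".join(" ".join(parser.parts).split())


def docx_to_text(path_or_file):
    """Return one line per paragraph (w:p) of the main document part."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

For example, html_to_text("<h1>Hello</h1><p>a &amp; b</p>") gives
"Hello a & b".
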
Finding converters from HTML/DOC to plain text shouldn't be too hard.
You could also try to find a commercial document conversion vendor, or
try to convert HTML and DOC both to PDF so you'll only have to deal with
PDF-to-text extraction in the end.
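
One way to do the "convert everything to PDF first" route, as a sketch:
I'm assuming LibreOffice is installed and its 'soffice' binary is on
your PATH (that tool choice is mine, not something you have to use).
The wrapper names are hypothetical.

```python
import subprocess
from pathlib import Path


def to_pdf_cmd(src_path, out_dir):
    """Build a headless LibreOffice command that converts one file to PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src_path),
    ]


def convert_to_pdf(src_path, out_dir):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(to_pdf_cmd(src_path, out_dir), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(src_path).stem + ".pdf")
```

The resulting PDFs can then go through whichever PDF-to-text tool you
settled on, so you only maintain one extraction path.
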
Good luck!
Artjom
--
Artjom Simon