Way back when, I used pdftotext and various other document-to-text converters to copy the text into a second text file, which I then indexed with Sphinx search. Nowadays you can also index quite well with PostgreSQL, or maybe even MySQL.
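Roughly, the conversion step looked like the sketch below. It assumes the pdftotext command from poppler-utils is installed and on the PATH. The function names and the output directory layout are made up for illustration, not taken from my old setup. The -layout option asks pdftotext to keep the physical page layout, which may help with the line-break concern.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    # -layout keeps the original physical layout of the page as far as possible.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def extract_pdf_text(pdf_path, out_dir):
    """Write the text of pdf_path to a .txt file in out_dir and return its path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (install poppler-utils)")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / (Path(pdf_path).stem + ".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

The resulting .txt files are what the indexer (Sphinx, PostgreSQL full-text search, and so on) would then read instead of the original documents.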
Good luck!
On Sat, 15 Dec 2018 17:20 Jeffry Killen <jekillen@xxxxxxxxxxx> wrote:
Hello;
Can anyone point me to instruction/advice about
opening and reading files that are not plain text:
word processing docs, pdf, ps, image files,
even compiled code.
I am writing a search function to search file systems
and don't know a lot about the formatting of non-plain-text
files.
The immediate concern is line breaks in word
processing docs, pdf and ps files.
Then detecting compiled code files so I can
leave them alone. This type of file might not
have a suffix to consider.
Then the various image files that might be
encountered.
Even suffixes aren't a guarantee of the content.
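One possible way to handle files whose suffix is missing or misleading is to look at the first few bytes instead. The sketch below is only an illustration: the function names and the short signature table are my own choices, the table is far from complete, and the NUL-byte fallback is a heuristic (for example, it will report UTF-16 text as binary, and a text file that happens to start with "MZ" would be misreported).

```python
# A few well-known leading byte signatures ("magic numbers").
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "postscript"),
    (b"\x7fELF", "elf"),             # compiled code on Linux and many Unixes
    (b"MZ", "pe"),                   # DOS/Windows executables
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),          # also the container for .docx and .odt
]


def classify_bytes(head):
    """Guess a file kind from its leading bytes."""
    for magic, kind in MAGIC:
        if head.startswith(magic):
            return kind
    # Heuristic: plain text in common 8-bit encodings does not contain NUL bytes.
    if b"\x00" in head:
        return "binary"
    return "text"


def classify_file(path, sniff_len=4096):
    """Read the first sniff_len bytes of path and classify them."""
    with open(path, "rb") as f:
        return classify_bytes(f.read(sniff_len))
```

A search function could call classify_file on each path and skip anything reported as "elf", "pe", or "binary", while handing "pdf" and "postscript" files to a converter first.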
Thanks
Jeff K.