Re: searching non plain text files

Sam Hobbs <Sam@xxxxxxxxxxxxxxxxxx> · Sat, 15 Dec 2018 22:31:44 -0800

There is no hammer that will crack them all. The best solution is to
recognize what type of file it is and do whatever processing is
appropriate for that file. Many file formats are difficult to deal with.
The original Microsoft Word file format was never totally documented, I
think, because it was too complicated. You can probably find utilities
for each file format, but even then you must deal with each of the many
types separately.
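
As a rough illustration of recognizing the file type first, here is a
minimal sketch that checks the leading bytes of a file against a few
well-known signatures. It is written in Python for brevity; the same
idea carries over to PHP by reading the first bytes of the file. The
table and the names sniff_type and sniff_file are made up for this
example, and the list is far from complete.

```python
# Hypothetical signature table: (leading bytes, label). Not exhaustive.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "postscript"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    # ZIP container: used by .docx, .xlsx, .odt as well as plain .zip
    (b"PK\x03\x04", "zip-container"),
    # OLE2 compound file: used by legacy .doc, .xls, .ppt
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),
]


def sniff_type(head: bytes) -> str:
    """Return the label of the first matching signature, else 'unknown'."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_type(fh.read(16))
```

A search function could call sniff_file() on each file and then hand it
to whatever extractor suits that label, ignoring the file suffix
entirely. Note that a container label only narrows things down: a ZIP
signature alone does not tell you whether the file is a .docx or an
ordinary archive.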

Also note that many utilities were not designed for use in an online 
environment. Since the subject is PHP I assume you need to do this 
online.

There are utilities that will search files for ASCII text, but they are
less likely to work with Unicode text. If it is just ASCII then you
could search for sequences of bytes in the range of characters that are
normally printed (32 to 126, decimal). It is a very inaccurate
algorithm. Note that a PDF file can be written entirely in printable
ASCII, with binary data stored in a non-binary encoding such as ASCIIHex
or ASCII85, but that is not required, and in practice most PDF files
contain compressed binary streams.
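
The byte-sequence search described above can be sketched like this, in
Python for brevity. The function name printable_runs and the default
minimum length of 4 are assumptions for the example, not anything
standard; it only finds plain ASCII.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes (32..126)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


# Example: two readable runs separated by binary bytes; "ab" is too short.
sample = b"\x00\x01Hello, world\x00ab\x02search me\xff"
print(printable_runs(sample))
```

This shows why the approach is inaccurate: text stored as UTF-16 has a
null byte between the letters, so it is never seen as a run, and
compressed data produces short runs of printable bytes that are not
text at all.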

You need to do some studying. For example, the Portable Executable
format used for most Windows executables has a "Magic Number" (the
characters "MZ") at the beginning and, at offset 0x3C, a pointer to a PE
signature ("PE\0\0", the letters "P" and "E" followed by two null
bytes). For Unix/Linux systems see COFF and ELF. That is just for
executables; you need to study the many other formats too.
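
For detecting compiled files so they can be skipped, a minimal sketch
along those lines follows, in Python for brevity. It relies on the PE
layout (a 4-byte little-endian offset stored at 0x3C that points to the
"PE\0\0" signature) and on the 4-byte ELF magic, which is 0x7F followed
by "ELF". The function name looks_like_executable and its return labels
are invented for the example, and other executable formats are not
covered.

```python
import struct


def looks_like_executable(data: bytes):
    """Return 'elf', 'pe', 'mz', or None for the given leading bytes."""
    if data.startswith(b"\x7fELF"):
        return "elf"
    if data.startswith(b"MZ") and len(data) >= 0x40:
        # The 32-bit little-endian value at 0x3C is the offset of the
        # PE signature within the file.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "pe"
        return "mz"  # has the MZ magic but no PE signature where expected
    return None
```

The PE signature can sit well past the first few dozen bytes, so the
caller should pass in a reasonably large chunk of the file (a few
kilobytes is usually plenty) rather than only the first 64 bytes.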

If you can be more specific then perhaps someone can provide a more 
specific answer. For example, if there are requirements that limit what 
needs to be searched for then it might be possible to be more specific.

Jeffry Killen, Friday, December 14, 2018 8:19 PM

Hello;

Can anyone point me to instruction/advice about opening and reading
files that are not plain text: word processing docs, pdf, ps, image
files, even compiled code.

I am writing a search function to search file systems and don't know a
lot about the formatting of non plain text files.

The immediate concern is line breaks in word processing docs, pdf and
ps files.

Then detecting compiled code files so I can leave them alone. This type
of file might not have a suffix to consider.

Then the various image files that might be encountered.

Even suffixes aren't a guarantee of the content.

Thanks

Jeff K.