There is no hammer that will crack them all. The best solution is to recognize what type of file it is and do whatever processing is appropriate for that type. Many file formats are difficult to deal with. The original Microsoft Word file format was never totally documented, I think, because it was too complicated. You can probably find utilities for each file format, but even then you must deal with each of the many types separately. Also note that many utilities were not designed for use in an online environment. Since the subject is PHP, I assume you need to do this online.

There are utilities that will search files for ASCII text. Even those are less likely to work with Unicode text. If it is just ASCII, then you could search for sequences of bytes that fall in the range normally used for printable characters. It is a very inaccurate algorithm, because binary data will contain such sequences by chance.

Note that a PDF file can be written with no binary data at all, with anything binary stored in a text encoding such as ASCII hex or ASCII85. In practice, though, most PDF files contain compressed binary streams, so you cannot count on a PDF being plain text.

You need to do some studying. For example, the Portable Executable format used for most Windows executables has a "Magic Number" (the characters "MZ") at the beginning and, at offset 0x3C, a pointer to a PE signature ("PE\0\0", the letters "P" and "E" followed by two null bytes). For Unix/Linux systems see COFF and ELF. That is just for executables; you need to study the many other formats too.

If you can be more specific, then perhaps someone can provide a more specific answer. For example, if there are requirements that limit what needs to be searched for, then a narrower solution might be possible.
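
To make the two ideas above a bit more concrete (checking magic numbers and scanning for runs of printable ASCII), here is a rough sketch. It is in Python only for brevity; the same logic should port to PHP with fopen/fread, unpack and preg_match_all. The function names sniff_format and ascii_runs and the minimum run length of 4 are my own choices for illustration, and only three formats are covered, so treat it as a starting point and not as a complete detector.

```python
import re
import struct

# Runs of at least 4 printable ASCII bytes (space through tilde).
# The minimum length of 4 is an arbitrary assumption; tune as needed.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def sniff_format(data: bytes) -> str:
    """Guess a file type from its leading bytes (only a few formats)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x7fELF"):
        return "elf"
    if data.startswith(b"MZ") and len(data) >= 0x40:
        # The 4-byte little-endian value at offset 0x3C points
        # to the PE signature.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "pe"
        return "mz"  # DOS-style header without a PE signature
    return "unknown"


def ascii_runs(data: bytes) -> list:
    """Return every run of printable ASCII found in the data."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


if __name__ == "__main__":
    # A tiny fake PE file: "MZ", padding up to 0x3C, a pointer to 0x40,
    # the PE signature, then some bytes with embedded text.
    sample = (b"MZ" + b"\x00" * 58 + struct.pack("<I", 0x40)
              + b"PE\x00\x00" + b"\x01\x02hello world\x00\xff")
    print(sniff_format(sample))  # pe
    print(ascii_runs(sample))    # ['hello world']
```

In a real application you would read only the first few kilobytes for the format check and pass the result to a format-specific extractor; the ASCII scan is a fallback for types you do not recognize, and it will still return junk for compressed or encrypted content.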