Search Engine in PHP.

Code for search engine in PHP (in case someone needs it).

The files in this email are:

* create_index_directories.php
* create_index_or_add_to_existing_index.php
* search_index.php
* ReadMe.txt

Brief description: This search engine has a new architecture compared
to other search engines. I invented and implemented this new search
engine architecture. It has been developed mainly for the English
alphabet and is based on the fact that no letter of the English
alphabet has more than 30,000 words starting with it. It works on
text/html files only. It was mainly developed so that it could be used
on websites: a website can index all of its pages through this search
engine and give its users a search box for searching anything on the
site. Websites then do not have to rely on third-party search engines.

All code in this email has been released under APACHE LICENSE, VERSION
2.0. The license details can be found here:
https://www.apache.org/licenses/LICENSE-2.0

---------------------------------------
create_index_directories.php
---------------------------------------

<?php

/* This program creates index directories for storing index files.
 * Required argument: Path to directory where the top level index
 * directory and its subdirectories will be created. The top level index
 * directory will be named index_directory.
 */

$num_directories_given = 0;
$index_dir = "";

function print_usage()
{
    echo ("Usage:\n\n" .
          "  Syntax:\n\n" .
          "    create_index_directories [OPTIONS] [dir_path]\n\n" .
          "  Description:\n\n" .
          "    create_index_directories creates index directories for storing index files.\n" .
          "    \"dir_path\" is the path to directory where the top level index directory\n" .
          "    and sub directories will be created. The top level index directory will\n" .
          "    be named index_directory.\n\n" .
          "  Options:\n\n" .
          "    --help\n" .
          "      Print this usage/help and exit.\n");
} // end of print_usage

for ($i = 1; $i < $argc; $i++) {
    //echo "Option " . $i . ": " . $argv[$i] . "\n";
    $arg = $argv[$i];
    if ($arg[0] === '-') {
        if ($arg === "--help") {
            print_usage();
            exit(0);
        } else {
            echo "create_index_directories: Unknown option: " . $arg . "\n";
            echo "Try create_index_directories --help to see the help.\n";
            exit(1);
        }
    } else {
        $index_dir = $arg;
        $num_directories_given++;
    }
} // end of for loop

if ($num_directories_given == 0) {
    echo "create_index_directories: One directory argument is required.\n";
    echo "Try create_index_directories --help to see the help.\n";
    exit(1);
} else if ($num_directories_given > 1) {
    echo "create_index_directories: Only one directory argument is allowed.\n";
    echo "Try create_index_directories --help to see the help.\n";
    exit(1);
}

if (is_dir($index_dir) != TRUE) {
    echo "create_index_directories: \"" . $index_dir . "\" is not a directory.\n";
    echo "Try create_index_directories --help to see the help.\n";
    exit(1);
}

$create_dir = $index_dir . "/index_directory";
if (file_exists($create_dir) != TRUE) {
    if (mkdir($create_dir) != TRUE) {
        echo "create_index_directories: Failed to create directory \"" . $create_dir . "\". Exiting...\n";
        exit(1);
    } else {
        echo "Created directory " . $create_dir . "\n";
    }
} else {
    echo $create_dir . " already exists.\n";
}

for ($i = 0; $i < 10; $i++) {
    $sub_dir = $create_dir . "/" . $i;
    if (file_exists($sub_dir) != TRUE) {
        if (mkdir($sub_dir) != TRUE) {
            echo "create_index_directories: Failed to create directory \"" . $sub_dir . "\". Exiting...\n";
            exit(1);
        } else {
            echo "Created directory " . $sub_dir . "\n";
        }
    } else {
        echo $sub_dir . " already exists.\n";
    }
}

foreach (range('a', 'z') as $letter) {
    $sub_dir = $create_dir . "/" . $letter;
    if (file_exists($sub_dir) != TRUE) {
        if (mkdir($sub_dir) != TRUE) {
            echo "create_index_directories: Failed to create directory \"" . $sub_dir . "\". Exiting...\n";
            exit(1);
        } else {
            echo "Created directory " . $sub_dir . "\n";
        }
    } else {
        echo $sub_dir . " already exists.\n";
    }
}

?>

-----------------------------------------------------------
create_index_or_add_to_existing_index.php
-----------------------------------------------------------

<?php

/* This program takes files/directories as arguments and parses the
 * files (present in directories or given on command line) to create the
 * search index files. The directories are processed recursively if -r option
 * is given. This program also requires the path to directory where
 * a directory called index_directory exists. This index_directory
 * contains 36 folders named 0, 1, 2, .., 9 and a, b, c, .., y, z.
 * Index files are created in subdirectories of index_directory.
 * This program works on text/html files only. You can use program
 * create_index_directories.php to create index_directory and its
 * subdirectories.
 */

// error handler function
function custom_error_handler($errno, $errstr, $errfile, $errline)
{
    //echo "Got error/notice/warning, etc. Exiting..\n";
    echo "Got error/notice/warning, etc.\n";
    echo $errno . "\n";
    echo $errstr . "\n";
    echo $errfile . "\n";
    echo $errline . "\n";
    //echo "Exit status is 1.\n";
    //exit(1);
} // end of custom_error_handler
// set to the user defined error handler
$old_error_handler = set_error_handler("custom_error_handler");

function print_usage()
{
    echo ("Usage:\n\n" .
          "  Syntax:\n\n" .
          "    create_index_or_add_to_existing_index OPTION[S] [FILE...] [DIR...]\n\n" .
          "  Description:\n\n" .
          "    create_index_or_add_to_existing_index parses a file and creates search index files\n" .
          "    or adds to already existing index files. It works on text/html files only.\n" .
          "    The file can be given as an argument or it may be present in a directory\n" .
          "    which itself has been given as an argument. This program also requires\n" .
          "    the path to directory where a directory called index_directory\n" .
          "    and its subdirectories (0-9, a-z) exist. You can use\n" .
          "    program create_index_directories.php to create index_directory\n" .
          "    and its subdirectories. The paths to file/dir to be indexed should be\n" .
          "    relative to server_root_directory_path (to be given by specifying -s option).\n\n" .
          "  Options:\n\n" .
          "     -i path_to_index_directory (MANDATORY option)\n" .
          "        Use -i option to specify the path to directory where directory\n" .
          "        called index_directory and its subdirectories (0-9, a-z) exist.\n" .
          "        Index files are created in subdirectories of index_directory.\n\n" .
          "     -r\n" .
          "        Specify -r option to process directory/directories recursively.\n\n" .
          "     -p prefix_path\n" .
          "        Please give a prefix to add before the file path that will be written to\n" .
          "        index files. It could be something like https://mywebsite.com. If the\n" .
          "        file path abcd/tyr.html is going to be written to index file then it\n" .
          "        will actually write https://mywebsite.com/abcd/tyr.html in the index\n" .
          "        file if -p option is present.\n\n" .
          "     -s server_root_directory_path (MANDATORY option)\n" .
          "        The \"absolute\" path to server root directory (from where index.html or index.php will be served).\n" .
          "        The paths to file/dir to be indexed should be relative to server_root_directory_path.\n\n" .
          "    --help\n" .
          "        Print this usage/help and exit.\n\n" .
          " So, basically the file to be indexed is found by combining server_root_directory_path\n" .
          " and path to files/directories given on command line while the file contents\n" .
          " to be written is formed by combining prefix and path to files/directories given\n" .
          " on command line.\n");
} // end of print_usage

$iOptionPresent = FALSE;
$rOptionPresent = FALSE;
$pOptionPresent = FALSE;
$sOptionPresent = FALSE;
$index_dir_parent = "";
$index_dir = "";
$prefix = "";
$server_root_path = "";
$file_dir_array = array();
$num_files_processed = 0;

for ($i = 1; $i < $argc; $i++) {
    echo "debug: Argument/Option " . $i . ": " . $argv[$i] . "\n";
    $arg = $argv[$i];
    if ($arg[0] === '-') {
        if ($arg === "--help") {
            print_usage();
            exit(0);
        } else if ($arg === "-r") {
            $rOptionPresent = TRUE;
        } else if ($arg === "-i") {
            $iOptionPresent = TRUE;
            if (($i+1) < $argc) {
                $index_dir_parent = $argv[$i+1];
                $index_dir = $index_dir_parent . "/" . "index_directory";
                $i++;
                continue;
            }
        } else if ($arg === "-p") {
            $pOptionPresent = TRUE;
            if (($i+1) < $argc) {
                $prefix = $argv[$i+1];
                if ((substr($prefix, -1, 1) != "/") &&
                    (substr($prefix, -1, 1) != "\\")) {
                    $prefix = $prefix . "/";
                }
                $i++;
                continue;
            }
        } else if ($arg === "-s") {
            $sOptionPresent = TRUE;
            if (($i+1) < $argc) {
                $server_root_path = $argv[$i+1];
                if ((substr($server_root_path, -1, 1) != "/") &&
                    (substr($server_root_path, -1, 1) != "\\")) {
                    $server_root_path = $server_root_path . "/";
                }
                $i++;
                continue;
            }
        } else {
            echo "create_index_or_add_to_existing_index: Unknown option: " . $arg . "\n";
            echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
            exit(1);
        }
    } else {
        array_push($file_dir_array, $arg);
    }
} // end of for loop

// debug info
echo "\nDEBUG_INFO_START:\n\n";
if ($rOptionPresent === TRUE) {
    echo "-r option is present.\n";
} else {
    echo "-r option is NOT present.\n";
}
if ($iOptionPresent === TRUE) {
    echo "-i option is present.\n";
    echo "index_dir_parent = " . $index_dir_parent . "\n";
} else {
    echo "-i option is NOT present.\n";
}
if ($pOptionPresent === TRUE) {
    echo "-p option is present.\n";
    echo "prefix = " . $prefix . "\n";
} else {
    echo "-p option is NOT present.\n";
}
if ($sOptionPresent === TRUE) {
    echo "-s option is present.\n";
    echo "server_root_path = " . $server_root_path . "\n";
} else {
    echo "-s option is NOT present.\n";
}
$num_entries = count($file_dir_array);
echo "Entries in file_dir_array are:\n";
for ($i = 0; $i < $num_entries; $i++){
    echo $file_dir_array[$i] . "\n";
}
echo "\nDEBUG_INFO_END\n\n";
// end debug info

if ($index_dir_parent == "") {
    echo "create_index_or_add_to_existing_index: Please give the path to directory where index_directory exists.\n";
    echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if ($server_root_path == "") {
    echo "create_index_or_add_to_existing_index: Please give the path to server root directory.\n";
    echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (file_exists($index_dir_parent) != TRUE) {
    echo "create_index_or_add_to_existing_index: \"" . $index_dir_parent . "\" does not exist.\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (is_dir($index_dir_parent) != TRUE) {
    echo "create_index_or_add_to_existing_index: \"" . $index_dir_parent . "\" is not a directory.\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (file_exists($index_dir) != TRUE) {
    echo "create_index_or_add_to_existing_index: \"index_directory\" does not exist in \"" . $index_dir_parent . "\".\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (is_dir($index_dir) != TRUE) {
    echo "create_index_or_add_to_existing_index: index_directory \"" . $index_dir . "\" is not a directory.\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (count($file_dir_array) < 1) {
    echo "create_index_or_add_to_existing_index: No files/directories given for indexing.\n";
    echo "Try create_index_or_add_to_existing_index --help to see the help.\n";
    exit(0);
}

// Check if all index directories exist
echo "create_index_or_add_to_existing_index: checking whether all index directories exist..\n";
for ($i = 0; $i < 10; $i++) {

    $sub_dir = $index_dir . "/" . $i;
    if (file_exists($sub_dir) != TRUE) {
        echo $sub_dir . " does not exist.\n";
        echo "Exiting..\n";
        exit(1);
    }
    if (is_dir($sub_dir) != TRUE) {
        echo $sub_dir . " is not a directory.\n";
        echo "Exiting..\n";
        exit(1);
    }

} // end of for loop

foreach (range('a', 'z') as $letter) {

    $sub_dir = $index_dir . "/" . $letter;
    if (file_exists($sub_dir) != TRUE) {
        echo $sub_dir . " does not exist.\n";
        echo "Exiting..\n";
        exit(1);
    }
    if (is_dir($sub_dir) != TRUE) {
        echo $sub_dir . " is not a directory.\n";
        echo "Exiting..\n";
        exit(1);
    }

} // end of foreach loop

echo "All index directories exist.\n\n";

echo "\n\n**** Starting Indexing.. ****\n\n";

$num_entries = count($file_dir_array);
for ($i = 0; $i < $num_entries; $i++) {

    $file_rl_path = $file_dir_array[$i];
    $file = $server_root_path . $file_rl_path;

    if (file_exists($file) != TRUE) {
        echo "\"" . $file . "\" does not exist.\n";
    } else if (is_file($file) == TRUE) {
        process_file($file, $file_rl_path);
    } else if (is_dir($file) == TRUE) {
        process_dir($file);
    } else {
        echo "\"" . $file . "\": No such file or directory.\n";
    }

} // end of for loop

function process_dir($dir) {

    //echo $dir . "\n";
    $files = scandir($dir);
    if ($files == FALSE) {
        return;
    }
    $num = count($files);
    for ($i = 0; $i < $num; $i++) {
        if (($files[$i] === ".") || ($files[$i] === "..")) {
            continue;
        }
        $file_entry = $dir . "/" . $files[$i];
        if (file_exists($file_entry) != TRUE) {
            echo "\"" . $file_entry . "\" does not exist.\n";
        } else if (is_file($file_entry) == TRUE) {
            $empty_string = "";
            $root_path = $GLOBALS['server_root_path'];
            $file_rl_path = str_replace($root_path, $empty_string, $file_entry);
            //echo "Old file_rl_path  = " . $file_entry . ", New file_rl_path  = " . $file_rl_path . "\n";
            process_file($file_entry, $file_rl_path);
        } else if (is_dir($file_entry) == TRUE) {
            if ($GLOBALS['rOptionPresent'] === TRUE) {
                process_dir($file_entry);
            } else {
                //echo $file_entry . "\n"; // remove this later // TODO
            }
        } else {
            echo "\"" . $file_entry . "\": No such file or directory.\n";
        }
    } // end of for loop

} // end of process_dir

function process_file($file, $file_rl_path) {

    //echo $file . "\n";
    $handle = fopen($file, "r");
    if ($handle == FALSE) {
        echo "Error: Failed to open file \"" . $file . "\"\n";
        return;
    }

    echo "\n\nIndexing file \"" . $file . "\"\n";

    // read file
    $line_num = 0;
    while (($line = fgets($handle)) != FALSE) {
        /*
        //echo $line;
        $line_num++;
        $len = strlen($line);
        echo "line number " . $line_num . " length = " . $len . "\n";
        */
        $pattern = "([0-9A-Za-z][0-9A-Za-z][0-9A-Za-z][0-9A-Za-z]*)";
        preg_match_all($pattern, $line, $matches, PREG_SET_ORDER);
        $match_count = count($matches);
        for ($j = 0; $j < $match_count; $j++) {
            $word = $matches[$j][0];
            //echo $word . "\n";
            $word_l = strtolower($word);
            //echo $word_l . "\n";
            process_word_l($word_l, $file, $file_rl_path);
        }
    }
    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail when reading file \"" . $file . "\"\n";
    }
    fclose($handle);

    echo "Indexing file \"" . $file . "\" completed.\n";
    $GLOBALS['num_files_processed'] = $GLOBALS['num_files_processed'] + 1;
    echo "Total files indexed = " . $GLOBALS['num_files_processed'] . "\n";

} // end of process_file

function process_word_l($word_l, $file, $file_rl_path) {

    $letter = substr($word_l, 0 , 1);
    $dir_to_check = $GLOBALS['index_dir'] . "/" . $letter;
    $file_to_check =  $dir_to_check . "/" . $word_l;
    $content_without_newline = $GLOBALS['prefix'] . $file_rl_path;
    $content = $content_without_newline . "\n";

    //create file if file does not exist
    if (file_exists($file_to_check) != TRUE) {
        //echo "\"" . $file_to_check . "\" does not exist. Creating it..\n";
        // file_put_contents returns the number of bytes written, or FALSE on failure
        if (file_put_contents($file_to_check, $content) === FALSE) {
            echo "Error: file_put_contents failed for file \"" . $file_to_check . "\"\n";
        }
        return;
    }

    //echo "debug: file_to_check = " . $file_to_check . "\n";

    $handle = fopen($file_to_check, "r+");
    if ($handle == FALSE) {
        echo "Error: Failed to open file \"" . $file_to_check . "\"\n";
        return;
    }

    // check if entry exists and if not then append at the end
    while (($line = fgets($handle)) !== FALSE) {
        if ($line === $content) {
            // Entry already exists; close the handle before returning.
            fclose($handle);
            return;
        }
    }
    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail when reading file \"" . $file_to_check . "\"\n";
    }
    fwrite($handle, $content);
    fclose($handle);

} // end of process_word_l

echo "\n\n**** Indexing complete.**** \n\n";

?>
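
Note on word extraction: the following is a minimal sketch, not part of the
three programs, of how the pattern used in process_file picks words out of a
line. It assumes the pattern is equivalent to "three or more characters from
0-9, A-Z, a-z"; the function name extract_words is made up for illustration.
It also shows that HTML markup is not stripped, so tag names of three or more
characters are indexed like any other word.

```php
<?php
// Hypothetical helper mirroring the matching done in process_file:
// keep runs of 3 or more alphanumeric characters and lowercase them.
function extract_words(string $line): array
{
    preg_match_all('/[0-9A-Za-z]{3,}/', $line, $matches);
    return array_map('strtolower', $matches[0]);
}

// "My" is shorter than three characters, so it is skipped.
// The tag name "title" is matched twice because markup is not removed.
print_r(extract_words("<title>My Web Server</title>"));
```

This prints an array holding "title", "web", "server" and "title", in that
order.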

------------------------
search_index.php
------------------------

<?php

/* This program searches for search words in index files. This program
 * requires the path to directory where a directory called
 * index_directory exists.
 * This index_directory contains 36 subdirectories named 0, 1, 2, .., 9
 * and a, b, c, .., y, z.
 * The index files are present in these subdirectories.
 */

function print_usage()
{
    echo ("Usage:\n\n" .
          "  Syntax:\n\n" .
          "    search_index OPTION[S] [search_word[s]...]\n\n" .
          "  Description:\n\n" .
          "    search_index searches for search_word[s] in index files. One or more\n" .
          "    search words can be specified. This program requires the path to directory\n" .
          "    where a directory called index_directory and its subdirectories (0-9, a-z)\n" .
          "    exist. The index files are present in these subdirectories.\n\n" .
          "  Options:\n\n" .
          "    -i path_to_index_directory (MANDATORY option)\n" .
          "        Use -i option to specify the path to directory where directory\n" .
          "        called index_directory exists.\n\n" .
          "    --help\n" .
          "        Print this usage/help and exit.\n");
} // end of print_usage

$iOptionPresent = FALSE;
$index_dir_parent = "";
$index_dir = "";
$search_keyword_array = array();
$search_results_array = array();

for ($i = 1; $i < $argc; $i++) {
    echo "debug: Argument/Option " . $i . ": " . $argv[$i] . "\n";
    $arg = $argv[$i];
    if ($arg[0] === '-') {
        if ($arg === "--help") {
            print_usage();
            exit(0);
        } else if ($arg === "-i") {
            $iOptionPresent = TRUE;
            if (($i+1) < $argc) {
                $index_dir_parent = $argv[$i+1];
                $index_dir = $index_dir_parent . "/" . "index_directory";
                $i++;
                continue;
            }
        } else {
            echo "search_index: Unknown option: " . $arg . "\n";
            echo "Try search_index --help to see the help.\n";
            exit(1);
        }
    } else {
        array_push($search_keyword_array, $arg);
    }
} // end of for loop

// debug info
echo "\nDEBUG_INFO_START:\n\n";
if ($iOptionPresent === TRUE) {
    echo "-i option is present.\n";
    echo "index_dir_parent = " . $index_dir_parent . "\n";
} else {
    echo "-i option is NOT present.\n";
}

$num_entries = count($search_keyword_array);
echo "Entries in search_keyword_array are:\n";
for ($i = 0; $i < $num_entries; $i++){
    echo $search_keyword_array[$i] . "\n";
}
echo "\nDEBUG_INFO_END\n\n";
// end debug info

if ($index_dir_parent == "") {
    echo "search_index: Please give the path to directory where index_directory exists.\n";
    echo "Try search_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (file_exists($index_dir_parent) != TRUE) {
    echo "search_index: \"" . $index_dir_parent . "\" does not exist.\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try search_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (is_dir($index_dir_parent) != TRUE) {
    echo "search_index: \"" . $index_dir_parent . "\" is not a directory.\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try search_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (file_exists($index_dir) != TRUE) {
    echo "search_index: \"index_directory\" does not exist in \"" . $index_dir_parent . "\".\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try search_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (is_dir($index_dir) != TRUE) {
    echo "search_index: index_directory \"" . $index_dir . "\" is not a directory.\n";
    echo "Please give a valid path to directory where index_directory exists.\n";
    echo "Try search_index --help to see the help.\n";
    echo "Exiting..\n";
    exit(1);
}

if (count($search_keyword_array) < 1) {
    echo "search_index: No search word given for searching.\n";
    echo "Try search_index --help to see the help.\n";
    exit(0);
}

$num_entries = count($search_keyword_array);
for ($i = 0; $i < $num_entries; $i++) {
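    $word = $search_keyword_array[$i];
    $word_l = strtolower($word);
    // Index files are named after lowercase alphanumeric words only; skip
    // anything else so a search word cannot point outside index_directory.
    if (preg_match('/^[0-9a-z]+$/', $word_l) !== 1) {
        continue;
    }
    $letter = substr($word_l, 0 , 1);
    $dir_to_check = $GLOBALS['index_dir'] . "/" . $letter;
    $file_to_check =  $dir_to_check . "/" . $word_l;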

    if (file_exists($file_to_check) != TRUE) {
        continue;
    }
    if (is_file($file_to_check) != TRUE) {
        continue;
    }
    $handle = fopen($file_to_check, "r");
    if ($handle == FALSE) {
        //echo "Error: Failed to open file \"" . $file_to_check . "\"\n";
        continue;
    }

    while (($line = fgets($handle)) !== FALSE) {
        // remove newline from line
        $line = str_replace(array("\n", "\r"), '', $line);
        // count how many of the search words each document path matches
        if (array_key_exists($line, $search_results_array) == FALSE) {
            $search_results_array[$line] = 1;
        } else {
            $search_results_array[$line]++;
        }
    }
    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail when reading file \"" . $file_to_check . "\"\n";
    }
    fclose($handle);
} // end of for loop

// dump search_results_array after sorting
arsort($search_results_array);
//var_dump($search_results_array);
$keys = array_keys($search_results_array);
$num_entries = count($keys);
for ($i = 0; $i < $num_entries; $i++) {
    echo $keys[$i]. "\n";
} // end of for loop

?>

----------------
ReadMe.txt
----------------

Architecture of this Search Engine
----------------------------------------------

This search engine has a new architecture compared to other search engines.

I invented and implemented this new search engine architecture.

This search engine has been developed mainly for the English alphabet. It is
based on the fact that no letter of the English alphabet has more than
30,000 words starting with it. It works on text/html files only.

This search engine was mainly developed so that it could be used on websites.
A website can integrate this search engine on its platform so that a user can
search anything on that website. The website can index all of its pages
through this search engine and also give a search box to the user. Websites
then do not have to rely on third-party search engines.

The structure of the search index is that there is a top level directory called
index_directory. This directory has 36 folders, named 0, 1, 2, .., 8, 9 and
a, b, c, .., y, z. Every word has an index file with the same name as the word,
stored in the folder named after the word's first character. So, since no
letter has more than 30,000 words starting with it, there will be at most
30,000 files in that folder. Modern OSes can handle many more files than that
in one directory.
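
As a rough sketch of this layout (not taken from the three programs; the
function name index_file_for_word is made up for illustration), the path of a
word's index file could be derived as follows, assuming words consist only of
the characters 0-9 and a-z after lowercasing:

```php
<?php
// Hypothetical helper: maps a word to its index file inside index_directory.
// Returns null for words that have no matching subdirectory (0-9, a-z).
function index_file_for_word(string $index_dir, string $word): ?string
{
    $word_l = strtolower($word);
    if (preg_match('/^[0-9a-z]+$/', $word_l) !== 1) {
        return null;
    }
    // The first character selects the subdirectory; the word is the file name.
    return $index_dir . "/" . $word_l[0] . "/" . $word_l;
}

echo index_file_for_word("index_directory", "Server") . "\n"; // index_directory/s/server
echo index_file_for_word("index_directory", "2024") . "\n";   // index_directory/2/2024
```

Rejecting anything outside 0-9 and a-z is a design choice of this sketch; it
keeps a word such as "../x" from resolving to a path outside index_directory.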

For example, if the word is "server", then there will be a file called
"server" in the "index_directory/s" folder. This file will contain the paths
of all documents that contain the word "server".

So, the contents of the file server can be:
https://www.myexample.com/abcd.html
https://www.myexample.com/1234.html
https://www.myexample.com/hello.html

These three html documents contain the word "server". Now, if someone wants to
search for the word "server" then the contents of this file will be printed on
the output page/screen, which means that these 3 documents contain the word
"server".
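
A minimal sketch of how such an index file can be built up, one document path
per line and without duplicate lines. This is an illustration only: the helper
name add_document_to_index_file is made up, and a temporary file stands in for
index_directory/s/server.

```php
<?php
// Hypothetical helper: append a document path to a word's index file
// unless that path is already listed there.
function add_document_to_index_file(string $index_file, string $doc_url): void
{
    $lines = file_exists($index_file)
        ? file($index_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : array();
    if (in_array($doc_url, $lines, true)) {
        return; // already listed, nothing to do
    }
    file_put_contents($index_file, $doc_url . "\n", FILE_APPEND);
}

// A temporary file stands in for index_directory/s/server.
$index_file = tempnam(sys_get_temp_dir(), "idx");
add_document_to_index_file($index_file, "https://www.myexample.com/abcd.html");
add_document_to_index_file($index_file, "https://www.myexample.com/1234.html");
add_document_to_index_file($index_file, "https://www.myexample.com/abcd.html"); // duplicate, ignored
$server_entries = file($index_file, FILE_IGNORE_NEW_LINES);
unlink($index_file);
print_r($server_entries);
```

The resulting file holds two lines, because the repeated path is not written a
second time.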

Now, let's suppose there is another word called "hello". So, there will be a
file in "index_directory/h" called hello and this will contain the paths of
all documents that contain the word "hello".

Let's suppose that the index file "hello" has following contents:
https://www.myexample.com/xyz.html
https://www.myexample.com/new.html
https://www.myexample.com/hello.html

Now, if someone searches for both keywords "server" and "hello", the
output will be:
https://www.myexample.com/hello.html
https://www.myexample.com/abcd.html
https://www.myexample.com/1234.html
https://www.myexample.com/xyz.html
https://www.myexample.com/new.html

So, you see that "https://www.myexample.com/hello.html" is the first URL to be
printed because it contains both "server" and "hello". The document that
contains the largest number of search words is printed first, followed by the
documents that contain fewer of them. So, basically the output is sorted in
descending order of the number of search words present in each document.
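
The ranking described above can be sketched as follows. This is an
illustration with the two example index files held in memory instead of read
from disk; the function name rank_documents is made up.

```php
<?php
// Contents of the example index files "server" and "hello", held in memory.
$index = array(
    "server" => array(
        "https://www.myexample.com/abcd.html",
        "https://www.myexample.com/1234.html",
        "https://www.myexample.com/hello.html",
    ),
    "hello" => array(
        "https://www.myexample.com/xyz.html",
        "https://www.myexample.com/new.html",
        "https://www.myexample.com/hello.html",
    ),
);

// Hypothetical helper: count, per document, how many search words it
// contains, then list the documents with the highest count first.
function rank_documents(array $index, array $search_words): array
{
    $counts = array();
    foreach ($search_words as $word) {
        foreach ($index[strtolower($word)] ?? array() as $doc) {
            $counts[$doc] = ($counts[$doc] ?? 0) + 1;
        }
    }
    arsort($counts); // descending by number of matched search words
    return array_keys($counts);
}

$ranked = rank_documents($index, array("server", "hello"));
echo $ranked[0] . "\n"; // https://www.myexample.com/hello.html
```

Only the first position is certain here: the other four documents all match
one search word, and their relative order depends on how the sort handles
ties.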

This search engine consists of three programs written in PHP, so it will run
on any platform that has PHP installed. The three programs are:

* create_index_directories.php
* create_index_or_add_to_existing_index.php
* search_index.php

* create_index_directories.php: This program creates index directories for
  storing index files. Required argument: Path to directory where the top
  level index directory and its subdirectories will be created. The top level
  index directory will be named index_directory.

    Usage:

        Syntax:
            create_index_directories [OPTIONS] [dir_path]

        Description:
            create_index_directories creates index directories for storing
            index files. "dir_path" is the path to directory where the top
            level index directory and sub directories will be created.
            The top level index directory will be named index_directory.

        Options:
            --help
                Print this usage/help and exit.

* create_index_or_add_to_existing_index.php: This program takes
  files/directories as arguments and parses the files (present in directories
  or given on command line) to create the search index files or add to already
  existing index files. The directories are processed recursively if -r option
  is given. This program also requires the path to directory where a directory
  called index_directory exists. This index_directory contains 36 folders
  named 0, 1, 2, .., 9 and a, b, c, .., y, z. Index files are created in
  subdirectories of index_directory. This program works on text/html files
  only. You can use program create_index_directories.php to create
  index_directory and its subdirectories.

    Usage:

        Syntax:
            create_index_or_add_to_existing_index OPTION[S] [FILE...] [DIR...]

            Description:
                create_index_or_add_to_existing_index parses a file
and creates search index files
                or adds to already existing index files. It works on
text/html files only.
                The file can be given as an argument or it may be
present in a directory which itself has been
                given as an argument. This program also requires the
path to directory
                where a directory called index_directory and its
subdirectories (0-9, a-z) exist.
                You can use program create_index_directories.php to create
                index_directory and its subdirectories. The paths to file/dir to
                be indexed should be relative to server_root_directory_path
                (to be given by specifying -s option).

            Options:
               -i path_to_index_directory (MANDATORY option)
                  Use -i option to specify the path to directory where directory
                  called index_directory and its subdirectories (0-9,
a-z) exist.
                  Index files are created in subdirectories of index_directory.

               -r
                  Specify -r option to process directory/directories
recursively.

               -p prefix_path
                  Please give a prefix to add before the file path
that will be written to
                  index files. It could be something like
https://mywebsite.com. If the
                  file path abcd/tyr.html is going to be written to
index file then it
                  will actually write
https://mywebsite.com/abcd/tyr.html in the index\
                  file if -p option is present.

               -s server_root_directory_path (MANDATORY option)
                  The \"absolute\" path to server root directory (from
where index.html or index.php will be served).
                  The paths to file/dir to be indexed should be
relative to server_root_directory_path.

              --help
                  Print this usage/help and exit.

    So, basically, the file to be indexed is found by combining
    server_root_directory_path and the path to files/directories given on the
    command line, while the file path to be written to the index is formed by
    combining the prefix and the path to files/directories given on the
    command line.
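
    The way the two paths are combined can be sketched as follows. This is
    only an illustration and not the code of the posted program: the function
    name resolve_paths and the array keys are hypothetical, and the
    normalisation of backslashes to forward slashes is an assumption made so
    that the example behaves the same on Windows and other platforms.

```php
<?php
// Hypothetical helper, not part of the posted programs.
// It shows how the -s root, the -p prefix and a relative path could be combined.
function resolve_paths(string $server_root, string $prefix, string $relative_path): array
{
    // Assumption: use forward slashes everywhere and avoid doubled slashes
    // at the points where the pieces are joined.
    $relative = ltrim(str_replace('\\', '/', $relative_path), '/');
    $root = rtrim(str_replace('\\', '/', $server_root), '/');
    $prefix = rtrim($prefix, '/');

    return [
        // The file that is opened and parsed: server root + relative path.
        'file_to_read' => $root . '/' . $relative,
        // The path stored in the index: prefix + relative path,
        // or just the relative path when no prefix was given.
        'path_written' => ($prefix === '') ? $relative : $prefix . '/' . $relative,
    ];
}

$paths = resolve_paths('C:\xampp\htdocs', 'http://localhost', 'files_to_be_indexed/readme_en.html');
echo $paths['file_to_read'], "\n";  // C:/xampp/htdocs/files_to_be_indexed/readme_en.html
echo $paths['path_written'], "\n";  // http://localhost/files_to_be_indexed/readme_en.html
```

    Keeping the two results separate is what lets the index store a URL that a
    browser can open while the indexer itself reads the file from disk.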

* search_index.php: This program searches for search words in index files. This
  program requires the path to the directory where a directory called
  index_directory exists. This index_directory contains 36 subdirectories named
  0, 1, 2, .., 9 and a, b, c, .., y, z. The index files are present in these
  subdirectories.

    Usage:

        Syntax:
            search_index OPTION[S] [search_word[s]...]

        Description:
            search_index searches for search_word[s] in index files. One or more
            search words can be specified. This program requires the path to
            the directory where a directory called index_directory and its
            subdirectories (0-9, a-z) exist. The index files are present in
            these subdirectories.

            Options:
              -i path_to_index_directory (MANDATORY option)
                  Use the -i option to specify the path to the directory where the
                  directory called index_directory exists.

              --help
                  Print this usage/help and exit.

Example
-----------
This search engine consists of three programs written in PHP, so it will run
on all platforms that have PHP installed. I used xampp/PHP on Windows to
develop this search engine, so I will give an example of how to use it on
Windows.

Step 1:
---------
Let's suppose that you have installed xampp in C:\ on Windows. So, your server
root directory will be C:\xampp\htdocs.

Step 2:
---------
Let's suppose that you have copied all search engine files into C:\search_engine.

Step 3:
---------
Now, let's create index_directory and its subdirectories in your server root
directory, which is C:\xampp\htdocs. The command and its output are given below:

C:\search_engine>php create_index_directories.php C:\xampp\htdocs

Created directory C:\xampp\htdocs/index_directory
Created directory C:\xampp\htdocs/index_directory/0
Created directory C:\xampp\htdocs/index_directory/1
Created directory C:\xampp\htdocs/index_directory/2
Created directory C:\xampp\htdocs/index_directory/3
Created directory C:\xampp\htdocs/index_directory/4
Created directory C:\xampp\htdocs/index_directory/5
Created directory C:\xampp\htdocs/index_directory/6
Created directory C:\xampp\htdocs/index_directory/7
Created directory C:\xampp\htdocs/index_directory/8
Created directory C:\xampp\htdocs/index_directory/9
Created directory C:\xampp\htdocs/index_directory/a
Created directory C:\xampp\htdocs/index_directory/b
Created directory C:\xampp\htdocs/index_directory/c
Created directory C:\xampp\htdocs/index_directory/d
Created directory C:\xampp\htdocs/index_directory/e
Created directory C:\xampp\htdocs/index_directory/f
Created directory C:\xampp\htdocs/index_directory/g
Created directory C:\xampp\htdocs/index_directory/h
Created directory C:\xampp\htdocs/index_directory/i
Created directory C:\xampp\htdocs/index_directory/j
Created directory C:\xampp\htdocs/index_directory/k
Created directory C:\xampp\htdocs/index_directory/l
Created directory C:\xampp\htdocs/index_directory/m
Created directory C:\xampp\htdocs/index_directory/n
Created directory C:\xampp\htdocs/index_directory/o
Created directory C:\xampp\htdocs/index_directory/p
Created directory C:\xampp\htdocs/index_directory/q
Created directory C:\xampp\htdocs/index_directory/r
Created directory C:\xampp\htdocs/index_directory/s
Created directory C:\xampp\htdocs/index_directory/t
Created directory C:\xampp\htdocs/index_directory/u
Created directory C:\xampp\htdocs/index_directory/v
Created directory C:\xampp\htdocs/index_directory/w
Created directory C:\xampp\htdocs/index_directory/x
Created directory C:\xampp\htdocs/index_directory/y
Created directory C:\xampp\htdocs/index_directory/z
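
The 36 subdirectory names in the output above (0-9 and a-z) can be generated
in a few lines of PHP. The sketch below is only an illustration: both function
names are hypothetical, and the rule that a word belongs to the subdirectory
named after its first character is an assumption about the layout, not
something taken from the posted code.

```php
<?php
// Hypothetical helper: the 36 subdirectory names, digits first, then letters.
function index_subdirectory_names(): array
{
    // range(0, 9) yields integers, so convert them to strings to match a-z.
    return array_merge(array_map('strval', range(0, 9)), range('a', 'z'));
}

// Assumption: a word is filed under the subdirectory named after its first
// character, lowercased. Returns null when no subdirectory matches.
function subdirectory_for_word(string $word): ?string
{
    if ($word === '') {
        return null;
    }
    $first = strtolower($word[0]);
    return in_array($first, index_subdirectory_names(), true) ? $first : null;
}

echo count(index_subdirectory_names()), "\n";  // 36
echo subdirectory_for_word('Server'), "\n";    // s
```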

Step 4:
---------
Now, let's suppose that all files to be indexed are in the directory
files_to_be_indexed in your server root directory
(C:\xampp\htdocs\files_to_be_indexed). Files can also be given on the command
line, but in this example I am giving a directory.

Now, the command to create the index from the files in files_to_be_indexed is
given below:

C:\search_engine>php create_index_or_add_to_existing_index.php -r -i C:\xampp\htdocs -p http://localhost -s C:\xampp\htdocs files_to_be_indexed

Step 5:
---------
Now, let's search for four words "server hello stop start".

The command and its output are given below:

C:\search_engine>php search_index.php -i C:\xampp\htdocs server hello stop start

http://localhost/files_to_be_indexed/2/catalina_service.txt
http://localhost/files_to_be_indexed/3/ctlscript.html
http://localhost/files_to_be_indexed/readme_de.txt
http://localhost/files_to_be_indexed/readme_en.html
http://localhost/files_to_be_indexed/3/4/5/filezilla_start.html
http://localhost/files_to_be_indexed/3/4/5/filezilla_stop.html
http://localhost/files_to_be_indexed/3/4/mercury_start.html
http://localhost/files_to_be_indexed/3/catalina_stop.txt
http://localhost/files_to_be_indexed/2/apache_stop.txt
http://localhost/files_to_be_indexed/3/4/5/mysql_stop.html
http://localhost/files_to_be_indexed/3/4/mysql_start.html
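
To offer a search box on a website, a page could build the same command line
from the words typed by the user. The sketch below only builds the command
string; the function name build_search_command is hypothetical, and the
script location is an assumption taken from the example directories used in
the steps above. Every argument goes through escapeshellarg() so that user
input cannot inject extra shell commands.

```php
<?php
// Hypothetical wrapper, not part of the posted programs.
// Builds the command line for search_index.php from a list of search words.
function build_search_command(string $script_path, string $index_parent_dir, array $words): string
{
    $parts = ['php', escapeshellarg($script_path), '-i', escapeshellarg($index_parent_dir)];
    foreach ($words as $word) {
        // Quote each word separately so it is passed as one argument.
        $parts[] = escapeshellarg((string) $word);
    }
    return implode(' ', $parts);
}

// Paths below are the example locations from the steps above.
$command = build_search_command(
    'C:\search_engine\search_index.php',
    'C:\xampp\htdocs',
    ['server', 'hello', 'stop', 'start']
);
echo $command, "\n";
```

The resulting string could then be passed to exec(), which collects the
program's output lines into an array, one matching path per line as in the
output above.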

---- End of code and ReadMe ----


