/home/wpollock1/public_html/restricted/Java2/SearchEngine/com/wpollock/searchengine/FileFormat.txt
Index File Format
Version 1.0
The file SrchNgn.dat (Search Engine Data) is a text file with long lines,
in UTF-8 encoding, with DOS end-of-line markers (<CR><LF>).
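As a rough illustration of the encoding rules above, the following Java
sketch turns a list of lines into the bytes that would be written to the
file. The class and method names (IndexFileLines, encode) are made up for
this example and are not part of the format.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: names are illustrative, not part of the spec.
class IndexFileLines {
    // DOS end-of-line marker: carriage return followed by line feed.
    static final String EOL = "\r\n";

    // Joins the lines, ending each one with CR LF, and encodes the
    // result as UTF-8 bytes ready to be written to SrchNgn.dat.
    static byte[] encode(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append(EOL);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

When reading the file back, java.io.BufferedReader.readLine() strips
either kind of line ending, so a reader only needs to specify UTF-8.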
After a two-line header, the first part of the file lists the files used
to generate the index, one per line. A blank line then separates the file
list from the index data.
The first line of the file will start with the text:
SearchData 1.0
which indicates the version of the file format, in case it changes someday.
The second line contains the next unique file ID to assign (an unsigned
integer). This is initially zero. For example:
0
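The two header lines could be checked with something like the sketch
below. It is only one possible approach: the class name IndexHeader, its
fields, and the choice to reject a file whose first word is not
"SearchData" are assumptions made for this example.

```java
// Hypothetical holder for the two header lines; names are illustrative.
class IndexHeader {
    static final String MAGIC = "SearchData";

    final String version;   // e.g. "1.0"
    final long nextFileId;  // next unique file ID to assign

    IndexHeader(String version, long nextFileId) {
        this.version = version;
        this.nextFileId = nextFileId;
    }

    // Parses the first two lines of the file (end-of-line already removed).
    static IndexHeader parse(String firstLine, String secondLine) {
        String[] parts = firstLine.trim().split("\\s+");
        if (parts.length != 2 || !parts[0].equals(MAGIC)) {
            throw new IllegalArgumentException("Not an index file: " + firstLine);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        long nextId = Long.parseLong(secondLine.trim());
        if (nextId < 0) {
            throw new IllegalArgumentException("Negative file ID: " + nextId);
        }
        return new IndexHeader(parts[1], nextId);
    }
}
```

A reader would typically compare the version field against the versions
it understands before going on to the file list.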
After these two header lines, each file is listed, one per line,
in this format:
<ID> <TAB> <pathname of file> <TAB> <time of last modification>
where "<ID>" is the unique document ID, "<pathname of file>" is a string
giving the absolute pathname of the file on the local system (note that
filenames can't contain TAB characters), and "<time of last modification>"
is an unsigned integer (a long) representing a standard timestamp
(milliseconds since the epoch, 00:00:00 GMT, January 1, 1970). An example
might be:
1 C:\Temp\file.txt 1329170774139
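A file-list line could be read and written along the lines of the sketch
below. The class name FileEntry and its members are hypothetical. The
sketch also assumes that the spaces around <TAB> in the format line above
are only there for readability, so that a single TAB character separates
the fields.

```java
// Hypothetical record for one line of the file list; names are illustrative.
class FileEntry {
    final long id;            // unique document ID
    final String path;        // absolute pathname (must not contain TABs)
    final long lastModified;  // milliseconds since the epoch

    FileEntry(long id, String path, long lastModified) {
        this.id = id;
        this.path = path;
        this.lastModified = lastModified;
    }

    // Splits on TAB; the limit of -1 keeps empty trailing fields so that
    // a line with a missing timestamp is reported rather than misread.
    static FileEntry parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return new FileEntry(Long.parseLong(f[0].trim()), f[1],
                             Long.parseLong(f[2].trim()));
    }

    // Produces the line (without end-of-line marker) for this entry.
    String format() {
        return id + "\t" + path + "\t" + lastModified;
    }
}
```

The lastModified value has the same units as java.io.File.lastModified(),
so the two can be compared directly to decide whether a file needs to be
indexed again.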
After the blank line that marks the end of the list of files, the inverted
index data runs to the end of the file. There is one line for each word.
The first field is the word, and each remaining field is a pair of numbers
giving a document ID and a position within that document. The two numbers
of a pair are separated by a comma. Fields are separated by white space;
leading and trailing white space is ignored. For example:
apple 0,12 3,0 3,19 3,1262 12,0
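One way to read an index line is sketched below. The class name IndexLine
and the use of a two-element long array for each pair are choices made
for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for one inverted-index line; names are illustrative.
class IndexLine {
    final String word;
    final List<long[]> postings;  // each element is {documentId, position}

    IndexLine(String word, List<long[]> postings) {
        this.word = word;
        this.postings = postings;
    }

    static IndexLine parse(String line) {
        // trim() drops leading and trailing white space; \s+ splits fields.
        String[] fields = line.trim().split("\\s+");
        if (fields[0].isEmpty()) {
            throw new IllegalArgumentException("Empty index line");
        }
        List<long[]> postings = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            // String.split drops trailing empty strings, so a field with
            // a stray trailing comma such as "3,0," is also accepted.
            String[] pair = fields[i].split(",");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Bad pair: " + fields[i]);
            }
            postings.add(new long[] {
                Long.parseLong(pair[0]), Long.parseLong(pair[1]) });
        }
        return new IndexLine(fields[0], postings);
    }
}
```

For the example line, parse would yield the word "apple" with five
pairs, three of them for document 3.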
(JSON might be a better choice for version 2.)