In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or set of documents (or database). The purpose of an inverted index is to allow fast, full-text searches. It is the most popular data structure used in document retrieval systems, used on a large scale (for example) in search engines.
There are two main variants of inverted indexes.
The simplest is an inverted file index
(or just inverted file).
This type of index contains a Map
of
references to documents, for each word.
The word is the map key, and the set of documents is the value.
For example, indexing these three (very short) documents:
File | Contents |
---|---|
File0.txt | it is what it is |
File1.txt | what is it |
File2.txt | it is a banana |
results in the following inverted file index:
Word | Set of Files containing word |
---|---|
a | {2} |
banana | {2} |
is | {0, 1, 2} |
it | {0, 1, 2} |
what | {0, 1} |
Typically, a small integer is used to represent each file.
So “0”, “1”, and “2” in this example
refer to the documents “File0.txt”, “File1.txt”,
and “File2.txt”.
The curly braces denote “sets”.
As you can see, the index is nothing more than a map of these sets,
with one set for each word that appears in at least one document.
The key for the map is the word, and the value is the set.
(A list of file names is part of the index, but not shown here.
This additional List
is used to convert the numbers back to
file names.)
An AND
search for the terms “what”,
“is” and “it” would give the set
(“∩” means
set
intersection):
{0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}
In this example, the word “what” appears in files 0 and 1, the words “is” and “it” both appear in all three files. With an AND search, we need the set of documents that have all three terms. All three search terms appear only in documents 0 and 1.
An OR
search for the same terms would give the set
(“∪” means
set
union):
{0,1} ∪ {0,1,2} ∪ {0,1,2} = {0, 1, 2}
That is, all three documents. With an OR search, the results include the documents that include at least one of the search terms.
An AND
search for the terms “banana”
and “what”
results in no files found (we say the number of hits was zero).
There are no files that contain both of these words.
A PHRASE
search isn't possible with a simple inverted index.
A full inverted index is the same as the simple inverted index, but additionally contains the positions of each word within a document. This offers more functionality (like phrase searches), but needs more to create and use. Instead of a set of document numbers, you need a set of pairs of numbers. The number pairs (or tuples) are the document numbers and the position of the word within that document, starting with zero for the first word. So the map entry:
banana: {(2, 3)}
means the word “banana” is in document 2
(File2.txt
), and it is the fourth word in that document
(position 3).
With the full inverted index, instead of a set of document numbers for
each word, you have a set of pairs (document number plus position)
for each word.
Here is the full inverted index for the three sample files:
Word | Set of (File,Position) Pairs |
---|---|
a | {(2, 2)} |
banana | {(2, 3)} |
is | {(0, 1), (0, 4), (1, 1), (2, 1)} |
it | {(0, 0), (0, 3), (1, 2), (2, 0)} |
what | {(0, 2), (1, 0)} |
As you can see, the full index is a map of sets of pairs, with one set for each word. Each set contains pairs (document number + position). Note that some words appear in multiple positions in the same document, so you have multiple entries in that set. Real-world data structures are often complicated like this, with lists of sets of whatever.
For AND and OR searches, the position doesn't matter and can be ignored. (Doing AND and OR searches is fairly straight-forward, and is left as an exercise for the reader.)
If we run a phrase search for “what is it”, we get “hits” for all three search terms in File0.txt and File1.txt. But the terms occur consecutively only in File1.txt, so a PHRASE search should only show that one document. How can you use a full inverted index to perform a PHRASE search? There are several ways; here's one way:
You need to produce the set of documents that contain all the search
terms, in order.
You can determine this by first creating a result set of all
the document+position pairs that contain the first word, in any position.
Only these documents can possibly contain the whole phrase.
(Note that a word such as “it” appears in two places
in File0.txt
, so either might be the start of the phrase.
You need to list both pairs in the result set.)
The result set at this point shows the pairs for “what”
looks like this:
{(0, 2), (1, 0)}
This list tells us that only documents zero or one could possibly contain the phrase “what is it”. Now we have to see which of these (if any) is followed by the next word in the phrase. That is, we need to check if document zero contains the word “is” in position 3, and if document one contains that word in position 1.
Create a second result set that is initially empty. For each document+position pair in the first result set, check if the next word (“is” in our example) appears in the same document, but in the following position. If not, that entry can't possibly indicate the phrase in question. But if so, you add the document+position pair for the second word to the second result set. In the example phrase search, you need to see if document zero contains the word “is” at position 3, and if document one contains that word at position 1. The new result set looks like this:
{(1, 1)}
As you can see, only document one can possible contain the phrase: it has “what” at position 0 and “is” at position 1. Once the new result set has been built, you no longer need the original one.
Repeat the previous steps for each word in the phrase, until you either run out of words (and the resulting set of documents are just those that contain the phrase) or until you get an empty result set (in which case, the phrase doesn't appear in any of the indexed documents). In our example, we need to see if document one contains the word “it” at position 2. After this, the new result set looks like this:
{(1, 2)}
Having run out of words in our phrase, we are done! Document one contains the phrase in positions 0 through 2.
This is a group project. It is also a real-world, complex application. It is my hope that when done, you will understand files, collections, and the process of software development much more fully than you would by reading the textbook alone. You must kept your project as simple as possible, with any “extras” saved for a later version (if any). Keep the user interface plain and simple. Time management will be critical for success: set milestones for your project, with due dates. (For example, milestone 1 might be the basic user interface, milestone 2 might be the administrators interface, and so on.)
Students should form their own groups of no more than four students per group (unless permission to form a larger group is given). It will be up to you to determine how to organize and manage your group, and when, where, how, and how often to meet.
Build a simple GUI search engine that can do AND, OR, and PHRASE searches on a small set of text files. The user should be able to say the type of search to do, and enter some search terms. The results should be a list of files that match the search. This should be a stand-alone application.
Using NetBean's GUI Builder is allowed, but don't expect your instructor to read the resulting generated code. Remember you will not be graded on the beauty of your user interface! As long as it supports the requirements of the project, that is good enough. My suggestion is to use this opportunity to learn to code a Java GUI without using any sort of GUI builder. (AWT is probably easiest, but your group is free to use Swing or Oracle's JavaFx if you wish. (Don't expect your instructor to be able to help with JavaFx!)
It should be easy to add and remove files (from the set of indexed files), and to regenerate the index anytime, from a separate administration window. When starting, your application should check if any of the files have been changed or deleted since the application last saved the index. If so, the user should be able to have the inverted index file(s) updated.
(Note that with HTML, Word, or other types of documents, you would need to extract a plain text version before indexing. For these projects, limit your search engine to only plain text files.)
The inverted index must be stored in one or more file(s), and that file should be read whenever your application starts. The file(s) should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents). The file format is up to you, but should have a format that is fast and simple to search. However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory. All you need to do is be able to read the index data from a file at startup into memory, and write it back when updating the index. Note, the names (pathnames) of the files as well as their last modification time must be stored as well. It is your choice to use a single file or multiple files, plain text, XML, or a databse to hold the persistent data. In any case, your file format(s) (or database schema) must be documented completely, so that someone else without access to your source code could use your file(s).
If using XML file, you can define an XML schema for it and have some tool such as Microsoft's XML Notepad utility or Notepad++ validate your file format for you. XML may have other benefits, but it isn't as simple as using plain text files. In any case, don't forget to include the list of file (path) names and other data you decide is needed, along with the index itself.
The class must work in groups of three or four students per group. Any student not part of a group must let the instructor know immediately. In this case, the instructor will form the groups.
Your group will agree to use a single GitHub repo for this project. Every student must make commits to this repo for their part of the project. (So every member of the project must do their share of the code.)
This project has been split into three parts. Each part counts as a separate project. In this first part, your group must design and implement a non-functional graphic user interface for the application. The user interface must support both searching as well as various administration operations: adding files to the index, remove files from the index, and update the index (when files have been modified since they were indexed).
The user interface should be complete, but none of the functionality needs to be implemented at this time. You should implement stub methods for the functionality not yet implemented, and invoke them from your event handlers. The stub methods can either return “canned” (fake but realistic) data, or throw an OperationNotSupported exception.
You can download a Search Engine model solution to play with it and inspect its user interface. Please keep in mind you should not copy that user interface; instead, invent a better one.
Preview of next projects: In part II (data), your group will implement the file or database operations of the search engine. That includes reading and writing the inverted index and file list from/to storage, reading other files a word at a time, and storing file metadata such as last modified time. In part III (collections), you will implement the inverted index operations, including Boolean searching, adding to the index, and removing files from the index. (The index is a complex data structure.)
Keep your code simple. You can always add features later, time permitting. If you start with a complex, hard-to-implement design, you may run out of time.
Keep in mind the requirements for the remaining parts of these projects, as that will affect what is needed in your user interface.
How your group is organized is up to the group members. Some suggestions include:
readme.md
file;
doing so will be helpful when cloning the repo. readme.md
file in your team repo
with that information, or even create skeleton classes. In the previous project, we used a very simple workflow without any branches. That is unlikely to work for most real-world projects. While your group can decide on any workflow you wish, Here's a suggestion:
If your team organization is that multiple members work on the same code and the team votes for the final version to turn in, you need a different workflow to ensure every team member's commits are in the master branch, which is the only branch that will be examined for grading:
(When you work on projects from different organizations, there are many different workflows that are popular. The ones suggested here may not (and probably won't) be the same as the workflow you will need to use then.)
A working link to your group's GitHub repo used for this project and your individual peer ratings (see below). Your project's final version should receive a Git tag of “Proj 3 UI”, so I know which version to grade.
Be sure the names of all group members are listed in the comments of all files. You should use GitHub's issue tracker, wiki, email, Facebook, Skype, or any means you wish, to communicate within your group. (It is suggested you hold short group meetings before or just after class.)
Grading will be done for the group's results and individual commits. Individuals in the group will have their grades adjusted by peer ratings. (This is useful since some group members may contribute more with design, help, and code reviews, than they do by making commits.) A rating of each team member's level of participation should be sent by individually by every member directly to the instructor. Be sure to include yourself in the ratings! The rating is a number from 0 (didn't participate at all), 1 (less than their fair share of the work), 2 (participated fully), or 3 (did more than their fair share of the work). Additional comments are allowed if you wish to elaborate.
Send your group ratings, including the GitHub link, as email to (preferred).
Please see your syllabus for more information about projects, and about submitting projects.