COP-2805 Project - Building a Search Engine Part I (the UI)

In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or set of documents (or database). The purpose of an inverted index is to allow fast, full-text searches. It is the most popular data structure used in document retrieval systems, used on a large scale (for example) in search engines.

Simple Inverted Indexes

There are two main variants of inverted indexes. The simplest is an inverted file index (or just inverted file). This type of index contains a Map of references to documents, for each word. The word is the map key, and the set of documents is the value. For example, indexing these three (very short) documents:

Example Documents
File	Contents
File0.txt	it is what it is
File1.txt	what is it
File2.txt	it is a banana

results in the following inverted file index:

Inverted File Index for Example Documents
Word	Set of Files containing word
a	{2}
banana	{2}
is	{0, 1, 2}
it	{0, 1, 2}
what	{0, 1}

Typically, a small integer is used to represent each file. So “0”, “1”, and “2” in this example refer to the documents “File0.txt”, “File1.txt”, and “File2.txt”. The curly braces denote “sets”. As you can see, the index is nothing more than a map of these sets, with one set for each word that appears in at least one document. The key for the map is the word, and the value is the set. (A list of file names is part of the index, but not shown here. This additional List is used to convert the numbers back to file names.)

An AND search for the terms “what”, “is” and “it” would give the set (“∩” means set intersection):

{0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}

In this example, the word “what” appears in files 0 and 1, the words “is” and “it” both appear in all three files. With an AND search, we need the set of documents that have all three terms. All three search terms appear only in documents 0 and 1.

An OR search for the same terms would give the set (“∪” means set union):

{0,1} ∪ {0,1,2} ∪ {0,1,2} = {0, 1, 2}

That is, all three documents. With an OR search, the results include the documents that include at least one of the search terms.

An AND search for the terms “banana” and “what” results in no files found (we say the number of hits was zero). There are no files that contain both of these words.

A PHRASE search isn't possible with a simple inverted index.

Full Inverted Indexes

A full inverted index is the same as the simple inverted index, but additionally contains the positions of each word within a document. This offers more functionality (like phrase searches), but needs more to create and use. Instead of a set of document numbers, you need a set of pairs of numbers. The number pairs (or tuples) are the document numbers and the position of the word within that document, starting with zero for the first word. So the map entry:

banana: {(2, 3)}

means the word “banana” is in document 2 (File2.txt), and it is the fourth word in that document (position 3). With the full inverted index, instead of a set of document numbers for each word, you have a set of pairs (document number plus position) for each word. Here is the full inverted index for the three sample files:

Full Inverted Index for Example Documents
Word	Set of (File,Position) Pairs
a	{(2, 2)}
banana	{(2, 3)}
is	{(0, 1), (0, 4), (1, 1), (2, 1)}
it	{(0, 0), (0, 3), (1, 2), (2, 0)}
what	{(0, 2), (1, 0)}

As you can see, the full index is a map of sets of pairs, with one set for each word. Each set contains pairs (document number + position). Note that some words appear in multiple positions in the same document, so you have multiple entries in that set. Real-world data structures are often complicated like this, with lists of sets of whatever.

For AND and OR searches, the position doesn't matter and can be ignored. (Doing AND and OR searches is fairly straight-forward, and is left as an exercise for the reader.)

If we run a phrase search for “what is it”, we get “hits” for all three search terms in File0.txt and File1.txt. But the terms occur consecutively only in File1.txt, so a PHRASE search should only show that one document. How can you use a full inverted index to perform a PHRASE search? There are several ways; here's one way:

You need to produce the set of documents that contain all the search terms, in order. You can determine this by first creating a result set of all the document+position pairs that contain the first word, in any position. Only these documents can possibly contain the whole phrase. (Note that a word such as “it” appears in two places in File0.txt, so either might be the start of the phrase. You need to list both pairs in the result set.) The result set at this point shows the pairs for “what” looks like this:

{(0, 2), (1, 0)}

This list tells us that only documents zero or one could possibly contain the phrase “what is it”. Now we have to see which of these (if any) is followed by the next word in the phrase. That is, we need to check if document zero contains the word “is” in position 3, and if document one contains that word in position 1.

Create a second result set that is initially empty. For each document+position pair in the first result set, check if the next word (“is” in our example) appears in the same document, but in the following position. If not, that entry can't possibly indicate the phrase in question. But if so, you add the document+position pair for the second word to the second result set. In the example phrase search, you need to see if document zero contains the word “is” at position 3, and if document one contains that word at position 1. The new result set looks like this:

{(1, 1)}

As you can see, only document one can possible contain the phrase: it has “what” at position 0 and “is” at position 1. Once the new result set has been built, you no longer need the original one.

Repeat the previous steps for each word in the phrase, until you either run out of words (and the resulting set of documents are just those that contain the phrase) or until you get an empty result set (in which case, the phrase doesn't appear in any of the indexed documents). In our example, we need to see if document one contains the word “it” at position 2. After this, the new result set looks like this:

{(1, 2)}

Having run out of words in our phrase, we are done! Document one contains the phrase in positions 0 through 2.

This is a group project. It is also a real-world, complex application. It is my hope that when done, you will understand files, collections, and the process of software development much more fully than you would by reading the textbook alone. You must kept your project as simple as possible, with any “extras” saved for a later version (if any). Keep the user interface plain and simple. Time management will be critical for success: set milestones for your project, with due dates. (For example, milestone 1 might be the basic user interface, milestone 2 might be the administrators interface, and so on.)

Students should form their own groups of no more than four students per group (unless permission to form a larger group is given). It will be up to you to determine how to organize and manage your group, and when, where, how, and how often to meet.

Build a simple GUI search engine that can do AND, OR, and PHRASE searches on a small set of text files. The user should be able to say the type of search to do, and enter some search terms. The results should be a list of files that match the search. This should be a stand-alone application.

Using NetBean's GUI Builder is allowed, but don't expect your instructor to read the resulting generated code. Remember you will not be graded on the beauty of your user interface! As long as it supports the requirements of the project, that is good enough. My suggestion is to use this opportunity to learn to code a Java GUI without using any sort of GUI builder. (AWT is probably easiest, but your group is free to use Swing or Oracle's JavaFx if you wish. (Don't expect your instructor to be able to help with JavaFx!)

It should be easy to add and remove files (from the set of indexed files), and to regenerate the index anytime, from a separate administration window. When starting, your application should check if any of the files have been changed or deleted since the application last saved the index. If so, the user should be able to have the inverted index file(s) updated.

(Note that with HTML, Word, or other types of documents, you would need to extract a plain text version before indexing. For these projects, limit your search engine to only plain text files.)

The inverted index must be stored in one or more file(s), and that file should be read whenever your application starts. The file(s) should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents). The file format is up to you, but should have a format that is fast and simple to search. However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory. All you need to do is be able to read the index data from a file at startup into memory, and write it back when updating the index. Note, the names (pathnames) of the files as well as their last modification time must be stored as well. It is your choice to use a single file or multiple files, plain text, XML, or a databse to hold the persistent data. In any case, your file format(s) (or database schema) must be documented completely, so that someone else without access to your source code could use your file(s).

If using XML file, you can define an XML schema for it and have some tool such as Microsoft's XML Notepad utility or Notepad++ validate your file format for you. XML may have other benefits, but it isn't as simple as using plain text files. In any case, don't forget to include the list of file (path) names and other data you decide is needed, along with the index itself.

The class must work in groups of three or four students per group. Any student not part of a group must let the instructor know immediately. In this case, the instructor will form the groups.

Your group will agree to use a single GitHub repo for this project. Every student must make commits to this repo for their part of the project. (So every member of the project must do their share of the code.)

This project has been split into three parts. Each part counts as a separate project. In this first part, your group must design and implement a non-functional graphic user interface for the application. The user interface must support both searching as well as various administration operations: adding files to the index, remove files from the index, and update the index (when files have been modified since they were indexed).

The user interface should be complete, but none of the functionality needs to be implemented at this time. You should implement stub methods for the functionality not yet implemented, and invoke them from your event handlers. The stub methods can either return “canned” (fake but realistic) data, or throw an OperationNotSupported exception.

You can download a Search Engine model solution to play with it and inspect its user interface. Please keep in mind you should not copy that user interface; instead, invent a better one.

Preview of next projects: In part II (data), your group will implement the file or database operations of the search engine. That includes reading and writing the inverted index and file list from/to storage, reading other files a word at a time, and storing file metadata such as last modified time. In part III (collections), you will implement the inverted index operations, including Boolean searching, adding to the index, and removing files from the index. (The index is a complex data structure.)

Keep your code simple. You can always add features later, time permitting. If you start with a complex, hard-to-implement design, you may run out of time.

Keep in mind the requirements for the remaining parts of these projects, as that will affect what is needed in your user interface.

How your group is organized is up to the group members. Some suggestions include:

Pick a group name that can be used as a legal Java package name.
Create a new project (module) on GitHub.com, for your group's files. Name the new project after your group (so the other groups in our class won't use the same name). When you create the repo on GitHub, include a readme.md file; doing so will be helpful when cloning the repo.
Decide on a project structure: pick one member as the project lead. (You should pick different project leads for subsequent projects, so all your team members have a chance. The project lead can break deadlocks when your group cannot agree on something, arrange group meetings, etc.
After meeting once or twice, your team should agree on some aspects of the project organization: text encoding and line endings for files, and project organization. Your group must agree on package and class (and thus file) names. The project lead can update the readme.md file in your team repo with that information, or even create skeleton classes.
One possibility is for each member of the group to implement the project independently (once your group agrees on package and class names). Then your group can compare the results, and merge the best ideas of each into your group's master repo. This organization works well when team members are having difficulty agreeing on the code design.
Another possibility is to assign different parts to each member, with a code review by all group members, before the project due date.
Still another possibility is to define what must be done and add an issue for each. Then team members can take possession of one or more issues, or several team members can work on a single issue and work out a best solution.
Don't forget the branch features of the versioning software; once the project has been initially committed, group members can create their own branches for development. (You should have a master branch for the final submission version, containing commit objects from every team member.)
It is also possible to have different group members responsible for different classes, or different methods (if you don't have enough classes for everyone).

Git Workflow Suggestions:

In the previous project, we used a very simple workflow without any branches. That is unlikely to work for most real-world projects. While your group can decide on any workflow you wish, Here's a suggestion:

After initial GitHub repo creation, you should clone that repo on your PC and create a Java project for it.
Next, create and checkout a personal branch for your work. While you can create many branches as you develop and try out different ideas (or work on different issues), have one as your personal main development branch. (Do not work in the master branch.)
Now do some work, committing often in your branch. Once your code is ready for code review (or if you need help from your teammates), you can push your branch up to the GitHub project's repo. Now others can see your code. (If your code is long, you should consider making a pull request on GitHub, which may be easier for your teammates to review.)
If your code needs some work or you need to incorporate suggestions from the code review, go back to work as before. When the final version is approved, you can merge in your branch's commits into the master branch:
1. Before attempting a merge, do a pull from GitHub to your repo. Do not forget this step! Your local copy of the master branch may not contain the latest commits unless you do this.
2. Make sure you switch to your (local) master branch before you do anything else. Then merge in your work from your personal branch.
3. Once that is successful, push the revised master branch up to GitHub.

If your team organization is that multiple members work on the same code and the team votes for the final version to turn in, you need a different workflow to ensure every team member's commits are in the master branch, which is the only branch that will be examined for grading:

No commits are made on the master branch until the final version has been chosen by your group.
The team lead then merges the HEAD of each member's personal branch into master, starting with those that did not win. End by merging the final version selected by your team. Note that these merges are easy to do, since you always ignore any conflicts by using the version from the branch you are merging in.

(When you work on projects from different organizations, there are many different workflows that are popular. The ones suggested here may not (and probably won't) be the same as the workflow you will need to use then.)

A working link to your group's GitHub repo used for this project and your individual peer ratings (see below). Your project's final version should receive a Git tag of “Proj 3 UI”, so I know which version to grade.

Be sure the names of all group members are listed in the comments of all files. You should use GitHub's issue tracker, wiki, email, Facebook, Skype, or any means you wish, to communicate within your group. (It is suggested you hold short group meetings before or just after class.)

Grading will be done for the group's results and individual commits. Individuals in the group will have their grades adjusted by peer ratings. (This is useful since some group members may contribute more with design, help, and code reviews, than they do by making commits.) A rating of each team member's level of participation should be sent by individually by every member directly to the instructor. Be sure to include yourself in the ratings! The rating is a number from 0 (didn't participate at all), 1 (less than their fair share of the work), 2 (participated fully), or 3 (did more than their fair share of the work). Additional comments are allowed if you wish to elaborate.

Send your group ratings, including the GitHub link, as email to (preferred).

Please see your syllabus for more information about projects, and about submitting projects.

COP 2805 (Java II) Project
Building a Search Engine, Part I: The UI

Due: **by the start of class** on the date shown on the syllabus

Background:

Simple Inverted Indexes

Full Inverted Indexes

Search Engine Projects Overall Description:

Part I Requirements:

Hints:

Git Workflow Suggestions:

To be Turned in:

COP 2805 (Java II) Project Building a Search Engine, Part I: The UI

Due: by the start of class on the date shown on the syllabus

Background:

Simple Inverted Indexes

Full Inverted Indexes

Search Engine Projects Overall Description:

Part I Requirements:

Hints:

Git Workflow Suggestions:

To be Turned in:

COP 2805 (Java II) Project
Building a Search Engine, Part I: The UI

Due: **by the start of class** on the date shown on the syllabus