Please read the background information and full project description from Search Engine Project, Part I.
In this final part of the project, you will complete the application by implementing the index functions. These include adding a file to the index, and removing a file from the index, and reading and writing the index from/to a file. (Updating the index when a file has been changed, can then be done by removing and then re-adding a file.) Other operations include searching the index for a given word, and returning a Set of pairs (document ID and position) for that word.
Finally, you will have to implement the Boolean search functions of the main user interface. (This is complex enough, that it should have been another project!) I suggest you start with an “OR” search, then worry about implementing the “AND” and “PHRASE” search functions.
When building the index, keep in mind you will need to define what you mean by “word”. One possibility is to strip out any non-digits or letters, and convert the result to all lowercase, both when you build the inverted index and when you read the search terms entered by the user. Ideally, you can use the I18N methods discussed in class to normalize the words.
The exact method depends in part on how you implement the inverted index. In the suggested implementation (a Map with words as the keys, and a List or Set of (document ID, position) pairs as the values), you could implement the Boolean searches using algorithms similar to the following (you can come up with your own if you wish):
This is the easiest one to implement.
The general idea is to start with an empty Set of matching files.
Then add to that Set, the files containing each search term;
Just search the Map for that word, and add each document found (if any).
The result is the OR search results, the files that contain any word in the
(If user inputs no search words, say “
then no files are considered as matching.)
This is done the opposite way from an OR search, and is only a little
harder to implement.
The idea is to start with a set of all files in the index.
Then for each search term, for each file in the Set, make sure that
file is contained in the index for that search term.
Remove any files from the set that don't contain that word.
The resulting final set is the documents matching all search terms.
(If user inputs no search words, say
,.”, then all files are considered as
If that isn't the behavior you want, you need to treat that as a special case.)
This is the hardest search to implement. Unlike the OR and the AND searches, with PHRASE searching, the position of the search terms in the files matters. The algorithm I came up with is:
The Set now contains Pair objects for just the files that might contain the phrase. Next, loop over the remaining words of the phrase, removing any Pairs from the set that are no longer possible phrase continuations. (Actually, I just build a new Set.)
An example may help clarify this.
Suppose the search phrase is “
big top now”.
The set initially contains all the Pair objects for the word
Let's say for example, that set looks like:
(file1,position7), (file1,position22), (file3,position4)
For each Pair object in that set, you need to see if
top” is in that same file, but the next
If so, you add the Pair object for that to the new Set.
The (inner) loop for this example checks each of the following:
Is a (file1,position8) Pair object in the Map for the word "top"?
Is a (file1,position23) Pair object in the Map for the word "top"?
Is a (file3,position5) Pair object in the Map for the word "top"?
If the answer is “yes”, then add that Pair object to the new
When this loop ends, the new set will contain the Pair objects for the
big top” (pointing to the position of the
For example, suppose “
top” is only found in
(file1,position8) and (file3,position5).
You replace the first set with this new set:
The class must work in groups of three or four students per group. Any student not part of a group must let the instructor know immediately. In this case, the instructor will form the groups.
Your group will agree to use a single GitHub repo for this project. Every student must make commits to this repo for their part of the project. (So every member of the project must do their share of the code.)
This project has been split into three parts. Each part counts as a separate project. In the first two parts, your group designed and implemented a (non-functional) graphic user interface for the application, and added all required file operations.
In this part, you must implement the remaining operations of your search engine application: the index operations, and the searching.
You can download a Search Engine model solution, to play with it and inspect its user interface, but please keep in mind you should not copy that user interface; instead, invent a better, nicer-looking one.
Keep your code as simple as possible; worry about clever extras in version 2. (You can always add features later, time permitting.) If you start with a complex, hard-to-implement design, you may (will!) run out of time.
The inverted index is naturally a Map, from words (the keys) to a Set of objects (the values). Each of the objects represent a document and a location within that document, where the word was found. I called these objects Pairs, since they are a pair of numbers, but you can use any name for your classes. Note, you will need to be able to go from a document number to a file name, when you display the search results.
How your group is organized is up to the group members. Some suggestions include:
A working link to your group's GitHub repo used for this project and your individual peer ratings (see below). Your project's final version should receive a Git tag of “SearchEngine Project - Collections”, so I know which version to grade. Also, your group should submit one set of answers to the following questions:
Be sure the names of all group members are listed in the comments of all files. You should use GitHub's issue tracker, wiki, email, Facebook, Skype, or any means you wish, to communicate within your group. (It is suggested you hold short group meetings before or just after class.)
Grading will be done for the group's results and individual commits. Individuals in the group will have their grades adjusted by peer ratings. A rating of each team member's level of participation should be sent by individually by every member directly to the instructor. Be sure to include yourself in the ratings! The rating is a number from 0 (didn't participate at all), 1 (less than their fair share of the work), 2 (participated fully), or 3 (did more than their fair share of the work). Additional comments are allowed if you wish to elaborate.
Send your group ratings, including the GitHub link, as email to (preferred).