COP 2805 (Java II) Project
Building a Search Engine, Part II:  Persistent Data

Due: by the start of class on the date shown on the syllabus

Background:

Please read the background information and full project description from Search Engine Project, Part I.  In this project, you will implement the persistent data (the “model”) part of the project: the saving of data and the loading of data at the next start.  The persistent data contains the list of files used in the index, and the index itself.

First, have your group discuss which persistence solution you will use: text files, XML or JSON files, or a database.  If using a database, choose between embedded (my suggestion) and server, and choose between the JDBC and JPA database APIs (I suggest JPA).  You can make this decision before knowing the details of the data structures used.

Before working on actual code, your group needs to decide on the data structures to be used for the file list and the inverted index.  (It would be easier if we had covered Java collections before now, but you cannot do everything first.)  Try to read the Java collections material before deciding.  If your group is lost or cannot agree, please see your instructor, who will be happy to provide guidance.  (Note there is a sample solution in the hints section, below.)  But you cannot read or write persistent data from collections until you decide on what collections you will use!

It should be easy to add and remove files (from the set of indexed files).  When starting, your application should check if any of the files used have been changed or deleted since the application last saved the index.  If so, the “admin” user should be able to have the inverted index file(s) updated, from the maintenance interface.
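The startup check described above can be sketched as follows.  This is only one possible approach, and it assumes your persistent data records each file's last modification time in milliseconds; the class FileStatusChecker and its Status values are hypothetical names, not part of the assignment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the names here are illustrative only.
class FileStatusChecker {
    enum Status { UNCHANGED, MODIFIED, MISSING }

    // Compares a file's current state with the modification time
    // (milliseconds since the epoch) saved when it was last indexed.
    static Status check(Path file, long savedMillis) {
        if (!Files.isRegularFile(file)) {
            return Status.MISSING;
        }
        try {
            long current = Files.getLastModifiedTime(file).toMillis();
            return current == savedMillis ? Status.UNCHANGED : Status.MODIFIED;
        } catch (IOException e) {
            // The file vanished or became inaccessible between the two calls.
            return Status.MISSING;
        }
    }
}
```

At startup you could call check once for every entry in your saved file list and show the results in the maintenance interface, so the admin user can decide whether to re-index.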

(Note that with HTML or Word documents, you would need to extract a plain text version before indexing.)  In this project, all the “indexable” files are plain text.  You are free to assume the system-default text file encoding, or assume UTF-8 encoding, for all files.

The inverted index can be stored in one or more files, which should be read whenever your application starts.  The file(s) should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents).  The file format is up to you, but it should be fast and simple to search.  However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory.  All you need to do is read the index data from a file into memory at startup, and write it back when updating the index.  Don't forget that the names (pathnames) of the files, as well as their last modification times, must also be stored.  It is your choice to use a single file or multiple files, in plain text, JSON, XML, or any format your group chooses, to hold the persistent data.  If your group wants, they can use any DBMS.  (In that case, I suggest using the JavaDB included with the JDK, as an embedded database.)  In any case, your file format(s) or database schema must be documented completely, so that someone else, without access to your source code, could use your file(s) or database correctly.
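To make the "read at startup, write back on update" idea concrete, here is a minimal sketch of a single plain-text file format.  Everything in it is an assumption made for illustration: the class name IndexStore, the tab-separated layout, and the use of a file's position in the list as its id.  It is not the format used by the model solution.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical format, one record per line:
//   F <TAB> lastModifiedMillis <TAB> pathname
//   W <TAB> word <TAB> fileId:position <TAB> fileId:position ...
class IndexStore {
    // pathname -> last modification time; insertion order gives the file id
    final Map<String, Long> files = new LinkedHashMap<>();
    // word -> postings, each posting being {fileId, position}
    final Map<String, List<int[]>> index = new TreeMap<>();

    void save(Writer out) {
        PrintWriter pw = new PrintWriter(out);
        for (Map.Entry<String, Long> e : files.entrySet()) {
            pw.println("F\t" + e.getValue() + "\t" + e.getKey());
        }
        for (Map.Entry<String, List<int[]>> e : index.entrySet()) {
            StringBuilder sb = new StringBuilder("W\t").append(e.getKey());
            for (int[] posting : e.getValue()) {
                sb.append('\t').append(posting[0]).append(':').append(posting[1]);
            }
            pw.println(sb);
        }
        pw.flush();
    }

    // No error recovery: a malformed line makes this method throw.
    static IndexStore load(Reader in) throws IOException {
        IndexStore store = new IndexStore();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("F\t")) {
                // Limit of 3 keeps any tabs inside the pathname intact.
                String[] parts = line.split("\t", 3);
                store.files.put(parts[2], Long.parseLong(parts[1]));
            } else if (line.startsWith("W\t")) {
                String[] parts = line.split("\t");
                List<int[]> postings = new ArrayList<>();
                for (int i = 2; i < parts.length; i++) {
                    String[] pair = parts[i].split(":");
                    postings.add(new int[] {
                        Integer.parseInt(pair[0]), Integer.parseInt(pair[1]) });
                }
                store.index.put(parts[1], postings);
            }
        }
        return store;
    }
}
```

Because save and load take a Writer and a Reader, you can pass them the result of Files.newBufferedWriter or Files.newBufferedReader for a real file, or a StringWriter and StringReader when testing.  Note that this sketch cannot handle pathnames that contain line breaks; whatever format you choose, document limits like that one.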

If using XML format, you can define an XML schema for your file and have some tool such as Notepad++ validate your file format for you.  XML may have other benefits, but it isn't as simple as plain text files or even JSON files.  In any case, don't forget to include the list of file (path) names, along with the index itself, in your persistent data store.
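If your group does choose XML, validation can also be done from Java using the javax.xml.validation package that ships with the JDK.  The sketch below is only an illustration; the class name XmlCheck is hypothetical, and the method simply reports false for any problem (including a broken schema) rather than explaining what went wrong.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper; the name is illustrative only.
class XmlCheck {
    // Returns true if the XML document text is valid against the XSD text.
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            // Either the schema or the document was rejected.
            return false;
        }
    }
}
```

To validate files instead of strings, StreamSource also has a constructor that takes a java.io.File.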

Part II Requirements:

The class must work in groups of three or four students per group.  Any student not part of a group must let the instructor know immediately.  In this case, the instructor will form the groups.

This project has been split into three parts.  Each part counts as a separate project.  In the first part, your group designed and implemented a (non-functional) graphical user interface for the application.  If necessary, you can alter that for this project, but only if necessary.

Your group will agree to use a single GitHub repo for this project.  Every student must make commits to this repo for their part of the project.  (So every member of the group must do their share of the code.)  Please review the team organization and Git workflow from the previous project.

In this part, you must implement the file operations of your search engine application (the model).  That includes reading and updating your persistent data (that is, the inverted index as well as any other information you need to store between runs of your application, such as the list of files (their pathnames) that have been indexed).  The main file operation is reading each file to be indexed a “word” at a time; you also need to check whether the previously indexed files still exist or have been modified since they were last indexed.
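Reading a “word” at a time is easy with java.util.Scanner.  The sketch below assumes that a word is any run of letters or digits and that words are lowercased before indexing; your group may define “word” differently.  The class name WordReader is hypothetical.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper; the name is illustrative only.
class WordReader {
    // Assumption: a "word" is a maximal run of letters or digits.
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        try (Scanner in = new Scanner(source)) {
            // Anything that is not a letter or a decimal digit separates words.
            in.useDelimiter("[^\\p{L}\\p{Nd}]+");
            while (in.hasNext()) {
                words.add(in.next().toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }
}
```

For a real file you could pass in Files.newBufferedReader(path, StandardCharsets.UTF_8).  The position of each word in the returned list can serve as the word's position in the document.  With this definition, a contraction such as "It's" is read as the two words "it" and "s"; decide as a group whether that is acceptable.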

The maintenance part of the user interface should allow users to select files for indexing, and to keep track of which files have been added to the index.  For each file, you need to keep the full pathname of the file as well as the file's last modification time.  Your code should correctly handle the user entering nonexistent or unreadable files.  (How you handle such errors is up to your group; make sure you document the group's decisions someplace.)
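One way to handle bad file names is to check each name before it is added to the file list, and report any problem to the user.  The following is a sketch only; the class FileValidator and the wording of its messages are hypothetical, and your group may prefer a different policy (for example, silently skipping bad files).

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper; the name and messages are illustrative only.
class FileValidator {
    // Returns an error message, or an empty Optional if the file can be indexed.
    static Optional<String> problemWith(String name) {
        Path path;
        try {
            path = Paths.get(name);
        } catch (InvalidPathException e) {
            return Optional.of("Not a valid pathname: " + name);
        }
        if (!Files.exists(path)) {
            return Optional.of("No such file: " + path);
        }
        if (!Files.isRegularFile(path)) {
            return Optional.of("Not a regular file: " + path);
        }
        if (!Files.isReadable(path)) {
            return Optional.of("File is not readable: " + path);
        }
        return Optional.empty();
    }
}
```

Even after these checks pass, a file can still disappear or become unreadable before you open it, so the code that actually reads the file must still handle IOException.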

You can download a Search Engine model solution, to play with it and inspect its user interface.  My solution keeps all persistent data in a single text file in the user's home directory, but you can certainly use a different persistence solution.

Hints:

Keep your code simple.  You can always add features later, time permitting.  If you start with a complex, hard-to-implement design, you may run out of time.

Commit frequently.  If you hit a dead end with your code, you may need to roll it back to an older version and work in a new direction.

Please review the hints from part I (the UI) of the project, for suggestions on team organization and Git workflow.

Possible Data Structures you can use.  In part III, you will implement the index operations, including Boolean searching, adding to the index, and removing files from the index.  (The index is a complex collection of collections.)  Because the format of the index and file list will affect the code used to read and write them to and from storage, your group must decide early on the in-memory data structures to be used.  In the model solution, I used a List of FileItem objects for the list of indexed files; each FileItem contained a file's pathname and the date it was read for the index.  The index data itself is stored in a Map, using the indexed words as keys, and a Set of IndexData objects as the values.  Each IndexData object holds the id of the file containing the word and the position of the word in that document.  (The classes FileItem and IndexData were trivial to write.)
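The structures just described might look something like the sketch below.  This is not the model solution's code; the field names, the InvertedIndex wrapper class, and the choice to use a file's position in the List as its id are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// One indexed file: its pathname and the time it was read for the index.
class FileItem {
    final String pathname;
    final long lastModified;

    FileItem(String pathname, long lastModified) {
        this.pathname = pathname;
        this.lastModified = lastModified;
    }
}

// One occurrence of a word: which file, and where in that file.
final class IndexData {
    final int fileId;
    final int position;

    IndexData(int fileId, int position) {
        this.fileId = fileId;
        this.position = position;
    }

    // equals and hashCode are needed so a Set can detect duplicates.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IndexData)) {
            return false;
        }
        IndexData other = (IndexData) o;
        return fileId == other.fileId && position == other.position;
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileId, position);
    }
}

// Hypothetical wrapper holding both collections.
class InvertedIndex {
    final List<FileItem> files = new ArrayList<>();
    final Map<String, Set<IndexData>> index = new HashMap<>();

    // Assumption: a file's id is its position in the list.
    int addFile(String pathname, long lastModified) {
        files.add(new FileItem(pathname, lastModified));
        return files.size() - 1;
    }

    void addWord(String word, int fileId, int position) {
        index.computeIfAbsent(word, k -> new HashSet<>())
             .add(new IndexData(fileId, position));
    }
}
```

One consequence of using list positions as ids is that removing a file from the middle of the list shifts the ids of the files after it, so the index would have to be rebuilt or renumbered.  Keep that in mind when choosing how your group assigns file ids.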

This is NOT the only, or the best, way to represent the index or file list!  (For example, a List of int[2] arrays might be simpler than a Set of IndexData objects.)  Your team should discuss it and decide on the types of collections used.  Only then can you implement the methods to read and write the data.

To be turned in:

A working link to your group's GitHub repo used for this project and your individual peer ratings (see below).  Your project's final version should receive a Git tag of “SearchEngine Project - Data”, so I know which version to grade.

Be sure the names of all group members are listed in the comments of all files.  You can use GitHub's issue tracker and wiki, email, Facebook, Skype, or any means you wish to communicate within your group.  (It is suggested you hold short group meetings before or just after class.)

Grading will be done for the group's results and individual commits.  Individuals in the group will have their grades adjusted by peer ratings.  A rating of each team member's level of participation should be sent individually by every member, directly to the instructor.  Be sure to include yourself in the ratings!  Each rating is a number from 0 to 3: 0 (didn't participate at all), 1 (less than their fair share of the work), 2 (participated fully), or 3 (did more than their fair share of the work).  Additional comments are allowed if you wish to elaborate.

Send your group ratings, including the GitHub link, as email to (preferred).

Please see your syllabus for more information about projects, and about submitting projects.