COP 2805C (Java II) Project
Building a Search Engine, Part II:  Persistent Data

Due: by the start of class on the date shown on the syllabus

Background:

Please read the background information and full project description from Search Engine Project, Part I.  In this project, you will implement the persistent data (the “model”) part of the project: the saving of data and the loading of data at the next start.  The persistent data contains the list of files used in the index and their information (such as their last modification date), and the index itself.

As with the previous project (the “view”), your model code will not do much at this stage.  You cannot load data into a collection, or read a collection and persist its data to storage, without first designing the data structures used by the “control” part of your code (the actual logic of your code).

What you can and should do is to have your model code provide a set of methods (an “API”) to do the necessary work on persistent data that the view and control code can use.  For now, those methods do not need to do anything.  (However, you can implement some of the code, say by allowing the maintenance view to add a file.  Then you can persist that list of file names, planning on refactoring that code later.)
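As a sketch only, a stub model API might look like the following (the class and method names here are illustrative, not required; note the file-list methods can already be functional, as suggested above):

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Stub persistence model; most method bodies will be filled in later. */
public class SearchEngineModel {
    private final List<Path> files = new ArrayList<>();

    // The file-list part can be functional already.
    public void addFile(Path file)    { files.add(file); }
    public void removeFile(Path file) { files.remove(file); }
    public List<Path> getFiles()      { return new ArrayList<>(files); }

    // Stubs for now; real bodies come once the data structures are chosen.
    public void loadData()            { /* stub: read index from storage */ }
    public void saveData()            { /* stub: write index to storage */ }
    public boolean hasChangedFiles()  { return false; /* stub */ }
}
```

The view's button handlers can call these methods immediately, even while most of them do nothing.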

First have your group discuss which persistence solution you will use: text files, structured text files such as XML or JSON files, or a database (and choose between embedded (my suggestion) or server).  If using a database, choose between the JDBC and JPA database APIs (I suggest JPA).  You can make this decision before knowing the details of the data structures used.

The inverted index can be stored in one or more file(s) or database tables, and that should be read whenever your application starts.  The file(s)/database should be updated (or recreated) either when you add, update, or remove documents from your set (of indexed documents), or just once when the program starts up and when it shuts down.  The file format or database schema is up to you, but should be kept simple for this project.

Before working on actual code, your group needs to decide on the data structures to be used for the file list and the inverted index.  (It would be easier if we had covered Java collections before now, but you cannot do everything first.)  Try to read the Java collections material before deciding.  If your group is lost or cannot agree, please see your instructor, who will be happy to provide guidance.  (Note there is a sample solution in the hints section, below.)  But you cannot read or write persistent data from collections until you decide on what collections you will use!

Your model code must provide an API that allows the following:

- add and remove files from the ones used to build the index;
- check whether any added files have changed or no longer exist;
- query the data (so the view code can display that information in the maintenance view);
- return the saved index data (used to rebuild the data structures at program launch); and
- save the in-memory data structures to persistent storage.

Notice the coupling between the model and the controller (the next project); you cannot write a proper API to save or load an unknown data structure.  Once the data structure(s) for the final part of the search engine have been decided upon, you can then change that part of the model's API.  (For example, an initial “public void loadData(List data)” will likely change List to the actual type of your collection, possibly a Map.)

This coupling is already annoying, but it also means if you change the data structures used in the controller later, you must also update your model code.  There are design patterns in Java to handle this, but you need to know more in order to split up the code better.  So we will not worry about the poor design your instructor has imposed, for this project.

As with the previous project, you need not implement all that functionality; the methods are allowed to be “stub methods” for now.  However, you do know enough to make the file names part of your API functional at this time.  (Hint: use a List.)

It should be easy to add and remove files (from the set of indexed files), so your API must have methods for those tasks.  When starting, your application should check if any of the files used have been changed or deleted since the application last saved the index, so the model must provide a method for doing that.  Of course your API must allow the in-memory index to be saved to storage, and also load the index from storage.  Your API may include additional methods, and possibly other items (such as enums), if your group decides that it's a good idea.
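One possible way to implement the “has this file changed or disappeared?” check is with NIO.2; the class and method names below are only a sketch, not a required design:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

public class FileChecker {
    /** Returns true if the file is gone, unreadable, or newer than when it was indexed. */
    public static boolean isStale(Path file, Instant indexedAt) {
        if (!Files.exists(file)) {
            return true;                            // deleted since last index
        }
        try {
            Instant modified = Files.getLastModifiedTime(file).toInstant();
            return modified.isAfter(indexedAt);     // changed since last index
        } catch (IOException e) {
            return true;                            // unreadable: treat as stale
        }
    }
}
```

The model would run this check over its saved file list at startup.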

Once you complete the API for the model, you can go back and finish the maintenance view, by making the button event handlers invoke the appropriate methods of the model.  (Naturally, you won't see working results since the model's methods are just stubs at this point.  As you implement functionality, the views will become real.)

In this project, all the “indexable” files are plain text.  (Note that with HTML or Word documents, you would need to extract a plain-text version before indexing.)  You are free to assume the system-default text file encoding, or assume UTF-8 encoding, for all files.  (It is possible, and not very difficult, to extract text from nearly any document type.  However, you will not have time to add such extra features, so don't even try.)

To keep things simple, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory at once.  So all you need to do is be able to read the index data at startup into memory, and write it back when updating the index or just when the program shuts down. 
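The read-at-startup, write-at-shutdown cycle can be as simple as the sketch below, which assumes a one-entry-per-line text format (the actual format is up to your group):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class IndexStore {
    /** Write each saved line (e.g., one index entry) to the store file. */
    public static void save(Path store, List<String> lines) throws IOException {
        Files.write(store, lines);
    }

    /** Read the lines back at the next startup. */
    public static List<String> load(Path store) throws IOException {
        return Files.readAllLines(store);
    }
}
```

Converting between these flat lines and your in-memory collections is the part that must wait for the data-structure decision.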

Don't forget the names (pathnames) of the files as well as their last modification time must be stored, in addition to the index data.  Whether you use files or a database, your file format(s) or database schema must be documented completely, so that someone else without access to your source code could use your file(s) or database correctly.
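For instance, a documented plain-text format might look like the sketch below (entirely hypothetical; your group's fields and layout may differ, but every field must be documented like this):

```
# file list: file-id, last-modified (epoch millis), pathname
0 1672531200000 /home/user/docs/a.txt
1 1672617600000 /home/user/docs/b.txt
# index: word, then file-id:position pairs
java 0:14 1:3
search 0:2
```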

If using XML format, you can define an XML schema for your file and have some tool such as Notepad++ validate your file format for you.  XML may have other benefits, but it isn't as simple as plain text files or even JSON files.  In any case, don't forget to include the list of file (path) names, along with the index itself, in your persistent data store.

Part II Requirements:

The class must work in groups of three or four students.  Any student not part of a group must let the instructor know immediately; in that case, the instructor will form the groups.

This project has been split into three parts.  Each part counts as a separate project.  In the first part, your group designed and implemented a (non-functional) graphic user interface for the application.  You can alter that interface for this project, but only if necessary.

Your group will agree to use a single GitHub repo for this project.  Every student must make commits to this repo for their part of the project.  (So every member of the project must do their share of the code.)  Please review the team organization and Git workflow from the previous project.

In this part, you must design the API for the persistence operations of your search engine application (the model), as described above.  That includes reading and updating your persistent data: the inverted index, plus any other information you need to store between runs of your application, such as the list of pathnames of the files that have been indexed.  You also need the ability to check whether the previously indexed files still exist or have been modified since they were last indexed.

The maintenance part of the user interface should allow users to select files for indexing, and to keep track of which files have been added to the index.  For each file, you need to keep the full pathname of the file as well as the file's last modification time.  Your code should correctly handle the user entering non-existent or unreadable files.  (How you handle such errors is up to your group; be sure to document the group's decisions somewhere.)
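One way to classify bad user input is sketched below; how (or whether) the error is then reported to the user is your group's decision, and the names here are only illustrative:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class FileValidator {
    public enum Status { OK, MISSING, UNREADABLE }

    /** Classify a user-entered path before adding it to the index. */
    public static Status check(Path file) {
        if (!Files.exists(file))     return Status.MISSING;
        if (!Files.isReadable(file)) return Status.UNREADABLE;
        return Status.OK;
    }
}
```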

You can download a Search Engine model solution, to play with it and inspect its user interface.  My solution keeps all persistent data in a single text file in the user's home directory, but you can certainly use a different persistence solution.

Hints:

Keep your code simple.  You can always add features later, time permitting.  If you start with a complex, hard-to-implement design, you may run out of time.

Have each team member work in their own branches.

Commit frequently.  If you hit a dead end with your code, you may need to roll back to an older version and work in a new direction.

Follow your group's agreed workflow, so the Git repo does not become corrupted.

Please review the hints from part I (the UI) of the project, for suggestions on team organization and Git workflow.

Possible data structures you can use.  In part III, you will implement the index operations, including Boolean searching, adding to the index, and removing files from the index.  (The index is a complex collection of collections.)  In a proper model-view-controller design, the in-memory data structures are part of the model, not the controller.  (We did not do that here because of constraints imposed by the course, such as the fact that you have not yet covered data structures (collections) in time to design the model properly.)

Because the format of the index and file list will affect the code used to read and write them to and from storage, your group should decide on the in-memory data structures to be used as soon as possible, and then go back and refactor your model API to use the correct collection types in the method arguments.  In the model solution v1.0, I used a List of FileItem objects for the list of indexed files; each FileItem contained a file's pathname and date it was read for the index.  (In version 2.0, I use a List of Path from the NIO.2: one less class to write!)

The index data itself is stored in a Map, using the indexed words as keys, and a Set of IndexData objects as the values.  Each IndexData object holds the id of the file containing the word and the position of the word in that document.  (The classes FileItem and IndexData were trivial to write, as their objects are just data records.)
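The structure described above could be declared roughly as follows; IndexData is shown as a small data-record class, and the field names are illustrative, not part of the model solution:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** One posting: which file a word appears in, and where. */
class IndexData {
    final int fileId;
    final int position;
    IndexData(int fileId, int position) { this.fileId = fileId; this.position = position; }
}

public class InvertedIndex {
    // word -> set of (fileId, position) postings
    private final Map<String, Set<IndexData>> index = new HashMap<>();

    public void add(String word, int fileId, int position) {
        index.computeIfAbsent(word, w -> new HashSet<>())
             .add(new IndexData(fileId, position));
    }

    public Set<IndexData> lookup(String word) {
        return index.getOrDefault(word, Set.of());
    }
}
```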

This is NOT the only (nor the best) way to represent the index or file list!  (For example, a List of int[2] arrays might be simpler than a Set of IndexData objects.)  Your team will discuss it and decide on the types of collections used for the next part of this project.  Only then can you implement the methods in the model to read and write the data.

To be turned in:

A working link to your group's GitHub repo used for this project.  Your project's final version should receive a Git tag of “SearchEngine Project - Data”, in the master branch, so I know which version to grade.

Be sure the names of all group members are listed in the comments of all files.  You can use GitHub's issue tracker and wiki, email, Facebook, Skype, or any means you wish to communicate within your group.  (It is suggested you hold short group meetings before or just after class, in addition to the in-class time for group project work.)

Grading will be done for the group's results and individual commits. 

Send the GitHub link once for your team, as email to (preferred).  (It is suggested to "CC" all group members when sending that email, so any feedback comments will be sent to all; I'll use "Reply All".)

Please see your syllabus for more information about projects, and about submitting projects.