COP 2805C (Java II) Project
Building a Search Engine, Part I:  Governance, Workflow, and UI

Due: by the start of class on the date shown on the syllabus

Background:

This is a group project.  It is also a real-world, complex application, split into several projects.  (This is the first project in this series.)  It is my hope that when done, you will understand files, collections, and the process of software development much more fully than you would by reading the textbook alone.

You are going to design, build, and test a scaled-down version of “Google Search”.  Rather than searching the Internet's files, you will only search local files added to your search engine's index.  Your search engine will allow an administrator to add, update, and remove files from the index.  Users will be able to enter search terms, and select between Boolean AND, OR, or PHRASE search.  The matching file names (if any) are then displayed in a list.

Before you start coding, you need to organize your group, assign responsibilities, decide on your group's workflow, communications, etc.  You also need to design the system architecture (the high-level design), so you can plan each part.  Finally, you will need to set up a Git repo for the search engine for your group, make initial commits (for example, a README file describing the policies your group has decided upon), and set up procedures for pushing commits, code reviews, and issue reporting.

Since few (if any) students in our class have experience with these issues, and we won't cover design until it is too late to start coding, I will provide guidelines and suggestions for all these aspects of the project.

You must keep your projects as simple as possible, with any “extras” saved for a later version (if any).  Keep the user interface plain and simple.  Time management will be critical for success: set milestones for your project, with due dates.  (For example, milestone 1 might be the basic user interface, milestone 2 might be the administrator's interface, and so on.)

Students should form their own groups of no more than four students per group (unless permission to form a larger group is given).  The ideal group size is three students.  It will be up to you to determine how to organize and manage your group, and when, where, how, and how often to meet.  (Some class time will be given for groups to meet and work together on their projects.)  No student is allowed to work alone.

Search Engine Project Proposal:

Build a search engine with a simple GUI that can do Boolean AND, OR, and PHRASE searches on a small set of text files.  The user should be able to select the type of search to do, and enter some search terms.  The results should be a list of file pathnames that match the search.  This should be a stand-alone application, but you can seek permission if your group wishes to try to create a Java EE web application instead.  (You may not succeed if you try, but you will certainly learn a lot!)

User Interfaces

Using your IDE's GUI Builder is allowed, but don't expect your instructor to read the resulting generated code.  Remember you will not be graded on the beauty of your user interface!  As long as it supports the requirements of the project, that is good enough.  My suggestion is to use this opportunity to learn to code a Java GUI without using any sort of GUI builder.  AWT is probably easiest, but your group is free to use Swing or JavaFX instead (or a web-based user interface if you choose to try Java EE).  (Don't expect your instructor to be able to help with JavaFX!)

In addition to the main user interface (for doing searching), you will need a separate administrator or maintenance interface to manage your application.  It should be easy to add and remove files (from the set of indexed files), and to regenerate the index anytime.  When starting, your application should check if any of the files have been changed or deleted since the application last saved the index.  If so, the administrator should be able to have the index updated with the modified file(s).
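The startup check described above could be sketched as follows.  This is only one possible approach, not a required design: the class name IndexFreshnessChecker, its Report class, and the idea of keeping the saved times in a map are all assumptions made for illustration.  It compares each file's current last-modified time with the time recorded when the index was last saved.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: compares saved modification times with the files on disk. */
final class IndexFreshnessChecker {

    /** Result of one startup check. */
    static final class Report {
        final List<String> modified = new ArrayList<>();
        final List<String> deleted = new ArrayList<>();

        /** True if the administrator should be offered an index update. */
        boolean isIndexStale() {
            return !modified.isEmpty() || !deleted.isEmpty();
        }
    }

    /**
     * @param savedTimes pathname -> last-modified time (epoch milliseconds)
     *                   recorded when the index was last saved
     */
    static Report check(Map<String, Long> savedTimes) throws IOException {
        Report report = new Report();
        for (Map.Entry<String, Long> entry : savedTimes.entrySet()) {
            Path file = Paths.get(entry.getKey());
            if (!Files.exists(file)) {
                report.deleted.add(entry.getKey());
            } else if (Files.getLastModifiedTime(file).toMillis() != entry.getValue()) {
                report.modified.add(entry.getKey());
            }
        }
        return report;
    }
}
```

The maintenance view could call check() once at startup and, if isIndexStale() returns true, list the affected files and offer to update the index.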

Note that with HTML, Word, or other types of documents, you would need to extract a plain text version before indexing.  That isn't hard, but the search engine is complex enough already.  For these projects, limit your search engine to only plain text files (including .txt, .html, and other text files).

The index must be stored on disk, so next time your application starts it can reload its data.  The index, list of files, and other data, can be stored in one or more file(s) or in a database.  The saved data should be read whenever your application starts.  The saved data should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents), or perhaps just when your application exits.  If you use files, the file formats are up to you; have a format that is fast and simple to load and store.

To keep things as simple as possible, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory at once.  (That's probably not the case for Google's data!)  All you need to do is be able to read the index data from disk at startup into memory, and write it back either when updating the index, or when your application shuts down.  Note, the names (pathnames) of the added files as well as their last modification time must be stored in addition to the index.
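As an illustration of reading the data at startup and writing it back later, here is a minimal sketch that keeps everything in one plain text file.  The class name IndexStore and the line format (an “F” line per indexed file, a “W” line per word, with tab-separated fields) are assumptions made up for this example; your group is free to design something entirely different.  The sketch also assumes pathnames contain no tab or newline characters, and it stores only which files contain each word (a PHRASE search would need word positions as well).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory copy of the persistent data, with load and save. */
final class IndexStore {
    /** pathname -> last-modified time in epoch milliseconds. */
    final Map<String, Long> files = new TreeMap<>();
    /** word -> pathnames of the files containing that word. */
    final Map<String, Set<String>> index = new TreeMap<>();

    /** Writes one "F" line per file, then one "W" line per word. */
    void save(Path target) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Long> e : files.entrySet()) {
                out.write("F\t" + e.getValue() + "\t" + e.getKey());
                out.newLine();
            }
            for (Map.Entry<String, Set<String>> e : index.entrySet()) {
                out.write("W\t" + e.getKey() + "\t" + String.join("\t", e.getValue()));
                out.newLine();
            }
        }
    }

    /** Reads a file written by save() back into memory. */
    static IndexStore load(Path source) throws IOException {
        IndexStore store = new IndexStore();
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts[0].equals("F") && parts.length == 3) {
                    store.files.put(parts[2], Long.parseLong(parts[1]));
                } else if (parts[0].equals("W") && parts.length >= 2) {
                    Set<String> paths = new TreeSet<>();
                    for (int i = 2; i < parts.length; i++) {
                        paths.add(parts[i]);
                    }
                    store.index.put(parts[1], paths);
                }
            }
        }
        return store;
    }
}
```

A format like this is easy to inspect in a text editor, which helps when debugging.  Whatever format you choose still has to be documented completely.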

It is your group's choice to use a single file or multiple files, and to choose a format for those files (plain text, XML, JSON, or anything else).  Or don't use files, and instead use a database to hold the persistent data.  In any case, your file format(s) (or database schema) must be documented completely, so that someone else without access to your source code could use your file(s)/database.

If using an XML file, you can define an XML schema for it and have some tool such as Notepad++ validate your file format for you.  XML may have other benefits, but it isn't as simple as using plain text files.  JSON might be the easiest format for storing and reading the index data.  In any case, don't forget to include the list of file pathnames and other data you decide is needed, along with the index itself.

Requirements:

The class must work in groups of three or four students per group, except with your instructor's permission.  Any student not part of a group must let the instructor know immediately.  In case students are not part of an approved group, the instructor will form the groups, or shift students from one group to another.

Your group will agree to use a single GitHub repo as the official source repository for this project.  It should use a Maven directory structure.  Once this master repo is created and configured (possibly with some content already such as a README file, pom.xml, or .gitignore), each member of the group should clone the repo for their use.  You can clone it on GitHub first (optional), then clone it to your personal computer.

Since different members may have different IDEs, use a Maven project template for your code.

Every student must make commits to this repo for their part of the project.  (So every member of the project must do their share of the code.)  That repo can be kept in any group member's GitHub account; you will need to have all other group members listed as collaborators on that one repo.

Your first task is to develop a group workflow plan.  You will then use that process to develop the first part of the project, the user interface.

In this project, we will follow the model-view-controller design pattern for the project organization.  This allows one to develop each part mostly independently from the other parts.
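To show how the three parts can stay independent, here is a minimal skeleton.  It is one common reading of the pattern, not a prescribed design; the names SearchModel, SearchView, and SearchController are hypothetical.  The view and the model never refer to each other, so each can be written (and replaced by a stub) separately.

```java
import java.util.List;

/** Model: owns the data (later, the index and the list of files). */
interface SearchModel {
    List<String> search(String terms);
}

/** View: displays results; knows nothing about how they were found. */
interface SearchView {
    void showResults(List<String> pathnames);
}

/** Controller: reacts to user actions, asks the model, updates the view. */
final class SearchController {
    private final SearchModel model;
    private final SearchView view;

    SearchController(SearchModel model, SearchView view) {
        this.model = model;
        this.view = view;
    }

    /** Called from the view's event handler when the user starts a search. */
    void onSearchRequested(String terms) {
        view.showResults(model.search(terms));
    }
}
```

With this arrangement, one team member can build the GUI against a model that returns canned data while another works on the real model.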

Team Organization:

How your group is organized is up to the group members.  Some suggestions include:

Once you all have agreed on your team's organization and work procedures, write them down.  That will need to be submitted for this project, along with a list of team members, the group name, and the URL of the official Git repo on GitHub.

Git Workflow:

In the previous project, we used a very simple workflow without any branches.  That is unlikely to work for most real-world projects.  While your group can decide on any workflow you wish, here's a suggestion:

  1. Read Comparing Git Workflows, then discuss as a group which model you want.
  2. After initial GitHub repo creation, you should clone that repo on your PC and create a Java project for it.
  3. Next, create and check out a personal branch for your work.  While you can create as many branches as you like while you develop and try out different ideas (or work on different issues), have one as your personal main development branch.  (Do not work in the master branch.)

    To create the initial repo, create a completely empty repo on GitHub.  Next, your team leader (or other designated team member) creates a new Maven project using their preferred IDE.  Tweak that if desired, then make the initial Git commit locally.  Finally, you can push your commit to the GitHub repo created earlier.  When done, the other team members can clone it on their computers.

  4. Now do some work, committing often in your branch.  Once your code is ready for code review (or if you need help from your teammates), you can push your branch up to the GitHub project's repo.  Now others can see your code.  (If your code is long, you should consider making a pull request on GitHub, which may be easier for your teammates to review.)
  5. If your code needs some work or you need to incorporate suggestions from the code review, go back to work as before.  When the final version is approved, you can merge in your branch's commits into the master branch:
    1. Before attempting a merge, do a pull from GitHub to your repo.  Do not forget this step!  Your local copy of the master branch may not contain the latest commits unless you do this.
    2. Make sure you switch to your (local) master branch before you do anything else.  Then merge in your work from your personal branch.
    3. Once that is successful, push the revised master branch up to GitHub.

If your team organization is that multiple members work on the same code and the team votes for the final version to turn in, you need a different workflow to ensure every team member's commits are in the master branch, which is the only branch that will be examined for grading:

  1. No commits are made on the master branch until the final version has been chosen by your group.
  2. The team lead then merges the HEAD of each member's personal branch into master, starting with those that did not win.  End by merging the final version selected by your team.  Note that these merges are easy to do, since you always ignore any conflicts by using the version from the branch you are merging in.

(When you work on projects from different organizations, there are many different workflows that are popular.  The ones suggested here may not (and probably won't) be the same as the workflow you will need to use then.)

Develop Stub User Interfaces:

In this part of the project, your group must implement a non-functional (that means looks good but doesn't do a thing) graphical user interface for the application.  (The “view”.)  The main (default) user interface must support searching and displaying results.  It should have various other features, such as an “About...” menu or button, a way to quit the application (if a stand-alone application; if your group creates a web application, there is no need to quit), and a way to get to the administrator/maintenance view.

The maintenance/administrator view must allow the user to perform various administration operations: view the list of indexed file names, add files to the index, remove files from the index, and update the index (when files have been modified since they were indexed).

The user interface should be complete, but none of the functionality needs to be implemented at this time.  You should implement stub methods for the functionality not yet implemented, and invoke them from your event handlers.  The stub methods can either return “canned” (fake but realistic) data, or throw an UnsupportedOperationException.  The only button that needs to do anything is the one used to switch to the maintenance view.
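As a sketch of the two kinds of stub method, consider the following.  The class name SearchEngineStub, the method names, and the canned pathnames are all invented for illustration; the exception used is Java's standard java.lang.UnsupportedOperationException.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for functionality that later projects will implement. */
final class SearchEngineStub {

    /** Stub: returns the same canned results no matter what is searched for. */
    List<String> search(String searchType, String terms) {
        return Arrays.asList("docs/alpha.txt", "docs/beta.txt");
    }

    /** Stub: reports clearly that the operation is not available yet. */
    void addFile(String pathname) {
        throw new UnsupportedOperationException("addFile is not implemented yet");
    }
}
```

An event handler can call search() and display the canned list right away.  For a stub like addFile(), the handler might catch the exception and show a “not implemented yet” message rather than letting the application crash.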

Since the user interfaces don't do anything, there is nothing to test yet.  However, you must create a test class with at least one test method (it can just return success if you wish).  I suggest your group agree to use JUnit 4 style tests for now.

You can download a Search Engine model solution to play with it and inspect its user interface.  Please keep in mind you do not have to copy that user interface; instead, invent a better one.

Preview of next projects:

In part II (data), your group will implement the file or database operations of the search engine (the “model”).  That includes reading and writing the inverted index and file list from/to storage, reading other files a word at a time, and storing file metadata such as last modified time.  This does not include creating the index nor implementing any search operations.

In part III (collections), you will implement the inverted index operations, including Boolean searching, adding to the index, and removing files from the index.  (The “controller”.)  (The index is a complex data structure.)  At this point, your search engine should be fully functional.
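To give a rough idea of what an inverted index is, here is a small sketch that maps each word to the files containing it, along with the word's positions in each file (the positions are needed for PHRASE searches).  This is an illustration under simplifying assumptions, not the required design: the class and method names are hypothetical, words are taken to be runs of ASCII letters and digits, and no attention is paid to efficiency.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical positional inverted index supporting AND, OR, and PHRASE. */
final class InvertedIndex {
    // word -> (pathname -> 0-based positions of that word in the file)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    void addFile(String pathname, String text) {
        List<String> words = tokenize(text);
        for (int pos = 0; pos < words.size(); pos++) {
            postings.computeIfAbsent(words.get(pos), w -> new HashMap<>())
                    .computeIfAbsent(pathname, p -> new ArrayList<>())
                    .add(pos);
        }
    }

    void removeFile(String pathname) {
        for (Map<String, List<Integer>> perFile : postings.values()) {
            perFile.remove(pathname);
        }
    }

    /** OR: files containing at least one of the words. */
    Set<String> searchOr(String query) {
        Set<String> result = new TreeSet<>();
        for (String word : tokenize(query)) {
            result.addAll(filesWith(word));
        }
        return result;
    }

    /** AND: files containing every one of the words. */
    Set<String> searchAnd(String query) {
        List<String> words = tokenize(query);
        if (words.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(filesWith(words.get(0)));
        for (String word : words) {
            result.retainAll(filesWith(word));
        }
        return result;
    }

    /** PHRASE: files containing the words adjacent to each other, in order. */
    Set<String> searchPhrase(String query) {
        List<String> words = tokenize(query);
        Set<String> result = new TreeSet<>();
        for (String file : searchAnd(query)) {  // candidates contain every word
            for (int start : postings.get(words.get(0)).get(file)) {
                if (phraseStartsAt(words, file, start)) {
                    result.add(file);
                    break;
                }
            }
        }
        return result;
    }

    private boolean phraseStartsAt(List<String> words, String file, int start) {
        for (int i = 1; i < words.size(); i++) {
            if (!postings.get(words.get(i)).get(file).contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    private Set<String> filesWith(String word) {
        Map<String, List<Integer>> perFile = postings.get(word);
        return perFile == null ? Collections.<String>emptySet() : perFile.keySet();
    }

    /** Lower-cases the text and splits it into runs of letters and digits. */
    private static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }
}
```

Note how AND and OR reduce to set intersection and union on the sets of pathnames, which is why the collections classes matter so much in part III.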

In a real-world project, some additional features would be required.  You would likely require someone to authenticate before allowing them to use the maintenance view.  All such operations (and any errors) should be logged.  Commonly used search terms should be remembered to provide business intelligence.  The code should be instrumented to allow measurement and reporting of performance.  The application might need to support multiple languages.  All the code would need auditing for security, license compliance, possibly ISO 9000 compliance, and quality assurance.  Finally, a way to deploy the code and to provide updates and other support should be planned and implemented.

Hints:

Keep your code simple.  You can always add features later, time permitting.  If you start with a complex, hard-to-implement design, you may run out of time.

Your team's name is a good choice for a package name.  However, your project will have multiple parts.  It might help to use a sub-package for each part; for example, if your team name is “PifflMinions”, then you could put the UI code in a package named “view.pifflminions”.

(The style recommended above is known as decomposition by structure.  In many cases, a better approach is to make one package for each major feature of your application, each containing all the classes related to that feature, so each package gets model, view, and controller classes.  For our project, organizing packages by structure might be simpler for you.)

Keep in mind the requirements for the remaining parts of these projects, as that will affect what is needed in your user interface and code skeleton.

Be sure the names of all group members are listed in the comments of all files.  You should use GitHub's issue tracker, wiki, email, Facebook, Skype, or any means you wish, to communicate within your group.  (It is suggested you hold short group meetings before or just after class.)

Grading will be done for the group's results and individual commits.  Individuals in the group will have their grades adjusted by other documentation, such as issues or wiki contributions.  (This is useful since some group members may contribute more with design and code reviews, than they do by making commits.)  If any group member wants to explain a lack of commits, add your comments in your group's wiki site.

See the AWT and Swing tutorials from COP-2800, also the GUI Lecture notes (PDF).  (There are no doubt many better tutorials on the Internet, if you look.)

To be Turned in:

  1. Your team's organization and work procedures, along with a list of team members, the group's team name, and the URL of the group's official Git repo on GitHub.
  2. Your project's final version should receive a Git tag of “Proj 3 Design and UI”, so I know which version to grade.

Send your documents including the correct GitHub link, as email to (preferred).

Please see your syllabus for more information about projects, and about submitting projects.