COP-2805C Project - Building a Search Engine Part I (the workflow and UI)

This is a group project. It is also a real-world, complex application, split into several projects. (This is the first project in this series.) It is my hope that when done, you will understand files, collections, and the process of software development much more fully than you would by reading the textbook alone.

You are going to design, build, and test a scaled-down version of “Google Search”. Rather than searching the Internet's files, you will only search local files added to your search engine's index. Your search engine will allow an administrator to add, update, and remove files from the index. Users will be able to enter search terms, and select between Boolean AND, OR, or PHRASE search. The matching file names (if any) are then displayed in a list.

Before you start coding, you need to organize your group, assign responsibilities, decide on your group's workflow, communications, etc. You also need to design the system architecture (the high-level design), so you can plan each part. Finally, you will need to set up a Git repo for the search engine for your group, make initial commits (for example, a README file describing the policies your group has decided upon), and set up procedures for pushing commits, code reviews, and issue reporting.

Since few if any students in our class have much experience with these issues, and we won't cover design until it is too late to start coding, I will provide guidelines and suggestions for all of these aspects in this project.

You must keep your projects as simple as possible, with any “extras” saved for a later version (if any). Keep the user interface plain and simple. Time management will be critical for success: set milestones for your project, with due dates. (For example, milestone 1 might be the basic user interface, milestone 2 might be the administrators' interface, and so on.)

Students should form their own groups of no more than four students per group (unless permission to form a larger group is given). The ideal group size is three students. It will be up to you to determine how to organize and manage your group, and when, where, how, and how often to meet. (Some class time will be given for groups to meet and work together on their projects.) No student is allowed to work alone.

It is difficult to know who to work with. In the past, students could meet and get to know a little about each other during class, but we don't have that option available. Instead, class time will be reserved for students to meet and greet. On top of that, students can meet outside of class on the class discussion board or using other social media sites. You only have about a week to decide on your team. Any students not in a team will be placed in one randomly by your instructor.

Build a search engine with a simple GUI that can do AND, OR, and PHRASE Boolean searches on a small set of text files. The user should be able to choose the type of search to do and to enter some search terms. The results should be a list of file pathnames that match the search. This should be a stand-alone application, but you can seek permission if your group wishes to try to create a Java EE web application instead. (You may not succeed if you try, but you will certainly learn a lot!)

User Interfaces:

Using your IDE's GUI Builder is allowed, but don't expect your instructor to read the resulting generated code. Remember you will not be graded on the beauty of your user interface! As long as it supports the requirements of the project, that is good enough. My suggestion is to use this opportunity to learn to code a Java GUI without using any sort of GUI builder. AWT is probably easiest, but your group is free to use Swing or Oracle's JavaFX instead (or a web-based user interface if you choose to try Java EE). (Don't expect your instructor to be able to help with JavaFX!)
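To give a feel for how little code a hand-written AWT window needs, here is a minimal sketch of a search window. The class name, the layout, and the placeholder result text are all assumptions made for illustration; your group's design will (and should) differ, and it still needs the About, Quit, and maintenance features.

```java
import java.awt.BorderLayout;
import java.awt.Button;
import java.awt.Choice;
import java.awt.FlowLayout;
import java.awt.Frame;
import java.awt.List;
import java.awt.Panel;
import java.awt.TextField;
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;

/** Hypothetical hand-coded search window; nothing here is a required design. */
final class SearchWindowSketch {

    static final String[] SEARCH_TYPES = { "AND", "OR", "PHRASE" };

    static Frame build() {
        Frame frame = new Frame("Search Engine");

        // Top row: search terms, search type, and a Search button.
        TextField terms = new TextField(30);
        Choice type = new Choice();
        for (String t : SEARCH_TYPES) {
            type.add(t);
        }
        Button search = new Button("Search");
        Panel top = new Panel(new FlowLayout());
        top.add(terms);
        top.add(type);
        top.add(search);

        // Center: the list that will eventually show matching pathnames.
        List results = new List(10);
        search.addActionListener(e -> {
            results.removeAll();
            results.add("(search not implemented yet)");
        });

        frame.add(top, BorderLayout.NORTH);
        frame.add(results, BorderLayout.CENTER);

        // Closing the window releases it; a real application would also exit.
        frame.addWindowListener(new WindowAdapter() {
            @Override
            public void windowClosing(WindowEvent e) {
                frame.dispose();
            }
        });
        frame.pack();
        return frame;
    }

    public static void main(String[] args) {
        build().setVisible(true);
    }
}
```

Running `main` on a desktop system should show a small window with a text field, a drop-down, a button, and an empty results list.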

In addition to the main user interface (for doing searching), you will need a separate administrator or maintenance interface to manage your application. It should be easy to add and to remove files (from the set of indexed files), and to regenerate the index anytime. When starting, your application should check if any of the files have been changed or deleted since the application last saved the index. If so, the administrator should be able to have the index updated with the modified file(s).
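The startup check described above can be sketched as follows. This is only one possible approach: it assumes your application keeps a map from each indexed file's path to the last-modified time recorded when that file was indexed. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports indexed files that changed or vanished since the index was saved. */
final class StaleFileChecker {

    /**
     * @param indexedFiles map from file path to the last-modified time (epoch milliseconds)
     *                     recorded when the file was indexed (an assumed data layout)
     * @return the paths whose file is missing or has a different modification time
     */
    static List<Path> findStale(Map<Path, Long> indexedFiles) throws IOException {
        List<Path> stale = new ArrayList<>();
        for (Map.Entry<Path, Long> entry : indexedFiles.entrySet()) {
            Path file = entry.getKey();
            if (!Files.exists(file)) {
                stale.add(file);   // deleted since it was indexed
            } else if (Files.getLastModifiedTime(file).toMillis() != entry.getValue()) {
                stale.add(file);   // modified since it was indexed
            }
        }
        return stale;
    }
}
```

The administrator view could show the returned list and offer to update the index for those files.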

Note that with HTML, Word, or other types of documents, you would need to extract a plain text version before indexing. That isn't hard, but the search engine is complex enough already. For these projects, limit your search engine to only plain text files (including .txt, .html, and other text files).

The index must be stored on disk, so next time your application starts it can reload its data. The index, list of files, and other data, can be stored in one or more file(s) or in a database. The saved data should be read whenever your application starts. The saved data should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents), or perhaps just when your application exits. If you use files, the file formats are up to you; have a format that is fast and simple to load and store.

To keep things as simple as possible, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory at once. (That's probably not the case for Google's data!) All you need to do is be able to read the index data from disk at startup into memory, and write it back either when updating the index, or when your application shuts down. Note, the names (pathnames) of the added files as well as their last modification time must be stored in addition to the index.

It is your group's choice to use a single file or multiple files, and to choose a format for those files (plain text, XML, JSON, or anything else). Or don't use files, and instead use a database to hold the persistent data. In any case, your file format(s) (or database schema) must be documented completely, so that someone else without access to your source code could use your file(s)/database.

If you use an XML file, you can define an XML schema for it and have some tool such as Notepad++ validate your file format for you. XML may have other benefits, but it isn't as simple as using plain text files. JSON might be the easiest format for storing and reading the index data. In any case, don't forget to include the list of file pathnames and other data you decide is needed, along with the index itself.
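As an illustration of the plain-text option, here is a sketch that saves and reloads just the list of indexed files. The format (one line per file, holding the last-modified time in milliseconds, a tab, then the pathname) and the class name are assumptions chosen for simplicity, not a required design; the index itself would need its own format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical store: one line per indexed file, "<lastModifiedMillis><TAB><pathname>". */
final class FileListStore {

    /** Writes the whole file list, replacing any earlier contents. */
    static void save(Path storeFile, Map<String, Long> files) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : files.entrySet()) {
            lines.add(e.getValue() + "\t" + e.getKey());
        }
        Files.write(storeFile, lines, StandardCharsets.UTF_8);
    }

    /** Reads the file list back; returns an empty map if nothing was saved yet. */
    static Map<String, Long> load(Path storeFile) throws IOException {
        Map<String, Long> files = new LinkedHashMap<>();
        if (!Files.exists(storeFile)) {
            return files;   // first run: no saved data
        }
        for (String line : Files.readAllLines(storeFile, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue;   // skip blank or malformed lines
            }
            files.put(line.substring(tab + 1), Long.parseLong(line.substring(0, tab)));
        }
        return files;
    }
}
```

Putting the number first means a pathname containing a tab still loads correctly, although a pathname containing a newline would not. Limits like that are the kind of detail your format documentation should state.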

The class must work in groups of three or four students per group, except with your instructor's permission. Any student not part of a group must let the instructor know immediately, at least a week before the project is due (two weeks would be better). In case students are not part of an approved group, the instructor will form the groups, or shift students from one group to another.

Your group will agree to use a single GitHub repo as the official source repository for this project. It should use a Maven directory structure. Once this main repo is created and configured (possibly with some content already such as a README file, pom.xml, or .gitignore), each member of the group should clone the repo for their use. You can fork it on GitHub first (optional), then clone the fork to your personal computer.

Since different members may have different IDEs, use a Maven project template for your code. All team members should also adjust their cloned project's settings (such as line endings and encoding) to match the team's choices.

Every student must make commits to this repo for their part of the project. (So every member of the project must do their share of the code.) That repo can be kept in any group member's GitHub account; you will need to have all other group members listed as collaborators on that one repo.

Your first task is to develop a group workflow plan. You will then use that process to develop the first part of the project, the user interface.

In this project, we will follow the model-view-controller design pattern for the project organization. This allows one to develop each part mostly independently from the other parts.
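One way to picture that separation is the rough skeleton below. All of the interface, class, and method names are hypothetical; your group must agree on its own. The point is only that the view and the model never refer to each other directly, so each can be written and replaced separately.

```java
import java.util.List;

/** Model: hypothetical access to the stored data. */
interface IndexStore {
    List<String> loadIndexedPathnames();
}

/** View: hypothetical hook the GUI implements to display results. */
interface ResultsView {
    void showResults(List<String> pathnames);
}

/** Controller: sits between the two; the real search logic comes later. */
final class SearchController {
    private final IndexStore store;
    private final ResultsView view;

    SearchController(IndexStore store, ResultsView view) {
        this.store = store;
        this.view = view;
    }

    /** Placeholder: ignores the terms and shows every indexed pathname. */
    void onSearch(String terms) {
        view.showResults(store.loadIndexedPathnames());
    }
}
```

With this shape, the GUI's Search button handler only needs to call `onSearch`, and a fake `IndexStore` can stand in until the storage code exists.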

How your group is organized is up to the group members. Some suggestions include:

Pick a group name that can be used as a legal Java package name.
Create a new project (module) on GitHub.com, for your group's files. Name the new project after your group (so the other groups in our class won't use the same name). When you create the repo on GitHub, include a readme.md file; doing so will be helpful when cloning the repo.
Decide on a project structure: pick one member as the project lead. (You should pick different project leads for subsequent projects, so all your team members have a chance.) The project lead can break deadlocks when your group cannot agree on something, arrange group meetings, etc.
After meeting once or twice, your team should agree on some aspects of the project organization: text encoding and line endings for files, and project organization. Your group must agree on package and class (and thus file) names. The project lead can update the readme.md file in your team repo with that information, or even create skeleton classes.
One possibility is for each member of the group to implement the project independently (once your group agrees on package and class names). Then your group can compare the results, and merge the best ideas of each into your group's main repo. This organization works well when team members are having difficulty agreeing on the code design.
Another possibility is to assign different classes to each member (or possibly just different methods in one class), with a code review by all group members, before the project due date. Only reviewed and approved code will be pushed into the main branch of the group's official repo.
Still another possibility is to define what must be done and add an issue for each. Then team members can take possession of one or more issues, or several team members can work on a single issue and work out a best solution.
Don't forget the branch features of the versioning software; once the project has been initially committed, group members can create their own branches for development. (You should have a main branch for the final submission version, containing commit objects from every team member.)
It is also possible to have different group members responsible for different classes, or different methods (if you don't have enough classes for everyone).

Once you all have agreed on your team's organization and work procedures, write them down. That will need to be submitted for this project, along with a list of team members, the group name, and the URL of the official Git repo on GitHub. (You can put this information in your repo's readme file.)

In the previous project, we used a very simple workflow without any branches. That is unlikely to work well for most real-world projects. While your group can decide on any workflow you wish, here's a suggestion:

Read Comparing Git Workflows, then discuss as a group which model you want.
After initial GitHub repo creation, you should clone that repo on your PC and create a Java project for it. (You can also fork it on GitHub, then clone that to your PC.)
Next, create and checkout a personal branch for your work. While you can create many branches as you develop and try out different ideas (or work on different issues), have one as your personal main development branch. (Do not work in the main branch.)
Be sure to have an appropriate .gitignore file, so the team members' project files don't get pushed accidentally.

To create the initial repo, create a completely empty repo on GitHub. Next, your team leader (or another designated team member) creates a new Maven project using their preferred IDE. Tweak that if desired, then make the initial Git commit locally. You can add a ReadMe file, documenting your group's agreements for things such as workflow, file encoding, line endings, etc. Finally, you can push your commit to the GitHub repo created earlier. When done, the other team members can clone it on their computers.
Now do some work, committing often in your branch. Once your code is ready for code review (or if you need help from your teammates), you can push your branch up to the GitHub project's repo. Now others can see your code. (If your code is long, you should consider making a pull request on GitHub, which may be easier for your teammates to review.)
If your code needs some work or you need to incorporate suggestions from the code review, go back to work as before. When the final version is approved, you can merge in your branch's commits into the main branch:
1. Before attempting a merge, do a pull from GitHub to your repo. Do not forget this step! Your local copy of the main branch may not contain the latest commits unless you do this.
2. Make sure you switch to your (local) main branch before you do anything else. Then merge in your work from your personal branch.
3. Once that is successful, push the revised main branch up to GitHub.

No commits are made on the main branch until the final version has been chosen by your group.
The team lead then merges the HEAD of each member's personal branch into main, starting with those that did not win. End by merging the final version selected by your team. Note that these merges are easy to do, since you always ignore any conflicts by using the version from the branch you are merging in.

(When you work on projects in the real world, there are many different workflows that are popular. The ones suggested here may not (and probably won't) be the same as the workflow you will need to use then. Consider trying out several workflows, just for the experience.)

In this part of the project, your group must implement a non-functional (that means looks good but doesn't do anything) graphical user interface for the application. (This is the “view”.) The main (default) user interface must support searching and displaying results. It should have various other features, such as an “About...” menu or button, a way to quit the application (if a stand-alone application; if your group creates a web application, there is no need to quit), and a way to get to the administrator/maintenance view. For now, none of the buttons or links need to work.

The maintenance/administrator view must allow the user to perform various administration operations: view the list of indexed file names, add files to the index, remove files from the index, and update the index (when files have been modified since they were indexed). Again, nothing needs to be functional at this time.

The user interface should be complete, but none of the functionality needs to be implemented at this time. You should implement stub methods for the functionality not yet implemented, and invoke them from your event handlers. The stub methods can either return “canned” (fake but realistic) data, or throw an UnsupportedOperationException. The only button that needs to do anything is the one used to show the maintenance view.
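The two kinds of stub mentioned above might look like this sketch. The class name, method names, and canned pathnames are made up for illustration; your group should choose its own.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stubs the event handlers can call until the later parts exist. */
final class SearchEngineStubs {

    enum SearchType { AND, OR, PHRASE }

    /** Stub: ignores its arguments and returns canned (fake but realistic) results. */
    static List<String> search(SearchType type, String terms) {
        return Arrays.asList("docs/sample1.txt", "docs/sample2.txt");
    }

    /** Stub: not implemented yet, so it fails loudly if a handler invokes it. */
    static void addFileToIndex(String pathname) {
        throw new UnsupportedOperationException("addFileToIndex is not implemented yet");
    }
}
```

An event handler for the Search button would call `SearchEngineStubs.search(...)` and display the returned list. A handler that calls a throwing stub should catch the exception (or simply not be wired up yet), so the user interface does not crash.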

Since the user interfaces don't do anything, there is nothing to test yet. However, you must create a test class with at least one test method (it can just return success if you wish). I suggest your group agree to use JUnit 5 style tests; JUnit 4 is also acceptable.

You can download a Search Engine model solution to play with it and inspect its user interface. Please keep in mind you do not (and should not) have to copy that user interface; instead, invent a better one!

In part II (data), your group will implement the file or database operations of the search engine (the “model”). That includes reading and writing the inverted index and file list from/to storage, reading other files a word at a time, and storing file metadata such as last modified time. This does not include creating the index nor implementing any search operations.

In part III (collections), you will implement the inverted index operations, including Boolean searching, adding to the index, and removing files from the index. (The “controller”.) (The index is a complex data structure.) At this point, your search engine should be fully functional.
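To help you picture what is coming, here is a deliberately tiny sketch of an inverted index with AND and OR searches. It is an assumption about one possible shape, not the model solution: it maps each word to the set of pathnames containing it. PHRASE search is omitted because it also needs word positions, and so is removing files.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory inverted index: word -> pathnames of files containing it. */
final class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Records that the given file contains the given word (case-insensitive). */
    void add(String word, String pathname) {
        postings.computeIfAbsent(word.toLowerCase(), k -> new HashSet<>()).add(pathname);
    }

    /** OR search: files containing at least one of the terms. */
    Set<String> searchOr(List<String> terms) {
        Set<String> result = new TreeSet<>();
        for (String term : terms) {
            result.addAll(postings.getOrDefault(term.toLowerCase(), Collections.emptySet()));
        }
        return result;
    }

    /** AND search: files containing every term (no terms means no matches). */
    Set<String> searchAnd(List<String> terms) {
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(
                postings.getOrDefault(terms.get(0).toLowerCase(), Collections.emptySet()));
        for (String term : terms.subList(1, terms.size())) {
            result.retainAll(postings.getOrDefault(term.toLowerCase(), Collections.emptySet()));
        }
        return result;
    }
}
```

OR is a set union and AND is a set intersection, which is why the choice of collection classes matters so much in part III.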

In a real-world project, some additional features would be required. You would likely require someone to authenticate before allowing them to use the maintenance view. All such operations (and any errors) should be logged. Commonly used search terms should be remembered to provide business intelligence. The code should be instrumented to allow measurement and reporting of performance. The application might need to support multiple languages. All the code would need auditing for security, license compliance, possibly ISO 9000 compliance, and quality assurance. Finally, a way to deploy the code and to provide updates and other support should be planned and implemented. None of this is required this term.

Keep your code simple. You can always add features later, time permitting. If you start with a complex, hard-to-implement design, you may run out of time.

Your team's name is a good choice for a package name. However, your project will have multiple parts. It might help to use a sub-package for each part; for example, if your team name is “PifflMinions”, then you could put the UI code in a package named “pifflminions.view”.

(The style recommended above is known as decomposition by structure. In many cases, a better approach is to make one package for each major feature of your application, each containing all the classes related to that feature, so each package gets model, view, and controller classes. For our project, making packages by structure might be simpler for you.)

Keep in mind the requirements for the remaining parts of these projects, as that will affect what is needed in your user interface and code skeleton.

You should use GitHub's issue tracker and wiki. Also use email, Facebook, Skype, Canvas, Discord, or any means you wish, to communicate within your group. (It is suggested you hold short group meetings before or just after class.)

The last 20 minutes or so of most classes will be reserved for team meetings using Zoom breakout rooms. Each group will have their own room created for them, named for their team. Note that breakout rooms require a recent version of Zoom, so update your software! See the Participating in breakout rooms help from Zoom, or watch this short Zoom breakout rooms YouTube video.

Grading will be done for the group's results and individual commits. Individuals in the group will have their grades adjusted by other documentation, such as issues or wiki contributions. (This is useful since some group members may contribute more with design and code reviews than they do by making commits.) If any group member wants to explain a lack of commits, add your comments in your group's wiki site.

See the AWT and Swing tutorials from COP-2800, also the GUI Lecture notes (PDF). (There are no doubt many better tutorials on the Internet, if you look.)

COP 2805C (Java II) Project
Building a Search Engine, Part I: Governance, Workflow, and UI

Due: on the date shown on the syllabus and in Canvas

Background:

Search Engine Project Proposal:

User Interfaces:

Requirements:

Team Organization:

Git Workflow:

Develop Stub User Interfaces:

Preview of next projects:

Hints:

To be Turned in:
