Project 2: Gitlet, your own version control system

A. Overview of Gitlet

In this project you'll be implementing a version control system. This version control system mimics some of the basic features of the popular version control system git, but it is smaller and simpler, so we have named it gitlet.

A version control system helps you save snapshots of your files at different points in time. That way, if you mess them up later, you can return to earlier versions. In gitlet, a snapshot of files is referred to as a commit.

In this project, it will be helpful for us to visualize the commits we make over time. Suppose we have a file wug.txt, we add some text to it, and commit it. Then we modify the file and commit these changes. Then we modify the file again, and commit the changes again. Now we have saved three total snapshots of this file, each one later in time than the previous. We can visualize these commits like so:

Three commits

Here we've drawn an arrow indicating that each commit contains some kind of reference to the commit that came before it — this will be important later. But for now, does this drawing look familiar? That's right; it's a linked list!

The big idea behind gitlet is that once we have this list of commits, it's very easy for us to restore old versions of files. You can imagine making a command like: "Gitlet, please revert to the state of the files at commit #2", and it would go to the second node in the linked list and restore the copies of files found there.

If we tell gitlet to revert to an old commit, the front of the linked list will no longer reflect the current state of your files, which might be a little misleading. In order to fix this problem, we introduce something called a head pointer. A head pointer keeps track of where in the linked list we're currently "at". Normally, as we make commits, the head pointer will stay at the front of the linked list, indicating that the latest commit reflects the current state of the files:

Simple head

However, let's say we revert to the state of the files at commit #2. We move the head pointer back to show this:

Reverted head

So what happens if we change wug.txt here, and make a new commit?

First, let's refine our idea of what a commit is. Recall that the head pointer is supposed to indicate something like the current state of the files, and a commit is a snapshot of the current state of our files. What we now say is this: when we commit, we'll add a new node in front of the commit that the head pointer points to. Then, we'll move the head pointer to the new commit. So if we do another commit now...

New commit

Something weird happened! Normally, when we repeatedly make commits in a row, we just keep appending to the front of head, so we end up with a list. But notice that if we revert backward and then commit from the middle, the thing branches! We no longer have a list of commits, but a tree of commits.
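To make the picture concrete, here is a minimal sketch of commits as nodes that only point backward, plus a head pointer. The class names, the integer ids, and the revertTo method are all hypothetical, made up for illustration; nothing here is required structure for your project.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical commit node: it only ever points backward, to its parent. */
class CommitNode {
    final int id;
    final String message;
    final CommitNode parent; // null for the very first commit

    CommitNode(int id, String message, CommitNode parent) {
        this.id = id;
        this.message = message;
        this.parent = parent;
    }
}

/** Hypothetical holder for every commit plus the head pointer. */
class CommitTreeSketch {
    final List<CommitNode> all = new ArrayList<>();
    CommitNode head;

    /** Adds a new node in front of head, then moves head to it. */
    CommitNode commit(String message) {
        CommitNode node = new CommitNode(all.size() + 1, message, head);
        all.add(node);
        head = node;
        return node;
    }

    /** Moves head back to an existing commit (ids start at 1 in this sketch). */
    void revertTo(int id) {
        head = all.get(id - 1);
    }

    public static void main(String[] args) {
        CommitTreeSketch tree = new CommitTreeSketch();
        tree.commit("first");
        CommitNode second = tree.commit("second");
        CommitNode third = tree.commit("third");
        tree.revertTo(2);
        CommitNode fourth = tree.commit("alternate third");
        // Commits 3 and 4 now share commit 2 as their parent: the list became a tree.
        System.out.println(third.parent == second);  // true
        System.out.println(fourth.parent == second); // true
    }
}
```

Notice that nothing special had to happen to create the branch. Committing always does the same thing; the tree shape falls out of where head happened to be.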

What happened is that now we don't just have old and new versions of our file. We have alternate versions of it. Maybe we did this because we're not sure which version of the file is better, so we want to keep both around temporarily.

Since we're not sure which version of the file we like better, we think it might be helpful to switch between them whenever we want. To make this easier, we might want to label them. We can imagine making a command like: "Gitlet, please label one of these versions A, and the other B!"

Two versions

So there is a version A and a version B going here. Now you can tell Gitlet to switch back and forth between them at will, and make commits on each one separately, developing each independently.

Two developed versions

In gitlet, these different versions are formally referred to as branches. The idea is that each branch has its own head node, which is the node at the front of the branch.

Two heads

In this project, you'll write commands for committing, branching, and grabbing files from all around the commit tree. A detailed spec of how this should work follows this section.

But a last word here: one feature of the commit tree is that it is in some sense immutable: once a commit node has been created, it can never be destroyed (or changed at all). We can only add new things to the commit tree, not modify existing things. This is an important feature of gitlet! Remember, it's a version control system, and one of our goals with it is to allow us to save things so we don't delete them accidentally.

B. Detailed Spec of Behavior

Overall Spec

The only structure requirement we’re giving you is that you have a class named Gitlet and that it has a main method. Here’s your skeleton code for this project:

public class Gitlet {
    public static void main(String[] args) {
    }
}

You may, of course, write additional Java classes to support your project. But don’t use any external code (aside from JUnit), and don’t use any programming language other than Java. You can use all of the Java Standard Library that you wish.

The majority of this spec will describe how Gitlet.java's main method must react when it receives various arguments which correspond to commands to the gitlet system. But before we break down command-by-command, here are some overall guidelines the whole project should satisfy:

Allow the user to input an answer, and only follow through with the command if the user types yes.
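One possible way to read the user's answer is sketched below. The class and method names are hypothetical, and trimming whitespace before comparing against yes is an assumption on our part, not something this spec promises.

```java
import java.util.Scanner;

class ConfirmSketch {
    /** Returns true only if the next input line, after trimming, is "yes". */
    static boolean confirmed(Scanner in) {
        return in.hasNextLine() && in.nextLine().trim().equals("yes");
    }

    public static void main(String[] args) {
        // In a real run the Scanner would wrap System.in.
        Scanner in = new Scanner(System.in);
        if (confirmed(in)) {
            System.out.println("proceeding");
        } else {
            System.out.println("cancelled");
        }
    }
}
```

Passing the Scanner in as a parameter, rather than reading System.in inside the method, makes the check easy to exercise from a test.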

C. The Commands

initialize

add

commit

Here's a picture of before-and-after commit:

Before and after commit

remove

log

Notice there is a ==== separating each commit. There is also an empty line between each commit. Also notice that commits are displayed with the most recent at the top. By the way, there's a class in the Java standard library that will help you format the dates really easily. Look into that instead of trying to construct it manually yourself!
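As a rough sketch of the date formatting hint, java.text.SimpleDateFormat is one standard library class that can do the job. The field order and the date pattern below are assumptions chosen for illustration; match whatever the log example in this spec shows. The class and method names are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class LogFormatSketch {
    /** Formats one log entry; the layout and date pattern here are assumptions. */
    static String entry(int id, Date time, String message, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(zone);
        return "====\n"
            + "Commit " + id + ".\n"
            + fmt.format(time) + "\n"
            + message + "\n";
    }

    public static void main(String[] args) {
        // println adds the empty line that separates one entry from the next.
        System.out.println(entry(1, new Date(), "added wug.txt", TimeZone.getDefault()));
    }
}
```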

Here's a picture of the history of a particular commit. If the current branch's head pointer happened to be pointing to that commit, log would print out information about the circled commits:

History

Note that it ignores other branches and the future. Now that we have the concept of history, let's refine what we said earlier about the commit tree being immutable. It is immutable precisely in the sense that the history of a commit with a particular id may never change, ever. If you think of the commit tree as nothing more than a collection of histories, then what we're really saying is that each history is immutable.
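The history of a commit can be sketched as a walk along parent references, which is also why it can never change: a commit only knows about what came before it. The Commit class and its fields below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class HistorySketch {
    /** Hypothetical commit with a single, final parent link. */
    static final class Commit {
        final int id;
        final Commit parent;

        Commit(int id, Commit parent) {
            this.id = id;
            this.parent = parent;
        }
    }

    /** Returns the ids in the history of start, most recent first. */
    static List<Integer> history(Commit start) {
        List<Integer> ids = new ArrayList<>();
        for (Commit c = start; c != null; c = c.parent) {
            ids.add(c.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        Commit c1 = new Commit(1, null);
        Commit c2 = new Commit(2, c1);
        Commit c3 = new Commit(3, c2);   // one branch
        Commit c4 = new Commit(4, c2);   // another branch off commit 2
        System.out.println(history(c4)); // [4, 2, 1]
    }
}
```

Commit 3 never shows up in the history of commit 4, even though both exist in the tree, because the walk only follows parent links.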

global log

find

status

Notice there is an empty line between each section. The order of branches/files within each section does not matter.

checkout

Checkout is a kind of general command that can do a few different things depending on what its arguments are. There are 3 possible use cases. In each section below, you'll see 3 bullet points. Each corresponds to the respective usage of checkout.

In addition, you might wonder: what happens if you have a file name that's the same as a branch name? In this case, let the branch name take precedence.
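The precedence rule can be sketched as a simple ordering of checks. Everything here (the enum, the method, representing branches and files as sets of names) is hypothetical, and what to do in the UNKNOWN case is left to the rest of the spec.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CheckoutDispatchSketch {
    enum Target { BRANCH, FILE, UNKNOWN }

    /** Decides what a single checkout argument refers to; branch names win ties. */
    static Target resolve(String arg, Set<String> branches, Set<String> files) {
        if (branches.contains(arg)) {
            return Target.BRANCH; // checked first, so it takes precedence
        }
        if (files.contains(arg)) {
            return Target.FILE;
        }
        return Target.UNKNOWN;
    }

    public static void main(String[] args) {
        Set<String> branches = new HashSet<>(Arrays.asList("master", "notes"));
        Set<String> files = new HashSet<>(Arrays.asList("notes", "wug.txt"));
        System.out.println(resolve("notes", branches, files));   // BRANCH
        System.out.println(resolve("wug.txt", branches, files)); // FILE
    }
}
```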

branch

All right, let's see what branch does in detail. Suppose our state looks like this:

Simple history

Now we call java Gitlet branch cool-beans. Then we get this:

Just called branch

Hmm... nothing much happened. Let's switch to the branch with java Gitlet checkout cool-beans:

Just switched branch

Nothing much happened again?! Okay, say we make a commit now. Modify some files, then java Gitlet add... then java Gitlet commit....

Commit on branch

I was told there would be branching. But all I see is a straight line. What's going on? Maybe I should go back to my other branch with java Gitlet checkout master:

Checkout master

Now I make a commit...

Branched

Phew! So that's the whole idea of branching. Did you catch what's going on? All creating a branch does is give us a new head pointer. At any given time, one of these head pointers is considered the currently active head pointer (indicated by *). We can switch the currently active head pointer with checkout. Whenever we commit, it means we add a new commit in front of the currently active head pointer, even if one is already there. This naturally creates branching behavior.
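Here is one way the walkthrough above could look in code: a map from branch names to head commits, plus the name of the active branch. This is only a sketch under our own assumptions. The class names are hypothetical, and it skips everything checkout and commit do with actual files, as well as all error checking.

```java
import java.util.HashMap;
import java.util.Map;

class BranchSketch {
    /** Hypothetical commit with a single parent link. */
    static final class Commit {
        final String message;
        final Commit parent;

        Commit(String message, Commit parent) {
            this.message = message;
            this.parent = parent;
        }
    }

    final Map<String, Commit> heads = new HashMap<>();
    String active = "master";

    BranchSketch() {
        heads.put(active, new Commit("first", null));
    }

    /** Creating a branch only adds a new head pointer at the current commit. */
    void branch(String name) {
        heads.put(name, heads.get(active));
    }

    /** Switching branches only changes which head pointer is active. */
    void checkout(String name) {
        active = name;
    }

    /** Adds a commit in front of the active head and advances only that head. */
    void commit(String message) {
        heads.put(active, new Commit(message, heads.get(active)));
    }

    public static void main(String[] args) {
        BranchSketch g = new BranchSketch();
        g.branch("cool-beans");
        g.checkout("cool-beans");
        g.commit("on cool-beans");
        g.checkout("master");
        g.commit("on master");
        Commit a = g.heads.get("master");
        Commit b = g.heads.get("cool-beans");
        // Two different heads that share a parent: the tree has branched.
        System.out.println(a != b && a.parent == b.parent); // true
    }
}
```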

Make sure that the behavior of your branch, checkout, and commit match what I've described above. This is pretty core functionality of gitlet that many other commands will depend upon. If any of this core functionality is broken, very many of our autograder tests won't work!

remove branch

reset

merge

Furthermore, the real git handles merge conflicts differently than gitlet. The real git will splice the two conflicted files together into a single file, then ask the user to pick and choose the correct sections manually. Gitlet does not do this, instead just adding in the .conflicted copy. Furthermore, git will put you in a special state where the commands you can run are limited until you finish resolving the merge conflict. Gitlet does no such thing.

rebase

Rebase has one special case to look out for. If the current branch is in the history of the given branch, rebase just moves the current branch to point to the same commit that the given branch points to. No commits are replayed in this case.
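Detecting this special case amounts to asking whether the current branch's head appears in the history of the given branch's head. A minimal sketch, with hypothetical class and method names and no handling of rebase's failure cases:

```java
class RebaseSpecialCaseSketch {
    /** Hypothetical commit with a single parent link. */
    static final class Commit {
        final int id;
        final Commit parent;

        Commit(int id, Commit parent) {
            this.id = id;
            this.parent = parent;
        }
    }

    /** True if candidate is start itself or appears in start's history. */
    static boolean inHistory(Commit candidate, Commit start) {
        for (Commit c = start; c != null; c = c.parent) {
            if (c == candidate) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns the commit the current branch should point to in the special
     * case, or null when commits would actually have to be replayed.
     */
    static Commit specialCaseHead(Commit currentHead, Commit givenHead) {
        return inHistory(currentHead, givenHead) ? givenHead : null;
    }

    public static void main(String[] args) {
        Commit c1 = new Commit(1, null);
        Commit c2 = new Commit(2, c1);
        Commit c3 = new Commit(3, c2);
        Commit side = new Commit(4, c1);
        System.out.println(specialCaseHead(c1, c3) == c3);     // true
        System.out.println(specialCaseHead(side, c3) == null); // true
    }
}
```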

There's one more point to make about rebase: If after the split point the given branch contains modifications to files that were not modified in the current branch, then these modifications should propagate through the replayed branch. If both the given branch and the current branch have modifications to the same files, then what you would expect to happen is that you would get conflicted files, much like merge. However, for simplicity, we're not going to have you deal with conflicts: in this case, just use the current branch's copies of the files.
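The rule for which copy of a file wins can be sketched as follows. This is a simplification under our own assumptions: each snapshot is modeled as a hypothetical map from file name to a version label, only the final state at the new head is computed, and removed files are ignored entirely.

```java
import java.util.Map;
import java.util.TreeMap;

class RebaseFilesSketch {
    /**
     * Starts from the given branch's files (so its modifications propagate),
     * then overrides any file that the current branch changed since the
     * split point (so the current branch wins when both changed it).
     */
    static Map<String, String> resolve(Map<String, String> split,
                                       Map<String, String> current,
                                       Map<String, String> given) {
        Map<String, String> result = new TreeMap<>(given);
        for (Map.Entry<String, String> e : current.entrySet()) {
            boolean changedInCurrent = !e.getValue().equals(split.get(e.getKey()));
            if (changedInCurrent) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    /** Small helper for building maps from name/version pairs. */
    static Map<String, String> mapOf(String... pairs) {
        Map<String, String> m = new TreeMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            m.put(pairs[i], pairs[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> split = mapOf("notes.txt", "v1", "wug.txt", "v1");
        Map<String, String> current = mapOf("notes.txt", "v1", "wug.txt", "v2");
        Map<String, String> given = mapOf("notes.txt", "v2", "wug.txt", "v3");
        // notes.txt changed only in given; wug.txt changed in both.
        System.out.println(resolve(split, current, given)); // {notes.txt=v2, wug.txt=v2}
    }
}
```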

Finally, after any successful rebase command, update the files in the working directory to reflect the versions of the files at the new head of the current branch.

By the way, if there are multiple branches after the split point, you should NOT replay the other branches. For example, say we are on branch branch1 and we make the call java Gitlet rebase master:

Branching rebase

interactive rebase

D. Miscellaneous Things to Know about the Project

Phew! That was a lot of commands to go over just now. But don't worry, not all commands are created equal. Many are just minor bookkeeping commands and will only take about a line of code. Merge and rebase are lengthier commands than the others, so don't leave them for the last minute!

Anyway, by now this spec has given you enough information to get working on the project. But to help you out some more, there are a couple of things you should be aware of:

The strategy we recommend for dealing with this is to write your objects to a file before ending the program. Then, next time you start the program, you first read back in the state of the objects from the file. Luckily, this is very easy to do in Java. Look into the java.io.Serializable interface!
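A minimal sketch of that save-then-load round trip, using ObjectOutputStream and ObjectInputStream from the standard library. The State class, its fields, and the file name are hypothetical; your own objects will look different.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class PersistSketch {
    /** Hypothetical program state; every field type must itself be serializable. */
    static final class State implements Serializable {
        private static final long serialVersionUID = 1L;
        int nextId = 0;
        String activeBranch = "master";
    }

    /** Writes the state to the file, replacing whatever was there. */
    static void save(State state, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back a state previously written by save. */
    static State load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (State) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not read saved state", e);
        }
    }

    public static void main(String[] args) {
        File file = new File(System.getProperty("java.io.tmpdir"), "gitlet-state.ser");
        State state = new State();
        state.nextId = 3;
        save(state, file);
        System.out.println(load(file).nextId); // 3
    }
}
```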

~~Warning: serializing and deserializing takes time proportional to the number of objects you are serializing or deserializing. Keep this in mind when thinking about the runtimes of your commands!~~ Update: The staff has decided that in order to reduce the difficulty of the serialization, you can generally ignore serialization time, except with the caveat that your serialization time cannot depend in any way on the total size of files that have been added, committed, etc. Students are encouraged to hit the original goal if they feel up to it, though!

E. Submission and Grading

Like normal, push your code to submit/proj2 in order to submit. Push to ag/proj2 to submit to the autograder.

Be aware that the public autograder for this project is extremely barebones, and your code will mostly be graded on a secret autograder. The public autograder is essentially nothing more than a sanity check on the most basic commands. This means that in order to ensure your code works, you'll have to test it yourself! I guarantee you that your code will not work if you don't test it thoroughly yourself.

To help you test your own code, we've provided three very simple test cases you can look at, in the file GitletPublicTest.java. These are exactly the tests that will run if you push to ag/proj2. We recommend you base your own tests on these examples. The utility methods this class provides should make your testing much easier. Don't worry about understanding them fully — you can use them just trusting their abstraction.

By the way, you should also try running your code from the command line and use it just like git! Don't only test with JUnit. In addition, if you're using Windows, be sure to test out your code and tests on a Linux or Mac machine, such as the lab computers. You want to make sure that your code does not only work in a Windows environment, since our autograders will be run on Linux.

About grading: because so many of the commands depend on one another, we cannot grade each one separately. Instead, we have to test sequences of commands together. Be aware that if any of the core functionality is broken (namely add, commit, branch, checkout, log), then many of our tests may break, and you will end up with few points. Make sure they work exactly as the spec describes! Although the fringe functionality is more difficult and time consuming to write (like merge, rebase), fewer tests depend on these methods, so they won't impact your grade as much.

For 0.1 points of extra credit, ensure that every one of your methods has a descriptive comment.

F. Stretch Goal: Going Remote

This project is all about mimicking git’s local features. These are useful because they allow you to backup your own files and maintain multiple versions of them. However, git’s true power is really in its remote features, allowing collaboration with other people over the internet.

This project’s stretch goal is to implement some basic remote commands: namely add-remote, rm-remote, push, pull, and clone. You will get this project's gold points from completing them. This stretch goal will be significantly more challenging than the rest of the project: don't attempt or plan for it until you have completed the rest of the project. In fact, I don't even recommend reading the remainder of the spec until you've completed the rest of the project.

Setting Up scp

Okay, have you finished the rest of the project now? Let's go on then. But before describing how the specific remote commands work, let’s first go over the basics of how the commands can interact with the internet.

All of the remote commands should work off scp (or pretend like they do). scp is a terminal command that allows you to copy files back and forth between computers across the internet. Your remote gitlet commands should work on anything that you have scp access to. That means, before beginning coding this part of the project, you should check to make sure you can use scp from the terminal. Note that scp is a unix command; Windows users will have to either use git bash, or use the lab computers for this.

Anyway, you should have scp access to your user account on the berkeley lab computers. For example, try out a command like:

scp [some file] cs61b-[xxx]@torus.cs.berkeley.edu:[some other file]

Where [xxx] is your login, [some file] is the name of a file on your local computer, and [some other file] is what you want the file to be named when you copy it onto the lab computers. torus.cs.berkeley.edu is the name of a computer on campus; you can use other ones too, such as cory.eecs.berkeley.edu.

After you do the scp command, you’ll be prompted for your password to log-in to your account on the lab computer.

Unfortunately, it just won’t do to have to enter your password every time you run gitlet’s remote commands. So, we’re actually going to need to take advantage of scp’s password-less login features.

So let’s revise what we said earlier to: Your remote gitlet commands should work on anything that you have password-less scp access to.

In order to get yourself password-less login to stuff over scp, you’ll want to set up an ssh key.

You can look up guides for setting up password-less ssh online. For example, this guide on github has some instructions on creating an ssh key. Only steps 1 - 3 will be relevant to you, though, because you don't want to add the ssh key to your github account; you want to add your ssh key to your user account on the berkeley lab computers. For instructions specifically about the lab computers, you might want to check out inst's help page here (see the sections SSH Public and Private Keys (passphrases) and Password-less Logins (OpenSSH)), though the instructions aren't as clear as github's are.

You can look up other resources too, if these aren't good enough for you. Keep in mind that setting up scp is not supposed to be the difficult part of this project! If you get stuck, ask questions.

All right, have you gotten password-less scp working? Great! Now you should be able to get your gitlet commands to work in Java. Let's move on.

Note: the simplest way to get Java to transfer files over scp is probably just to make Java call terminal commands; though there are more legitimate ways of using scp in Java, you’re not required to use them. That said, please do not just make Java directly call terminal commands in the other portions of the project; take advantage of the file system abstractions that Java offers.
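For instance, calling scp from Java could look roughly like the sketch below, which uses ProcessBuilder from the standard library. The class and method names are hypothetical, and the login and paths in main are placeholders. It assumes scp is on the PATH and password-less login is already set up.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class ScpSketch {
    /** Builds the argument list for: scp localPath user@host:remotePath */
    static List<String> uploadCommand(String localPath, String user,
                                      String host, String remotePath) {
        return Arrays.asList("scp", localPath, user + "@" + host + ":" + remotePath);
    }

    /** Runs the command and returns its exit code (0 conventionally means success). */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Placeholder login and paths; this prints the command instead of running it.
        List<String> command =
            uploadCommand("wug.txt", "cs61b-xxx", "torus.cs.berkeley.edu", "wug.txt");
        System.out.println(String.join(" ", command));
    }
}
```

Passing the arguments as a list, rather than one long string, avoids trouble with spaces in file names.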

Windows users: If you make Java call terminal commands, it won't use git bash; it will use the regular command line. This is problematic because then you can't use scp. The way to remedy this is to set your command line to be able to use the unix git bash commands like scp. Try to figure out how to do this, and ask if you have questions!

The Commands

All right, now that you've gotten scp working, onto the rest of the project!

A few notes about the remote commands:

So now let's go over the commands:

add remote

remove remote

push

This command only works if the remote branch's head is in the history of the current local head, which means that the local branch contains some commits in the future of the remote branch. In this case, append all of the future commits to the remote branch. Then, move the remote branch head to the front of the future commits (so its head will be the same as the local head). This is called fast-forwarding.
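The fast-forward check can be sketched as a walk back from the local head that stops when it reaches the remote head's id. The names below are hypothetical. The sketch compares ids rather than object references because the remote's commits live on another machine, and it returns null to signal that fast-forwarding is not possible.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class FastForwardSketch {
    /** Hypothetical commit with a string id and a single parent link. */
    static final class Commit {
        final String id;
        final Commit parent;

        Commit(String id, Commit parent) {
            this.id = id;
            this.parent = parent;
        }
    }

    /**
     * Returns the ids of the commits the remote branch is missing, oldest
     * first, or null if the remote head is not in the local head's history.
     */
    static List<String> commitsToPush(String remoteHeadId, Commit localHead) {
        Deque<String> pending = new ArrayDeque<>();
        for (Commit c = localHead; c != null; c = c.parent) {
            if (c.id.equals(remoteHeadId)) {
                return new ArrayList<>(pending);
            }
            pending.addFirst(c.id);
        }
        return null;
    }

    public static void main(String[] args) {
        Commit c1 = new Commit("c1", null);
        Commit c2 = new Commit("c2", c1);
        Commit c3 = new Commit("c3", c2);
        System.out.println(commitsToPush("c1", c3)); // [c2, c3]
    }
}
```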

There is one additional use case of push: If there is no gitlet system currently on the remote, push will actually initialize it there. Just copy over the entire gitlet state to the remote machine. Ignore the given branch name in this case.

Or, if the gitlet system on the remote machine exists but does not have the input branch, then simply add the branch to the remote gitlet.

pull

clone

There is one final note to make about the remote commands. If you think about the remote commands hard enough, you'll realize that using an arbitrary id scheme won't work in the hypothetical scenario where you're collaborating with a friend using a remote. If you and your friend make different commits on each of your local machines that end up with the same id, then the remote commands will break.

One way real git solves this problem is that each commit's id is generated from its contents and metadata. The id number is not simply an arbitrary id number, but a hash. So, if you and your friend make different commits, they have to end up with different ids.

So to do the remote commands, please change your arbitrary id scheme to a hashing id scheme. The hashes should be determined at least from the commit's message, time, and hash of its parent (if it has one). This means it's okay to get collisions in a scenario where you have two commits that have the same message, time, and history. However, it is imperative you don't get collisions otherwise. To ensure this, look into using an existing hashing algorithm, rather than writing your own. Also, to help you avoid collisions, commit ids can be long hex numbers instead of just integers.
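As one illustration, java.security.MessageDigest can compute a SHA-1 hash over the three required ingredients. The choice of SHA-1, the way the fields are joined, and the names below are all assumptions; any existing hashing algorithm that avoids accidental collisions fits the requirement.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class CommitHashSketch {
    /** Returns a 40-character hex id derived from message, time, and parent id. */
    static String commitId(String message, long timeMillis, String parentId) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            String parent = parentId == null ? "" : parentId;
            // The length prefix keeps the message from blurring into the other fields.
            String payload = message.length() + ":" + message + "\n"
                + timeMillis + "\n" + parent;
            byte[] digest = sha1.digest(payload.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        String first = commitId("first", 0L, null);
        String second = commitId("second", 1000L, first);
        System.out.println(first.length());       // 40
        System.out.println(first.equals(second)); // false
    }
}
```

Because the parent's id feeds into the child's id, two commits with the same message and time but different histories still end up with different ids.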

G. Acknowledgements

Thanks to Josh Hug, Sarah Kim, Austin Chen, Andrew Huang, Yan Zhao, Matthew Chow, and especially Alan Yao for providing feedback on this project. Thanks to git for being awesome.

This project was largely inspired by this excellent article by Philip Nilsson.

This project was created by Joseph Moghadam with the help of Alicia Luengo.