Much of the blog content was written in conjunction with Steve Losh. Steve is a programmer, photographer, blues dancer and musician. Check out Steve’s projects to see some of the cool things he has worked on, or jump over to his Bitbucket account and get straight to the source.
Making the Switch to Distributed Version Control
Many individuals, teams, and organizations are thinking about making the switch to distributed version control systems (DVCS) such as Git and Mercurial (Hg). This is the first post in a series of blog entries over the next several weeks that focus on using and understanding DVCS.
Let’s start off with the basics and explore what version control is in general. In this entry, we will discuss the problems that any version control system aims to solve, where version control came from, and some of the basic concepts you’ll need to know in order to use it.
A Simple Example
It’s often helpful to have a concrete example when talking about editing code, so let’s use a simple personal web page:
[cc lang=”xml”]
<h1>John Doe</h1>
<p>A Java programmer from Chicago, IL.</p>
<h2>About John</h2>
<p>John is experienced in many areas of Java programming.</p>
<h2>Contact Information</h2>
<ul>
<li>Email: john@example.com</li>
<li>Phone Number: (555) 555-1024</li>
</ul>
[/cc]
We’ll use this simple HTML page as an example throughout this entry.
Code Changes Often
The code we write as programmers changes often. Bugs need to be fixed, features need to be added, and content needs to be changed.
Most code is stored as plain old text files, and we change the code by editing these files. Every time we save our changes, we overwrite the old version of the file with a new one.
Unfortunately, no programmer is perfect, and sometimes we make mistakes. If we change a file, save it, compile it, and find that something went wrong, it helps to be able to go back to the old version, or to get a report of exactly what we changed, so we can focus on what we may have done wrong.
Suppose in our example, our fictional character John wants to update his “Contact Information” header to read “John’s Contact Information”. He might edit the file so that that section reads:
[cc lang=”html”]
<h1>John’s Contact Information</h1>
<ul>
<li>Email: john@example.com</li>
<li>Phone Number: (555) 555-1024</li>
</ul>
[/cc]
John saves the file, reloads the page, and notices something doesn’t look quite right. How can John figure out the problem?
In this simple example, it’s fairly easy to simply read the entire file and find the problem, but it can obviously get much more difficult when you’re editing many parts of a large file that all interact with each other.
One of the earliest methods that is still around for comparing versions of files is a pair of utilities called “diff” and “patch”. Modern version control systems still use the concepts (and even file formats) of these tools, so let’s take a look at how they work.
Diff
Diff was originally created in the early 1970s. Its purpose is to take two versions of a file as input, and output a hunk of text that tells you how to change the first file into the second file.
There are many diff formats in existence, but we’ll just work with the most common one, which is known as a “unified diff”. If you’re following along at home on a Linux or OS X machine, you’ll need to pass the -U3 option when you run diff to get this format.
If our example user John wants to use diffs to help him visualize his changes, he’ll need to do some extra work beforehand because he needs to keep the original version of his web page. When John wants to edit his web page, he would do the following steps:
- Make a backup copy of his page by copying his index.html file to index-old.html.
- Make his changes, save them, and preview them.
- Use diff to compare index-old.html to index.html.
When he runs the files through diff, he’ll get output that looks like this:
[cc lang=”xml”]
--- index-old.html 2010-10-06 19:33:31.000000000 -0400
+++ index.html 2010-10-06 19:33:50.000000000 -0400
@@ -13,7 +13,7 @@
 <p>John is experienced in many areas of Java programming.</p>
 
-<h2>Contact Information</h2>
+<h1>John’s Contact Information</h1>
 
 <ul>
 <li>Email: john@example.com</li>
 <li>Phone Number: (555) 555-1024</li>
[/cc]
Notice that there are two lines displayed to represent the line he modified. The first has a ‘-’ before it, which means: “this line was deleted”. The second has a ‘+’ before it, which means: “this line was added.”
This example demonstrates an important point about diffs: even though John only changed a few parts of the line, diff treats the entire old line as removed and the entire new line as added. Standard diffs only deal with entire lines.
Now that John can see exactly what he changed, it’s quite easy to see the problem – he’s changed the second level header to a first level header. He can fix the problem, save his changes, and delete the index-old.html file when he’s satisfied.
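The backup, edit, and compare steps can also be sketched in a few lines of Python, using the standard library's difflib module in place of the diff command. This is only an illustration: the page markup below is an assumed stand-in for John's file, and the function name is made up for this example.

```python
import difflib
import shutil
import tempfile
from pathlib import Path

# Assumed stand-in for John's page; the exact markup is hypothetical.
ORIGINAL = """\
<h1>John Doe</h1>
<p>A Java programmer from Chicago, IL.</p>
<h2>About John</h2>
<p>John is experienced in many areas of Java programming.</p>
<h2>Contact Information</h2>
<ul>
<li>Email: john@example.com</li>
<li>Phone Number: (555) 555-1024</li>
</ul>
"""


def backup_edit_and_diff(workdir):
    """Run the three manual steps and return the unified diff as a list of lines."""
    page = workdir / "index.html"
    backup = workdir / "index-old.html"
    page.write_text(ORIGINAL)

    # Step 1: keep a copy of the original before touching it.
    shutil.copy(page, backup)

    # Step 2: make the edit, including the accidental h2 -> h1 change.
    edited = page.read_text().replace(
        "<h2>Contact Information</h2>",
        "<h1>John's Contact Information</h1>",
    )
    page.write_text(edited)

    # Step 3: compare old and new with three lines of context, like -U3.
    return list(
        difflib.unified_diff(
            backup.read_text().splitlines(keepends=True),
            page.read_text().splitlines(keepends=True),
            fromfile="index-old.html",
            tofile="index.html",
            n=3,
        )
    )


with tempfile.TemporaryDirectory() as tmp:
    diff_lines = backup_edit_and_diff(Path(tmp))

print("".join(diff_lines), end="")
```

Even though only part of the header line was edited, the output contains one whole removed line and one whole added line, which is the line-based behavior described above.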
The ability to easily see what changed in a file is one of the biggest advantages of using diffs, but they can also be used to transform files with the patch utility.
Patch
Patch is a utility used to read hunks of text produced by diff and apply them to files, in order to transform the old files into the new versions.
Patch is often used to share changes with other people. For example, let’s say John asks his friend Mary for some advice on his web page. Mary looks over the code and changes it a bit to make the important sections stand out. She then runs diff to produce a hunk of text describing her changes, which looks like this:
[cc lang=”xml”]
--- index-old.html 2010-10-06 19:33:50.000000000 -0400
+++ index.html 2010-10-06 19:50:08.000000000 -0400
@@ -1,7 +1,7 @@
–
+
@@ -15,10 +15,10 @@
John’s Contact Information
-
- –
- Email: john@example.com
–
- Phone Number: (555) 555-1024
+
- Email: john@example.com
+
- Phone Number: (555) 555-1024
–
+
[/cc]
She saves that text to a file and emails it to John.
When John receives the file, he could simply retype all of Mary’s changes and save the file, but that’s a lot of work and he might make a typo. Instead, he could use the patch utility to “apply” the output of Mary’s diff to his own copy of the file.
When he feeds his version of the file and Mary’s diff to the patch utility with patch index.html index-from-mary.patch, it modifies his copy by replaying the changes Mary made. The result is that his copy of the file now looks like Mary’s, and he can simply upload it to his web server without any more work.
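To make "replaying the changes" a little more concrete, here is a deliberately small Python sketch of what applying a unified diff involves. It is not how the real patch utility is implemented (patch also handles fuzzy matching, multiple files, and more), and the function name and Mary's specific edits are assumptions made for illustration.

```python
import difflib
import re

# Matches hunk headers such as "@@ -15,10 +15,10 @@" or "@@ -3 +3 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def apply_unified_diff(old_lines, diff_lines):
    """Apply a single-file unified diff to old_lines and return the new lines.

    A minimal sketch: exact matching only, no fuzz, one file per diff.
    """
    result = []
    position = 0      # index of the next unconsumed line in old_lines
    in_hunk = False   # False while we are still in the ---/+++ file header
    for line in diff_lines:
        header = HUNK_HEADER.match(line)
        if header:
            start = int(header.group(1))
            length = 1 if header.group(2) is None else int(header.group(2))
            # For an empty old range the start names the line *before* the hunk.
            hunk_start = start - 1 if length else start
            # Copy the untouched lines that sit between hunks.
            result.extend(old_lines[position:hunk_start])
            position = hunk_start
            in_hunk = True
        elif not in_hunk:
            continue  # skip the file header lines
        elif line.startswith((" ", "-")):
            # Context and removed lines must match the old file exactly.
            if position >= len(old_lines) or old_lines[position] != line[1:]:
                raise ValueError(f"patch does not apply at line {position + 1}")
            if line.startswith(" "):
                result.append(old_lines[position])
            position += 1
        elif line.startswith("+"):
            result.append(line[1:])
    # Copy whatever follows the last hunk.
    result.extend(old_lines[position:])
    return result


# John's copy (assumed markup) and Mary's edited copy (hypothetical changes).
john_lines = [
    "<h1>John's Contact Information</h1>\n",
    "<ul>\n",
    "<li>Email: john@example.com</li>\n",
    "<li>Phone Number: (555) 555-1024</li>\n",
    "</ul>\n",
]
mary_lines = [
    "<h1>John's Contact Information</h1>\n",
    "<ul>\n",
    "<li><strong>Email:</strong> john@example.com</li>\n",
    "<li><strong>Phone Number:</strong> (555) 555-1024</li>\n",
    "</ul>\n",
]

# Mary produces the diff; John applies it to his own copy.
diff_lines = list(
    difflib.unified_diff(john_lines, mary_lines, "index-old.html", "index.html")
)
patched = apply_unified_diff(john_lines, diff_lines)
print("".join(patched), end="")
```

The context lines are what let the applier check that John's copy really is the version Mary started from; if they do not match, the sketch refuses to apply the change rather than guess.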
Core Concepts of Diff and Patch
Diff’s purpose can be summed up as: “representing changes to files as hunks of text”. These hunks of text can be read by a human to determine what changed, and can also be saved as files and emailed to other people.
Patch’s purpose can be summed up as: “taking hunks of text produced by diff and applying them to the old versions of files in order to transform them into the new versions.”
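These utilities give us an efficient way to share changes. If one line in a 10,000-line file changes, the unified diff for that file is only a few hundred bytes, since it contains just the changed line and a few lines of context. If we transferred the entire file instead, we would be sending hundreds of kilobytes (assuming lines of around 40 characters). When you work with multiple files, this difference can add up quickly.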
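A rough way to check the size difference yourself is to build a large file in memory, change one line, and compare the size of the diff with the size of the whole file. The sketch below uses Python's difflib; the file contents are made up, and the exact numbers depend on line length and on how many context lines the diff includes.

```python
import difflib

# Made-up file: 10,000 lines of roughly 40 characters each.
old_lines = [
    f"line {i:05d}: some representative source text\n" for i in range(10_000)
]

# Change exactly one line somewhere in the middle.
new_lines = list(old_lines)
new_lines[4_999] = "line 04999: this single line was edited\n"

diff_text = "".join(
    difflib.unified_diff(old_lines, new_lines, fromfile="old.txt", tofile="new.txt")
)

file_size = len("".join(new_lines).encode("utf-8"))
diff_size = len(diff_text.encode("utf-8"))

print(f"whole file: {file_size} bytes")
print(f"diff only:  {diff_size} bytes")
```

The diff stays small no matter how long the file is, because it only carries the changed line plus its surrounding context.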
Managing Versions of a File Without Version Control
A large disadvantage of the diff and patch utilities is that they work with two, and only two, versions of a file: “old” and “new”. Rare indeed is the programming project that only ever needs a single update in its lifetime.
It is often useful to see how a code file has evolved over time. To do that, we need to store “versions” of the file, so we can compare them later. One way to do this is to save many copies of a file with some naming scheme, such as adding the date to the file name, like this:
[cc lang=”html”]
index.html
index-2009-04-08.html
index-2009-06-06.html
index-2009-08-10.html
index-2009-11-04.html
index-2010-01-23.html
index-2010-09-21.html
[/cc]
The disadvantages to this method are many:
- We need to save a full copy of the file even if only a single line changed.
- The naming scheme becomes more complicated if we need to store two separate versions for the same date.
- Two people may edit the file on the same date.
- Many versions of the file are stored, which clutters our project folder.
- If our hard drive fails, the entire history of the file disappears.
- If we want to tell a coworker to “look at the changes between X and Y,” we need to send them those versions.
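The dated-copy approach is easy to automate, which also makes its weak points easy to see. The following Python sketch saves a copy of a file under a date-based name; the helper name is hypothetical, and refusing to overwrite an existing copy is just one possible way to handle two saves on the same date.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def save_dated_copy(page, day):
    """Copy page to a sibling named like index-2010-09-21.html and return its path."""
    target = page.with_name(f"{page.stem}-{day.isoformat()}{page.suffix}")
    if target.exists():
        # Two versions on the same date do not fit the naming scheme.
        raise FileExistsError(f"{target.name} already exists")
    shutil.copy(page, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"

    # First version of the day: saved without trouble.
    page.write_text("<h2>Contact Information</h2>\n")
    save_dated_copy(page, date(2010, 9, 21))

    # Second version on the same day: the scheme has no name left for it.
    page.write_text("<h1>John's Contact Information</h1>\n")
    try:
        save_dated_copy(page, date(2010, 9, 21))
        collided = False
    except FileExistsError:
        collided = True

    saved_names = sorted(p.name for p in Path(tmp).iterdir())

print(saved_names, collided)
```

Every saved version is a full copy of the file, and the second save on the same day fails, which matches two of the problems in the list above.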
Managing Versions of a Group of Files Without Version Control
Another disadvantage of diff and patch, at least as we have used them here, is that they work on a single file at a time.
In the real world, most programming projects consist of many files, which means we need to save copies of each file we change.
Worse still, changes to one file might affect other files. If we change a C header file, we almost certainly need to change the corresponding .c file, which means we need to make sure that those versions are kept together as a group.
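One manual way to keep related files together is to copy the whole project folder as a single snapshot and then compare the snapshot with the working copy. The Python sketch below does this with shutil.copytree and filecmp.dircmp; the file names, their contents, and the snapshot folder name are all invented for the example.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A tiny made-up C project with a header, its implementation, and a caller.
    project = Path(tmp) / "project"
    project.mkdir()
    (project / "shapes.h").write_text("int area(int w, int h);\n")
    (project / "shapes.c").write_text("int area(int w, int h) { return w * h; }\n")
    (project / "main.c").write_text("int main(void) { return 0; }\n")

    # Snapshot every file at once so the versions stay together as a group.
    snapshot = Path(tmp) / "project-2010-10-06"
    shutil.copytree(project, snapshot)

    # Changing the header forces a matching change in the .c file.
    (project / "shapes.h").write_text("long area(long w, long h);\n")
    (project / "shapes.c").write_text("long area(long w, long h) { return w * h; }\n")

    # List the files that differ between the snapshot and the working copy.
    changed = sorted(filecmp.dircmp(snapshot, project).diff_files)

print(changed)
```

This keeps the header and its implementation in step, but it copies every file for every snapshot, so the storage and clutter problems only get worse.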
Managing versions yourself by making copies of files quickly becomes a nightmare. Version control systems are programs designed to handle this for us, so we can stop worrying about copying files and get back to coding.
Centralized Version Control vs. Distributed Version Control
In our next entry we will discuss the advantages and disadvantages of centralized version control. Thinking about moving from Subversion to Git? Want to learn about the key concepts of DVCS? Stay tuned for the rest of our Switch To DVCS blog series.