An important best practice for scientific software developers is to focus on making incremental changes. In this section, we'll focus on the basic techniques for saving your work using Git. In this lesson, you'll learn a third use for the checkout command, how to work with development branches, and then how to save revisions in the staging area.
After this lesson, students should be able to:
git checkout and
git branch.git status
and git diff.git add, git
reset, and git rm.git commit.In the last lesson we cloned a repository and explored its history. We also learned about HEAD, a special pointer to the revision of our project that we most recently checked out. We call lightweight pointers to revisions in Git branches. This is confusing at first, because many newcomers to Git think of branches as part of the structure of the repository's history, instead of pointers to it.
The best way to think of branches is to first imagine your repository as a notebook shared by several collaborators in your laboratory, with bookmarks (the branches) on the page of their most recent experiment. Every time somebody performs an experiment, they move their bookmark to the page of the newest experiment. Since laboratory paper is cheap, researchers take notes of experiments that might fail, and reset their notes whenever they don't like an outcome. Any time somebody wants to keep track of one of their collaborators, they just need to look up their bookmarks. Since bookmarks are cheap, some of the scientists track multiple experiments, even risky ones that might not pan out, with their bookmarks. The bookmarks themselves are not the entries in the notebook, but they are a handy way to keep track of active work.
In Git, a branch is a reference that follows your development. When you create new revisions, your current branch will automatically move to new commits you make. Branches are also primary tools for sharing and collaboration, as they allow you to refer others to work-in-progress (and scientific software development is always work-in-progress).
When a repository is created, Git creates a branch called master.
Unlike HEAD, the only thing special about master is that it's the
default name given to this branch.  It's generally not a great idea to
work directly on master, so we'll start this lesson by creating a new
branch from it for development.
By default, Git creates branches that point to the currently
checked-out revision.  We would like to start our development at the
most up-to-date version of the repository, so we check out our local
copy of the master branch first.  This checkout command performs
two actions.  First, it checks out the repository to the revision
pointed to by the master branch, in this case: 61fd2bc.  Second,
it activates the master branch.  Keep in mind, Git will
automatically move the active branch pointer for you when you make
commits.
$ git checkout master
Switched to branch 'master'
We can create and activate a branch in the same step by adding the
-b flag to git checkout.  Since later we'll be fixing an error
in the  file python_pipeline.ipy, we'll name the branch
fix_pipeline.
$ git checkout -b pipeline_fix
Switched to a new branch 'pipeline_fix'
We can see the current branches available with git branch, but this
command is much more useful if we add the -v flag (for verbose).
Note that git branch lists the local branches in our repository,
master and pipeline_fix, which we just created.  The * next to
pipeline_fix indicates that it is active.
$ git branch -v
  master       61fd2bc Made fixes to Python pipeline
* pipeline_fix 61fd2bc Made fixes to Python pipeline
Now that we've created our development branch, let's start preparing some changes to our repository.
master and pipeline_fix point to the same
revision even though they are different branches?--decorate argument to git log will annotate your
history view with the location of your current branches.  Try
calling git log --oneline --decorate.  Discuss with your partner
what you think origin/master refers to.DETACHED HEAD state occurs when the repository does
not have an active branch, frequently after checking out a commit or
remote reference.  Try calling the command git checkout 61fd and
explaining the output message to your partner.git branch -D to delete a branch that is not
active.  Remember that this only deletes your local reference to a
commit.  Try deleting the master branch in your repository by
calling git branch -D master.In the previous lesson, we used git checkout to undo some changes
we had made to a file.  We also learned about diff output, which is
a common way for computers to tell us how two text files differ.  In
this section, we'll learn about the git status and git diff
commands, which help us see changes in our repository.  We'll also learn
how to use the git show command to retrieve a copy of a file from
anywhere in our repository's history.
Let's start by repeating our dangerous idea from the previous lesson, and delete an important data file from the repository!
$ rm Lumi.2763.csv
$ ls Lumi.2763.csv                                                                            ✖
ls: Lumi.2763.csv: No such file or directory
Now that the file is gone, we'd like to ask Git if it was paying
attention and knows something is missing.  We're going to do that now
with one of the most powerful tools in the Git toolbox, the git
status command.
$ git status
On branch pipeline_fix
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
	deleted:    Lumi.2763.csv
no changes added to commit (use "git add" and/or "git commit -a")
Whew, that's a lot of output.  Let's take it line by line.  First,
the git status command reports that we're on the pipeline_fix
branch (if it doesn't, and you don't understand why, re-read the
previous section).  Second, Git is reporting that you have made some
changes, but these are not yet saved.  Git refers to your files as the
working directory, or sometimes as the working tree.  If you see
working in the name, be confident that Git is referring to the files
you can see in front of you with commands like ls.  Finally, Git
summarizes the changes.  In this case, since a file was completely
deleted, it reports that Lumi.2763.csv has been deleted.  Let's
ignore the deleted file for now, and make some changes to
python_pipeline.ipy.  I encourage you to make real changes with your
editor, but for the purposes of this example, I'm going to append a
line of text to the end of python_pipeline.ipy from the command line.
$ echo "# This is a comment." >> python_pipeline.ipy
We call git status to see the effects of modifying
python_pipeline.ipy.
$ git status
On branch pipeline_fix
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
	deleted:    Lumi.2763.csv
	modified:   python_pipeline.ipy
no changes added to commit (use "git add" and/or "git commit -a")
Okay!  Now we're going to show you one other type of change you might
make to a repository, creating a new file.  Let's call this file,
documentation.txt, and let's put the line: TODO in by itself.
Again, please use your favorite editor for this, here's a command line
script for the purposes of this example.
echo "TODO" > documentation.txt
We're going to call git status one more time.
$ git status
On branch pipeline_fix
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
	deleted:    Lumi.2763.csv
	modified:   python_pipeline.ipy
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	documentation.txt
no changes added to commit (use "git add" and/or "git commit -a")
When we deleted and modified files already in the repository, git
status reported these as unstaged changes.  New files, however, are
untracked.  The difference is subtle.  New files have no Git
history, so Git ignores them by default (and tells you it is doing
so).
Sometimes it isn't enough to see that files have changed.  We want to
know how they changed!  The git diff command provides a report of
the differences in the working directory:
$ git diff
diff --git a/Lumi.2763.csv b/Lumi.2763.csv
deleted file mode 100755
index fb57a5b..0000000
--- a/Lumi.2763.csv
+++ /dev/null
@@ -1,10 +0,0 @@
-    ,  1   ,  2   ,  3   ,  4   ,  5   ,  6   ,  7   ,  8   ,  9   ,  10  ,  11  ,  12
-  A , 8053 , 9397 ,13857 ,  886 ,  985 , 1204 ,34036 ,34673 ,40062 , 1354 ,  944 , 1014
-  B , 6538 , 7707 , 7419 , 1098 , 1222 , 1221 ,30437 ,32832 ,31382 , 1217 ,  987 ,  967
-  C , 4734 , 4971 , 4376 , 1205 , 1386 , 1369 ,26260 ,24249 ,21131 , 1101 ,  871 ,  826
-  D , 3785 , 2310 , 3154 , 1088 , 1085 , 1038 ,14243 ,12519 ,13091 ,  699 ,  822 ,  772
-  E , 2538 , 1728 , 2338 , 1063 , 1182 , 1202 ,10260 , 6621 , 6756 ,  922 ,  857 ,  731
-  F , 1453 , 1128 , 1255 , 1022 ,  951 , 1302 , 2404 , 2900 , 2161 ,  760 ,  640 ,  617
-  G , 1409 , 1010 , 1080 , 1019 , 1309 ,  975 , 1943 , 1627 , 1495 ,  737 ,  701 ,  696
-  H , 1476 , 1025 , 1164 , 1120 , 1312 , 1310 , 1916 , 1559 , 1711 , 1040 ,  888 ,  927
-
diff --git a/python_pipeline.ipy b/python_pipeline.ipy
index 2ecc443..1c43ea6 100644
--- a/python_pipeline.ipy
+++ b/python_pipeline.ipy
@@ -43,3 +43,4 @@ colorbar()
 norm_data = h/(bgal_data-0.07)
 imshow(norm_data)
+# This is a comment.
We discussed diff output in the previous lesson.  Here git diff
is showing us the changes in our
current working directory.  Again, this is a lot of output, but for
the most part, we only pay attention to the lines that begin with a
- or a +.  We see that all of the lines in Lumi.2763.csv have
been deleted, which is what we would expect for a deleted file.  In
python_pipeline.ipy, our line #This is a comment. has been added.
The content of our new file is not shown.  Why is that?  Git assumes
that you only want to see the changes to files you've explicitly told
it about.  How do you tell Git that you'd like to track a new file?
We do this with the git add command, which we'll cover in the next
section.
git status and git diff commands take a few useful
flags.  Try calling git status -s and git diff --stat to see
terse summaries of the status of the repository and changes that
have been made.git show command provides this
capability, although the syntax is slightly awkward.  Try calling
git show HEAD:Lumi.2763.csv and git show master:Lumi.2763.csv,
then explain the difference between the two revision parameters,
master, and HEAD.  Consult
the man page for git revisions
for some gory details.So far, we have made three changes to the working directory. We deleted a file, we modified a file, and we created a new one. Let's suppose that we want to leave the deleted file alone for now, and save the other changes to our history. Since we frequently make changes across many of our files while we are working, but are only interested in saving some of these changes into each revision of the repository, Git introduces the concept of staging.
To communicate which changes we would like to save, we put them in the stage. Think of the stage like a set at a camera studio. We can add and remove files from the stage in preparation for saving our changes as a revision in history. Everything on the stage will be in the picture when it's taken. Keep in mind that Git-speak for "create a new revision with these changes I have staged" is commit. The word "commit" is shamelessly used both as a verb for referring to the process of creating revisions, and as a noun for referring to a specific revision.
Hint: You may see the words index and cache in the documentation or while reading about Git. These terms both refer to the stage!
Let's get our keyboards dirty!  To add files to the stage, we use the
git add command.  Let's try that now with our
untracked-until-right-now documentation.txt.
$ git add documentation.txt
Let's see what's changed in git status:
$ git status
On branch pipeline_fix
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
	new file:   documentation.txt
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
	deleted:    Lumi.2763.csv
	modified:   python_pipeline.ipy
documentation.txt has graduated from an untracked file to a change
ready to be saved in our next revision!  We can also use git add to
stage changes to modified files.
$ git add python_pipeline.ipy
$ git status
On branch pipeline_fix
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
	new file:   documentation.txt
	modified:   python_pipeline.ipy
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
	deleted:    Lumi.2763.csv
Let's see how this affected the output of git diff.
$ git diff
diff --git a/Lumi.2763.csv b/Lumi.2763.csv
deleted file mode 100755
index fb57a5b..0000000
--- a/Lumi.2763.csv
+++ /dev/null
@@ -1,10 +0,0 @@
-    ,  1   ,  2   ,  3   ,  4   ,  5   ,  6   ,  7   ,  8   ,  9   ,  10  ,  11  ,  12
-  A , 8053 , 9397 ,13857 ,  886 ,  985 , 1204 ,34036 ,34673 ,40062 , 1354 ,  944 , 1014
-  B , 6538 , 7707 , 7419 , 1098 , 1222 , 1221 ,30437 ,32832 ,31382 , 1217 ,  987 ,  967
-  C , 4734 , 4971 , 4376 , 1205 , 1386 , 1369 ,26260 ,24249 ,21131 , 1101 ,  871 ,  826
-  D , 3785 , 2310 , 3154 , 1088 , 1085 , 1038 ,14243 ,12519 ,13091 ,  699 ,  822 ,  772
-  E , 2538 , 1728 , 2338 , 1063 , 1182 , 1202 ,10260 , 6621 , 6756 ,  922 ,  857 ,  731
-  F , 1453 , 1128 , 1255 , 1022 ,  951 , 1302 , 2404 , 2900 , 2161 ,  760 ,  640 ,  617
-  G , 1409 , 1010 , 1080 , 1019 , 1309 ,  975 , 1943 , 1627 , 1495 ,  737 ,  701 ,  696
-  H , 1476 , 1025 , 1164 , 1120 , 1312 , 1310 , 1916 , 1559 , 1711 , 1040 ,  888 ,  927
-
Why do our staged changes no longer appear in the output of git
diff?  The answer to this question is subtle.  When we change
revisions using git checkout, Git sets the contents of our working
directory and the staging area to the content of the revision.  When
we call git diff, it is comparing our working directory to the
staging area.  If we want to see how our staged changes will look in
history after they've been commited, we add the --staged flag to the
git diff command.
$ git diff --staged
diff --git a/documentation.txt b/documentation.txt
new file mode 100644
index 0000000..1333ed7
--- /dev/null
+++ b/documentation.txt
@@ -0,0 +1 @@
+TODO
diff --git a/python_pipeline.ipy b/python_pipeline.ipy
index 2ecc443..1c43ea6 100644
--- a/python_pipeline.ipy
+++ b/python_pipeline.ipy
@@ -43,3 +43,4 @@ colorbar()
 norm_data = h/(bgal_data-0.07)
 imshow(norm_data)
+# This is a comment.
Sometimes we make a mistake and stage some changes we didn't mean to.
It would be easy to remember how to undo changes if Git just used the
commands stage and unstage.  Unfortunately, Linus Torvalds gave us
the confusing pair: add and reset.  If you think of the
photography studio metaphor, then resetting the stage to what it
looked like in the last photo kind of works, but it's still confusing.
Let's reset the staged version of python_pipeline.ipy.  This command
only affects the staging area, so the file will remain changed, but
the changes will go from staged to unstaged:
$ git reset python_pipeline.ipy
Unstaged changes after reset:
D	Lumi.2763.csv
M	python_pipeline.ipy
$ git status
On branch pipeline_fix
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
	new file:   documentation.txt
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
	deleted:    Lumi.2763.csv
	modified:   python_pipeline.ipy
-- flag is used to separate filenames and paths from
the other parameters in a git command line argument.  What is the
output of git diff -- python_pipeline.ipy?  What other commands
take a single filename or set of files as an argument?git add command.  Try using the command git rm Lumi.2763.csv.
To unstage this removal, the appropriate command is git reset --
Lumi.2763.csv.  How would you restore the last revision of
Lumi.2763.csv from the repository's history?git add,
unstaging them with git reset, deleting files with git rm,
restoring them with git checkout, and visualizing the staging area
with git status and git diff.git commit using the nano editor.git commit.git log --decorate to see revisions, how they differ.git reset or by
deleting them and recreating them in the appropriate positions.