How Git works

If you don't have the time or motivation to read the official docs, you're in the right place. This guide is not meant to replace the official docs or the Git Book; do refer to those whenever something remains unclear or you need more details. Rather, we will focus on the fundamentals: the things that, in our experience, are genuinely useful to know but often aren't known. Actually understanding what you're doing will help you avoid common mistakes and inefficient workflows that stem from false assumptions and wrong mental models.

Repository and Working Tree

When you create or clone a Git repository, you end up with a folder that contains your source code (or other) files and a folder named .git, which is hidden by default on some operating systems.

The folder containing your project's source files is called the working tree, because that's where you do your work and edit your files.

The .git folder is called a repository. It contains a database of versions of your project and the different versions of the source files associated with each of those project versions, as well as some other important data.

We will look at the anatomy of a Git repository in more detail later on, but for now we just want to clarify the difference between the repository and the working tree. Colloquially, people sometimes say repository when they mean the working tree, or the working tree and repository together. This is fine, but it's still important to know the difference even when we're less precise in informal conversation. The repository does not contain a .git folder; the .git folder is the repository. The working tree is not the repository; it resides next to the repository, and it is basically just a representation of your files from the version you're currently working on.
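
To see the two locations for yourself, you can ask Git where it thinks the repository and the working tree are. This is a minimal sketch that assumes a throwaway repository in a temporary directory; the variable names are made up for illustration.

```shell
# Create a throwaway repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
cd "$repo"
git init -q

# The repository: the .git folder, printed relative to the current directory.
git_dir="$(git rev-parse --git-dir)"
echo "repository:   $git_dir"

# The working tree: the top-level folder that holds your checked-out files.
work_tree="$(git rev-parse --show-toplevel)"
echo "working tree: $work_tree"
```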

Versions, revisions, commits

Each time we want to "save our game", i.e., the progress in our text or source code files, Git creates a new version of the project in a database inside the repository. These versions are usually referred to as revisions or commits, which are created by committing a new version to the repository. Each commit is identified by a unique ID and references its parent via the parent's ID, but the references between commits are one-directional. Parent commits do not reference their children.

You can conceptually imagine it like this:

ID:      7d27fb7 <-- f45409d <-- ac47640
Parent:   [none]    [7d27fb7]   [f45409d]

Commits can have 1 parent, 0 parents (root commits, like 7d27fb7 in the example above), or arbitrarily many (merge commits, covered in more detail later), but in practice most commits have 1 parent and most merge commits have 2. Because the reference to a commit's parent(s) is one-directional, and because commits are immutable (and thus can only reference parent commits that already existed when they were created), the commits form a directed acyclic graph (DAG), usually referred to as the commit graph.
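
The parent references can be inspected directly. The following sketch builds a chain of three empty commits in a throwaway repository and prints each commit next to its parent; the commit messages and variable names are only illustrative.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"

# Three empty commits are enough to build a small chain.
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
git commit -q --allow-empty -m "third"

# %h = abbreviated commit ID, %p = abbreviated parent ID(s), %s = subject line.
git log --format='%h  parent: [%p]  %s'

# HEAD~2 means "two steps back from the current commit", i.e. the first commit.
# %P prints the full parent ID(s); it is empty for a commit without parents.
root_parents="$(git log -1 --format=%P HEAD~2)"
tip_parents="$(git log -1 --format=%P HEAD)"
```

Note that the output lists the newest commit first, and that there is no format placeholder for "children": a commit only knows its parents.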

Commits are snapshots

There's a common misconception that commits represent just the changes since the previous version, but although Git uses some tricks to avoid storing unchanged files redundantly, each commit still references all of its files, and conceptually it's a snapshot of the whole project at that point in time.
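
We can check the snapshot model with a small experiment. In this sketch (file names and contents are made up), the second commit changes only one of two files, yet it still lists both, and the unchanged file refers to the very same stored object in both commits.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"

echo "one" > a.txt
echo "two" > b.txt
git add a.txt b.txt
git commit -q -m "Add a and b"

echo "one, changed" > a.txt
git commit -q -am "Change only a"

# The second commit still lists both files, not just the changed one.
git ls-tree -r --name-only HEAD

# The unchanged file is the same stored object in both commits...
unchanged_before="$(git rev-parse HEAD~1:b.txt)"
unchanged_after="$(git rev-parse HEAD:b.txt)"
# ...while the changed file got a new one.
changed_before="$(git rev-parse HEAD~1:a.txt)"
changed_after="$(git rev-parse HEAD:a.txt)"
```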

Commit IDs

Each commit is uniquely identified by the SHA-1 hash of its contents and metadata. Commits are immutable, and any Git operation that "changes" a commit actually creates a copy to which the changes are applied, and the copy gets its own ID.
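
As a small illustration of immutability, the sketch below "fixes" a commit message with git commit --amend. That command is not covered in this section; it is used here only as a convenient example of an operation that appears to change a commit. The file name and messages are made up.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"

echo "hello" > hello.txt
git add hello.txt
git commit -q -m "Add helo.txt"            # note the typo in the message
old_id="$(git rev-parse HEAD)"

# "Fixing" the message does not modify the commit; it creates a new one.
git commit -q --amend -m "Add hello.txt"
new_id="$(git rev-parse HEAD)"

echo "before: $old_id"
echo "after:  $new_id"

# The original commit object still exists in the object database.
git cat-file -t "$old_id"                  # prints: commit
```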

Here's what a commit ID might look like:

6fb41fb6820fd9afce0336b7c82386625095f4cc

When you need to identify a commit by ID, it is usually enough to specify just the first few hex digits, as long as they uniquely match one commit. The minimum is 4 digits. When the IDs of multiple commits begin with the digits you provided, Git rejects the abbreviation as ambiguous, and you need to provide additional digits to be more specific. Many tools display the initial 7 digits as a short-form commit ID.
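
A quick way to see both forms side by side, sketched in a throwaway repository with illustrative variable names:

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"
git commit -q --allow-empty -m "example"

full_id="$(git rev-parse HEAD)"
short_id="$(git rev-parse --short HEAD)"
echo "full:  $full_id"
echo "short: $short_id"

# Git expands an unambiguous prefix back to the full ID.
resolved="$(git rev-parse "$short_id")"
echo "resolved from short form: $resolved"
```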

SHA-1 hashes are unique in practice

SHA-1 is not considered cryptographically secure anymore, but the likelihood of ever getting a hash collision by accident is still mathematically negligible and not a concern in practice.

Understanding Refs

In the previous example we've identified commits by their IDs, but often it's easier to refer to them by using more semantic names. There are different kinds of refs ("references"), including branches, tags, and a few others, that can be used as aliases for a commit, or for different commits at different times. The most important refs will be explained below. What all of them have in common is that they point at a commit, and they have a name that can be used as an alias for that commit.

7d27fb7 <-- f45409d <-- ac47640
    \                       \
     my_ref                  my_other_ref

Often refs point directly at commits, but some refs can point at other refs and may transitively reference a commit. Here, both my_ref and my_ref_2 point at 7d27fb7, while my_indirect_ref transitively points at ac47640.

7d27fb7 <-- f45409d <-- ac47640
    \                       \
     my_ref, my_ref_2        my_other_ref
                              \
                               my_indirect_ref

The object database containing commits, files, and a virtual representation of the file system, combined with refs being created and moved around on the commit graph, is what makes the repository work the way it does.

How refs are stored

The way Git keeps track of refs is very simple.

% cat .git/refs/heads/my-branch
56151741b29574580e5a5533597b4197572bcc58

In the .git folder, there's a folder named refs which contains simple text files (some of them in further sub-folders). The file's name is the ref's name, and the content is the referenced commit's ID (or, in some cases, another ref). You'll probably never need to manually change or even look at these, but it can help to understand how refs actually work.
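
The following sketch compares the content of a branch's ref file with what Git itself reports. It assumes the ref is stored as a plain ("loose") file, which is what we would expect right after a commit in a freshly created repository with default settings; Git can also store refs in other ways, in which case the file may not exist.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"
git commit -q --allow-empty -m "example"

# Full name of the branch we are on, e.g. refs/heads/main.
branch_ref="$(git symbolic-ref HEAD)"

# Read the ref file directly, and ask Git to resolve the same ref.
from_file="$(cat ".git/$branch_ref")"
from_git="$(git rev-parse "$branch_ref")"

echo "file content: $from_file"
echo "git says:     $from_git"
```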

HEAD

When you work in a Git repository, you base your next changes on one of the existing versions, often the most recent one. But how does Git know at which commit we currently are? This is where the HEAD comes in. The HEAD is a ref that points at our current commit, like a cursor. When we switch to an older version, or we add a new commit, the HEAD moves to that commit. As we move around on the commit graph, the working tree is updated to contain the files of the commit referenced by the HEAD.

Conceptually, you can imagine it like this:

7d27fb7 <-- f45409d <-- ac47640
                            \
                             HEAD

If we now add a new commit, it will be made to point at the commit currently referenced by HEAD (ac47640 in this case), and then HEAD will automatically be moved to point at the new commit da6d8a7:

7d27fb7 <-- f45409d <-- ac47640 <-- da6d8a7
                                        \
                                         HEAD

We can move around on the commit graph, switching to any version we want. git switch --detach f45409d moves the HEAD to f45409d and checks out all of the associated files into the working tree (the --detach option is required here for reasons that will be explained later):

7d27fb7 <-- f45409d <-- ac47640 <-- da6d8a7
                \
                 HEAD
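
To watch the working tree follow the HEAD, we can try this in a throwaway repository. The file name and contents below are made up for illustration.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"

echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "first"
first_id="$(git rev-parse HEAD)"

echo "version 2" > notes.txt
git commit -q -am "second"

cat notes.txt                         # prints: version 2

# Move HEAD to the first commit; the working tree is updated to match.
git switch -q --detach "$first_id"
cat notes.txt                         # prints: version 1
```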

switch or checkout?

`git checkout f45409d` is an older command that is mostly equivalent to `git switch --detach f45409d`. `git switch` was added in 2019 (Git 2.23) because `checkout` was overloaded with different functionality in different contexts, which made it confusing for beginners and more prone to mistakes even for experienced users.

In this guide, we will use the more modern, intended replacement for switching branches since it provides a better UX (if you're not already used to the old way) and additional safety. For example, with `git switch`, detaching the HEAD is less likely to happen accidentally and more explicit as an intentional action, among other things.

If you're an experienced Git user and already used to the old `git checkout`, you can keep using it, although you might want to read up on the differences and advantages. If you're new to Git, you will learn the modern, intended use from the beginning. If you're quite new and have learned about `git checkout` relatively recently, you might want to use this opportunity to create a new habit while the muscle memory is still flexible.

Branches

A branch is a line of development where one or more developers develop the codebase in a certain direction by adding commits on top of each other's changes. When working with multiple branches, they can "grow" in separate directions, resulting in versions that don't share the same history; for example, imagine part of the team prototyping a new feature by building upon a certain version, but separate from the main development.

Each branch is identified by a branch head (a kind of ref) pointing at the tip of the branch and includes the referenced commit and all of its ancestors (transitive parents), forming a sub-graph within the whole commit graph. When working on a branch, HEAD points at the current branch's head (and thus indirectly at the commit at the tip of the branch).

In the following illustration, branch_1 includes the commits 7d27fb7, f45409d, and ac47640, whereas branch_2 includes the commits 7d27fb7, f45409d, and 2443d64. They share part of the history, and they branched off at their common parent f45409d.

                       ,-- ac47640
                      /        \
7d27fb7 <-- f45409d <-          branch_1
                      \
                       `-- 2443d64
                               \
                                branch_2

When working on a branch, our HEAD points to the head of that branch.

                       ,-- ac47640
                      /        \
7d27fb7 <-- f45409d <-          branch_1
                      \          \
                       \          HEAD
                        \
                         `-- 2443d64
                                 \
                                  branch_2

When adding a commit to the current branch, the branch and (transitively) the HEAD are moved to point at the new commit, as if the HEAD were attached to that branch.

                       ,-- ac47640 <-- da6d8a7
                      /                    \
7d27fb7 <-- f45409d <-                      branch_1
                      \                      \
                       `-- 2443d64            HEAD
                               \
                                branch_2
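
The situation in the illustrations can be recreated in a throwaway repository. This is only a sketch: it uses empty commits, so the IDs will differ from the ones shown above, and it names the first branch branch_1 explicitly so that it matches the illustration.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"
# Name the initial (still empty) branch branch_1, regardless of the configured default.
git symbolic-ref HEAD refs/heads/branch_1

git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
fork_point="$(git rev-parse HEAD)"
git commit -q --allow-empty -m "third on branch_1"

# Create branch_2 at the fork point, attach HEAD to it, and add a commit there.
git switch -q -c branch_2 "$fork_point"
git commit -q --allow-empty -m "third on branch_2"

# Show the whole commit graph with both branch heads.
git log --graph --oneline --all

# The most recent commit that both branches have in common.
common="$(git merge-base branch_1 branch_2)"
```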

When resetting a branch, the branch and (transitively) the HEAD are moved to point at whichever commit you want to reset the branch head to. If this causes any commits to no longer be (transitively) referenced by any ref, they become dangling commits. They will be garbage collected at some point in the future, but until then, they still exist in the repository and are still part of the commit graph.

                       ,-- ac47640 <-- (da6d8a7)
                      /        \
7d27fb7 <-- f45409d <-          branch_1
                      \          \
                       \          HEAD
                        \
                         `-- 2443d64
                                 \
                                  branch_2

Accidentally lost some commits?

If you accidentally reset a branch to the wrong commit and "lose" some commits that you actually wanted to keep, they are not actually gone, at least not immediately, and you can simply reset the branch back as long as you know the correct commit ID. We will cover later how we can find out the commit IDs of dangling commits.
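
Here is a minimal sketch of that situation, assuming we noted the ID of the dropped commit beforehand (the variable name is made up):

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"

git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
dropped_id="$(git rev-parse HEAD)"

# Move the current branch (and HEAD with it) back by one commit.
git reset -q --hard HEAD~1

# The dropped commit no longer appears in the branch history...
git log --oneline
# ...but the commit itself still exists in the repository.
git cat-file -t "$dropped_id"         # prints: commit

# Knowing its ID, we can simply move the branch back onto it.
git reset -q --hard "$dropped_id"
```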

When switching to a different branch, the HEAD is moved there so you can work on that branch instead. As you can see, there are operations that move the current branch along with the HEAD (e.g., commit, reset) and operations that move the HEAD alone (e.g., switch, checkout).

                       ,-- ac47640 <-- da6d8a7
                      /                    \
7d27fb7 <-- f45409d <-                      branch_1
                      \
                       `-- 2443d64
                               \
                                branch_2
                                 \
                                  HEAD

If you just want to switch to some specific version (let's say a colleague asked you if you can reproduce a bug in commit ac47640), there's no reason to create a branch head at that commit. Instead, we simply git switch --detach ac47640, thus moving the HEAD to that commit. There's rarely a reason to create a branch unless you intend to actively commit changes to it. If we just want to check something in version ac47640 and then continue working on branch_1, we simply switch to ac47640, and once we're done, we switch back to our branch.

                       ,-- ac47640 <-- da6d8a7
                      /        \           \
7d27fb7 <-- f45409d <-          HEAD        branch_1
                      \
                       `-- 2443d64
                               \
                                branch_2

Detached HEAD

When you switch to a commit directly, meaning you're not on a branch at that moment, Git tells you that your HEAD is detached.

% git status
HEAD detached at 7d56e31

If you don't know what that means, it may sound like some error state. When you search the web for "Git detached HEAD", you probably get many results about how to "recover" from a detached HEAD. However, it's not some error state you need to "recover" from (although in some cases it might be accidental, it is not an error itself). The HEAD can simply either be attached to a branch head (for when you actively commit to that branch), or detached from any branches (when you move around on the commit graph switching to specific commits). When you intend to actively add commits again, you simply create a new branch at the HEAD or switch back to an existing branch.

We git switch --detach <hash or ref> to specific commits to move around on the commit graph, and we git switch <branch> to attach to a branch when we intend to add commits to it.
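
The difference between an attached and a detached HEAD is visible in the HEAD file itself. The sketch below assumes Git's default way of storing refs as plain files; the variable names are illustrative.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"
git commit -q --allow-empty -m "first"
branch="$(git symbolic-ref --short HEAD)"

# Attached: HEAD contains the name of a branch head, e.g. "ref: refs/heads/main".
cat .git/HEAD

# Detached: HEAD contains a commit ID directly.
git switch -q --detach HEAD
cat .git/HEAD

# git symbolic-ref -q exits with a non-zero status when HEAD is detached.
if git symbolic-ref -q HEAD >/dev/null; then state="attached"; else state="detached"; fi
echo "HEAD is $state"

# Attach HEAD to the branch again.
git switch -q "$branch"
```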

A word on branching

Overusing branches can have serious but counter-intuitive implications for the quality and efficiency of your software development, depending on whether you work on a team or in open source. Old habits are hard to overcome, especially in big organizations. Avoid simulating an open source project when you're actually working in a team of collaborating, trusted professionals. More on that later. For now, just know that even though Git makes branching easy, you should still avoid it when you can.

Remote-tracking Branches

When working with others, you would often have a shared repository hosted somewhere, like on a company network using an open-source solution or on some hosting service like GitLab, GitHub, Codeberg, or one of many alternative providers. That shared repository is just a normal repository (although often a bare one, i.e., without a working tree, since nobody edits files in it directly) and is configured in your local repository as a remote (usually named origin by convention).

There can be multiple remotes

Although usually you would have just one, you can add arbitrarily many remotes. Good reasons for having multiple remotes include:

Working with a fork, where you fetch from the original remote and push to your fork remote.
Migrating a repository from one hosting service to another, where you would fetch/clone the commit graph from the remote at hosting service A and then push the whole commit graph to an empty repository at hosting service B.
Adding the commit history of an internal tool into the project's main repository, where you fetch from one remote and then push to the other, merging both histories together.

When we fetch (git fetch) from a remote, Git "downloads" the parts of the commit graph that are not already present locally, so that afterwards we have all commits available. Remote-tracking branches, another kind of ref, are created (or moved) to point at the same commits as their counterparts in the remote.

In this example, the remote-tracking branch origin/main points at ac47640, because that's where main in the remote repository pointed the last time we fetched.

Remote:

7d27fb7 <-- f45409d <-- ac47640 <-- da6d8a7
                                        \
                                         main

Local (before fetching):

7d27fb7 <-- f45409d <-- ac47640
                            \
                             main, origin/main

Remote-tracking branches are just local refs automatically managed by Git

Like the other kinds of refs we've looked at before, remote-tracking branches are stored in simple text files. Unlike local branches, they are automatically updated by Git, mostly when we fetch.

% cat .git/refs/remotes/origin/main
3cd3a74211f3ca32543bcec267192151628b781a

When we fetch, Git downloads the commit da6d8a7 (and all of the associated file versions) into the object database of the local repository and moves the remote-tracking branch to point at da6d8a7, because that's where main in the remote points.

Local (after fetching):

7d27fb7 <-- f45409d <-- ac47640 <-- da6d8a7
                            \           \
                             main        origin/main

Now we can, for example, switch to the newly fetched commit (git switch --detach origin/main).

7d27fb7 <-- f45409d <-- ac47640 <-- da6d8a7
                            \           \
                             main        HEAD, origin/main

We can git show the diff and metadata of the newly fetched commit (git show origin/main).

% git show origin/main
commit da6d8a7259d8dbd468724886cd48e8bc2b359172 (origin/main)
Author: Bob

    Hello world

diff --git a/hello.txt b/hello.txt
new file mode 100644
index 0000000..a5c1966
--- /dev/null
+++ b/hello.txt
@@ -0,0 +1 @@
+Hello, world

Or we can fast-forward merge it with our local main, which in this case is equivalent to resetting the local main branch head to the commit referenced by origin/main (assuming we don't have any uncommitted local changes). We will cover different ways of merging later on.

Local (after `git merge --ff-only origin/main`):

7d27fb7 <-- f45409d <-- ac47640 <-- da6d8a7
                                        \
                                         main, origin/main
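
The whole fetch and fast-forward sequence can be tried without any hosting service. In this sketch, a bare repository in a temporary directory stands in for the remote, and the folder names colleague and mine are made up: one working copy publishes commits, the other one fetches them.

```shell
base="$(mktemp -d)"

# A bare repository stands in for the hosted remote; make its HEAD refer to main.
git init -q --bare "$base/remote.git"
git --git-dir="$base/remote.git" symbolic-ref HEAD refs/heads/main

# A colleague publishes the first commit.
git init -q "$base/colleague"; cd "$base/colleague"
git symbolic-ref HEAD refs/heads/main
git config user.name "Colleague"; git config user.email "colleague@example.com"
git remote add origin "$base/remote.git"
git commit -q --allow-empty -m "first"
git push -q origin main
first_id="$(git rev-parse HEAD)"

# We clone the remote; main and origin/main both point at the first commit.
git clone -q "$base/remote.git" "$base/mine"

# The colleague publishes a second commit that we don't know about yet.
git commit -q --allow-empty -m "second"
git push -q origin main
second_id="$(git rev-parse HEAD)"

cd "$base/mine"
before_fetch="$(git rev-parse origin/main)"   # still the first commit

# Fetching moves origin/main, but leaves our local main alone.
git fetch -q origin
after_fetch="$(git rev-parse origin/main)"
local_main="$(git rev-parse main)"

# Fast-forwarding moves our local main to where origin/main points.
git merge -q --ff-only origin/main
merged_main="$(git rev-parse main)"
```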

Remote-tracking branches are not remote branches

Remote-tracking branches track remote branches. origin/main is NOT the remote branch itself; it's just a ref that exists locally.

Many developers misunderstand this difference. Out of fear of accidentally changing something directly in the remote repository, they avoid working with remote-tracking branches and instead always update a local branch to the same version as the remote-tracking branch. This not only adds inefficient extra steps; by manually managing a branch that they never intend to commit to (basically tracking a remote-tracking branch which already tracks the remote branch), they also open themselves up to even more potential mistakes, like committing and pushing to the wrong branch.

It's not possible to directly commit a change to a remote branch. The only way to make changes to a remote is by pushing, which we will look into later. It's safe and completely normal to use remote-tracking branches (like origin/main) as aliases for the commits they point at.

Reflog

A Git repository doesn't forget. Let's say you add a commit to your branch. You then switch to another branch. You reset the branch to a different version, dropping some commits. All of these changes to refs are logged by Git in the reflog. This can be a real life saver!

Let's say I just created two commits, eed2c23 and dc3613b. I then realized the second one isn't actually needed and I want to drop it from my branch by resetting the branch to an earlier commit. However, I mistype the command and accidentally drop both commits.

Here's what git reflog main tells me:

% git reflog main
e8325c4 (HEAD -> main) main@{0}: reset: moving to e8325c4
dc3613b main@{1}: commit: Create another example
eed2c23 main@{2}: commit: Create an example
e8325c4 (HEAD -> main) main@{3}: commit: Hello, world

The reflog shows exactly what caused main to change: three commits were added to main, and then main was reset to e8325c4.

Looking at the graph, we can see that eed2c23 and dc3613b got dropped from main.

% git log --graph --oneline
* e8325c4 (HEAD -> main) Hello, world

Here's the thing: those two commits still exist. They are now dangling commits in the repository, unreachable from any ref, waiting to be garbage collected. But as long as we know the ID, we can still create a new ref (e.g., a tag) or move one onto them (e.g., reset a branch head). The reflog helps us find out the commit IDs.

In this case, we see from the reflog that dc3613b is the commit we intended to drop, and eed2c23 is the one we accidentally dropped. We can now use these IDs to get additional information.

% git log --graph --oneline dc3613b
* dc3613b Create another example
* eed2c23 Create an example
* e8325c4 (HEAD -> main) Hello, world

Once we've confirmed what we intended to do, we can reset main to eed2c23.

% git reset --hard eed2c23
HEAD is now at eed2c23 Create an example
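
The scenario above can be replayed in a throwaway repository. This sketch uses empty commits, so the IDs will differ from the ones shown. Instead of copying an ID from the reflog output by hand, it uses the main@{1} notation that appears in that output to refer to the previous position of main.

```shell
repo="$(mktemp -d)"; cd "$repo"; git init -q
git config user.name "Example"; git config user.email "example@example.com"
# Make sure the branch is called main, regardless of the configured default.
git symbolic-ref HEAD refs/heads/main

git commit -q --allow-empty -m "Hello, world"
git commit -q --allow-empty -m "Create an example"
wanted_id="$(git rev-parse HEAD)"
git commit -q --allow-empty -m "Create another example"

# The accident: dropping two commits instead of one.
git reset -q --hard HEAD~2

# The reflog lists every commit that main has pointed at, newest change first.
git reflog main

# main@{1} is where main pointed before the reset; its parent (~1) is the
# commit we actually wanted to keep.
recovered_id="$(git rev-parse 'main@{1}~1')"
git reset -q --hard "$recovered_id"
```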

We will cover reflog in more detail later on. There's a lot more we can do with it. For now, just remember that it can help us recover from a lot of accidents, and although with good habits and practices you won't need to use it much on a daily basis, it's always good to at least know that it's there just in case. Even though it's relatively little known, and some might think of it as some advanced Git feature, in this guide we consider it a fundamental feature of Git that every developer should know about because of the additional version control safety and protection against damage from honest mistakes.

Do not panic!

When you've made a mistake, resist the urge to try random commands from web searches, Stack Overflow, or even LLMs. Chances are you'll make things worse.

Calmly consult the reflog, use git show <ID>, git log <ID>, and git switch --detach <ID> to move HEAD around and understand the situation before resetting any branches.

When in doubt, if possible, ask a developer who's experienced with Git. Don't be like the person in the XKCD cartoon. The worst thing you can do is delete your repository and create a fresh clone: that simply wipes out all reflog data along with all uncommitted and unpushed work.