Originally published at https://mathdatasimplified.com on July 1, 2023.
Git is a version control system widely used in software development, but is it the right choice for your data science project? Absolutely.
Here are some reasons why Git is invaluable for data science:
You replace the current data processing technique with a new approach. After realizing that the new approach is not producing the desired results, you want to revert to the previous working version.
Unfortunately, without version control, it becomes a daunting task to undo multiple changes.
With Git, you can track changes to your codebase, switch between different versions, compare changes, and roll back to a stable state if necessary.
You collaborate with other data scientists on a machine-learning project. Without version control, merging everyone’s changes means manually exchanging files and reviewing each other’s code, which takes time and effort.
Git makes it easy to merge changes, resolve conflicts, and synchronize progress, allowing you and your team members to work more efficiently together.
You want to explore new approaches to enhance your model’s performance but are hesitant to make changes directly to the production code. Any unintended impact on the deployed model could have significant consequences for your company.
With Git’s branching, you can create separate branches for different features. This allows you to test and iterate without compromising the stability of the production branch.
A hardware failure or theft results in the loss of all your code, leaving you devastated and setting you back months of work.
Git lets you back up your projects by pushing them to remote repositories. Even if such an unfortunate event occurs, you can restore your codebase from the remote repository and continue your work without losing significant progress.
Now that we understand the value of Git in a data science project, let’s explore how we can effectively use it in different scenarios.
To initialize Git in your current project and upload your project to a remote repository, follow these steps:
First, initialize a new Git repository in the project directory:
git init
Next, add a remote repository to your local Git repository. To use GitHub as the remote repository, create a new repository on GitHub and copy its URL.
Then, add the URL to your local Git repository with the name “origin”:
git remote add origin <repository URL>
Next, stage changes or new files in your Git repository:
# Add all changes in the current directory
git add .
Review the list of changes to be committed:
git status
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: .dvc/.gitignore
new file: .dvc/config
new file: .flake8
new file: .gitignore
new file: .pre-commit-config.yaml
new file: Makefile
new file: config/main.yaml
new file: config/model/model1.yaml
new file: config/model/model2.yaml
new file: config/process/process1.yaml
new file: config/process/process2.yaml
new file: data/final/.gitkeep
new file: data/processed/.gitkeep
new file: data/raw.dvc
new file: data/raw/.gitkeep
new file: docs/.gitkeep
new file: models/.gitkeep
new file: notebooks/.gitkeep
new file: pyproject.toml
new file: src/__init__.py
new file: src/process.py
new file: src/train_model.py
new file: tests/__init__.py
new file: tests/test_process.py
new file: tests/test_train_model.py
Save the staged changes permanently in the repository’s history along with a commit message:
git commit -m 'init commit'
Once your commits are made and stored in your local repository, you can share your changes with others by pushing them to a remote repository.
# push to the "main" branch on the "origin" repository
git push origin main
After running this command, the “main” branch on the remote repository will receive the latest changes from your local repository.
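The steps above can be tried end to end without a GitHub account. The following is a minimal sketch, assuming a local bare repository as a stand-in for the GitHub remote; the directory names, the sample file, and the user identity are placeholders invented for illustration.

```shell
set -eu

# Hypothetical scratch locations; a local bare repository stands in for GitHub.
workdir="$(mktemp -d)"
remote="$workdir/remote.git"
project="$workdir/my-project"
git init --quiet --bare "$remote"

# An illustrative project containing a single file.
mkdir -p "$project/src"
echo 'print("processing data")' > "$project/src/process.py"
cd "$project"

# Initialize the repository and register the remote under the name "origin".
git init --quiet
git remote add origin "$remote"

# Placeholder identity so the commit works without a global Git configuration.
git config user.name "Example User"
git config user.email "user@example.com"

# Stage everything, review what is staged, and commit.
git add .
git status --short
git commit --quiet -m 'init commit'

# Rename the current branch to "main" in case Git's default name differs,
# then push it to the remote.
git branch -M main
git push --quiet origin main

# List the branches that now exist on the remote.
git ls-remote --heads origin
```

With a real GitHub repository, the only change would be passing its URL to git remote add instead of the local path.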
Contribute to an Existing Project
To contribute to an existing project, start by creating a local copy of the remote Git repository on your local machine:
git clone <repository URL>
This command creates a new directory with the same name as the remote repository. To access the files, navigate to that directory:
cd <repository-name>
It is a good practice to make changes on a separate branch rather than the “main” branch to avoid any impact on the main codebase.
Create and switch to a new branch using:
git checkout -b <branch-name>
Make some changes to the new branch, then add, commit, and push the changes to the new branch on the remote Git repository:
git add .
git commit -m 'print finish in process_data'
git push origin <branch-name>
After pushing the commit, you can create a pull request to merge the changes into the “main” branch.
After your colleague approves and merges your pull request, your code will be integrated into the “main” branch.
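To make the contribution workflow concrete, here is a hedged sketch that runs entirely on your machine. It assumes a local bare repository in place of the shared remote, and the branch name "add-finish-message", the file contents, and the user identity are hypothetical. Opening the pull request itself happens on the hosting platform and is not shown.

```shell
set -eu

workdir="$(mktemp -d)"

# Stand-in for an existing remote project: a seed repository cloned as bare.
seed="$workdir/seed"
remote="$workdir/project.git"
git init --quiet "$seed"
cd "$seed"
git config user.name "Example User"
git config user.email "user@example.com"
printf '%s\n' 'def process_data():' '    print("processing")' > process.py
git add .
git commit --quiet -m 'init commit'
git branch -M main
git clone --quiet --bare "$seed" "$remote"

# 1. Clone the remote repository and move into the new directory.
cd "$workdir"
git clone --quiet "$remote" project
cd project
git config user.name "Example User"
git config user.email "user@example.com"

# 2. Create and switch to a separate branch (hypothetical name).
git checkout -q -b add-finish-message

# 3. Make a change, then add, commit, and push it to the same branch on the remote.
printf '%s\n' '    print("finish")' >> process.py
git add .
git commit --quiet -m 'print finish in process_data'
git push --quiet origin add-finish-message

# The remote now has both "main" and the new branch.
git ls-remote --heads origin
```

The “main” branch on the remote is untouched until the pull request is merged.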
Merge Local Changes with Remote Changes
Imagine that you have created a branch called “feat-2” from the main branch. After making several changes to the “feat-2” branch, you discover that the main branch has been updated. How do you merge the remote changes from the main branch into the local branch?
First, make sure your local work is saved by staging and committing local changes.
git add .
git commit -m 'commit-2'
This prevents the remote changes from overwriting your work.
Next, pull the changes from the main branch on the remote repository using git pull. When executing this command for the first time, you will be prompted to choose a strategy for reconciling the branches. Here are the available options:
$ git pull origin main
* branch main -> FETCH_HEAD
hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint: git config pull.rebase false # merge
hint: git config pull.rebase true # rebase
hint: git config pull.ff only # fast-forward only
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also pass --rebase, --no-rebase,
hint: or --ff-only on the command line to override the configured default per
fatal: Need to specify how to reconcile divergent branches.
git pull origin main --no-rebase will create a new merge commit in the “feat-2” branch that ties together the histories of the “main” branch and the “feat-2” branch.
git pull origin main --rebase will perform a rebase operation, which places the commits from the “feat-2” branch on top of the “main” branch.
Unlike merge, rebase does not create a merge commit. Instead, it rewrites the commits of the “feat-2” branch so that they follow the latest commit on the “main” branch, which results in a cleaner, linear commit history.
However, rebase should be used with caution, particularly when other team members are actively using the same branch, such as the “feat-2” branch.
If you rebase your “feat-2” branch while others are also working on it, it can lead to inconsistencies in the branch history. Git may face difficulties when attempting to synchronize these divergent branches.
If you’re new to Git and prioritize simplicity over maintaining a clean history, use the merge approach as the default option as it is generally easier to understand and use compared to rebase.
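The difference between the two strategies is easiest to see side by side. The sketch below is an illustration under several assumptions: a local bare repository plays the remote, a second clone plays a teammate who updates “main”, and a copy of the branch named "feat-2-rebase" (a hypothetical name) is used so both options can be tried from the same starting point. The --no-edit flag accepts the default merge message so the script does not open an editor.

```shell
set -eu
workdir="$(mktemp -d)"

# Hypothetical helper: placeholder identity for the repository in the current directory.
set_identity() {
    git config user.name "Example User"
    git config user.email "user@example.com"
}

# Remote stand-in with one commit on "main".
git init -q "$workdir/seed"
cd "$workdir/seed"
set_identity
echo "v1" > model.txt
git add . && git commit -q -m 'commit-1'
git branch -M main
git clone -q --bare "$workdir/seed" "$workdir/remote.git"

# Your clone: create "feat-2" and save local work as commit-2.
git clone -q "$workdir/remote.git" "$workdir/you"
cd "$workdir/you"
set_identity
git checkout -q -b feat-2
echo "new feature" > feature.txt
git add . && git commit -q -m 'commit-2'
git branch feat-2-rebase   # copy of the branch for the rebase variant

# Meanwhile, a teammate updates "main" on the remote.
git clone -q "$workdir/remote.git" "$workdir/teammate"
cd "$workdir/teammate"
set_identity
echo "docs" > README.md
git add . && git commit -q -m 'update main'
git push -q origin main

# Option 1: merge. A merge commit ties the two histories together.
cd "$workdir/you"
git pull -q --no-rebase --no-edit origin main
git log --oneline --graph

# Option 2: rebase. commit-2 is replayed on top of the updated "main".
git checkout -q feat-2-rebase
git pull -q --rebase origin main
git log --oneline --graph
```

The first log shows a fork that is joined by a merge commit, while the second shows a single straight line ending in commit-2.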
Revert Back to the Previous Commit
Imagine this: after creating new commits, you realize that they contain errors and want to go back to a specific commit. How do you do that?
Start by identifying the hash of the commit you want to return to by running:
git log
Suppose you want to go back to “commit-1”. You can use either git revert or git reset.
git revert creates new commits that undo the changes introduced by the commits you specify, in this case the commits made after “commit-1”.
git reset modifies the commit history by changing the branch pointer to the specified commit.
While git reset keeps the commit history clean, it is more destructive because it discards commits.
git revert is a safer option as it leaves the original commits intact.
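Here is a minimal sketch of both options on a throwaway repository. The file name, its contents, the user identity, and the "with-reset" branch (a copy used to try the second option) are assumptions made for illustration. In practice you would copy the commit hash from the git log output rather than capture it in a variable.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# commit-1 is the last good state; commit-2 and commit-3 contain errors.
echo "v1" > model.txt
git add . && git commit -q -m 'commit-1'
git branch -M main
target="$(git rev-parse HEAD)"   # hash of commit-1
echo "v2 (buggy)" > model.txt
git commit -q -am 'commit-2'
echo "v3 (buggy)" > model.txt
git commit -q -am 'commit-3'
git branch with-reset            # copy of the branch for the reset variant

# Option 1: revert every commit made after commit-1.
# History grows by two commits, and the original commits stay intact.
git revert --no-edit "$target"..HEAD
git log --oneline
cat model.txt

# Option 2: reset the branch pointer to commit-1.
# commit-2 and commit-3 are dropped from this branch's history.
git checkout -q with-reset
git reset -q --hard "$target"
git log --oneline
cat model.txt
```

Both branches end with the same file content, but only the reverted branch still records what happened in between.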
Ignore Large and Private Files
In a Git repository, it’s essential to exclude specific files or directories from version control to address issues like large file sizes and privacy concerns.
In a data science project, there are certain files you should ignore, such as datasets and secrets, for the following reasons:
- Datasets: Versioning binary datasets can significantly increase the repository’s size.
- Secrets: Data science projects often require credentials or API keys for accessing external services. Including these secrets in the codebase can pose a security risk if the repository is compromised or publicly shared.
To exclude specific files or directories, you can add them to the .gitignore file located in the root directory of your project. Here are some examples:
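As a hedged illustration, the script below writes a small .gitignore and uses git check-ignore to confirm which files Git skips. The entries and file names are hypothetical and should be adapted to your own project layout.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
git init -q

# Illustrative entries only; the paths are assumptions, not a recommended standard.
cat > .gitignore <<'EOF'
# Datasets
data/raw/
*.csv

# Secrets
.env
EOF

# Sample files: a dataset, a secrets file, and source code.
mkdir -p data/raw src
echo "a,b" > data/raw/train.csv
echo "API_KEY=not-a-real-key" > .env
echo 'print("hi")' > src/process.py

# Show which pattern causes each file to be ignored.
git check-ignore -v data/raw/train.csv .env

# Only .gitignore and the source file end up staged.
git add .
git status --short
```

Note that .gitignore only affects untracked files. A file that was already committed stays tracked until it is removed from the index.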
Additionally, you should ignore non-essential files that can contribute to large file sizes or are specific to your development environment, such as virtual environment directories like “venv” or editor-specific directories like “.vscode”.
Find a list of useful .gitignore templates for your language here.
Break down your changes into small, focused commits. This approach ensures that each commit has a clear purpose, which makes it easier to understand and revert if needed, and reduces the chance of conflicts.
Opt for descriptive branch names that accurately reflect the task or feature you’re working on. Avoid vague names like “add file” or personal identifiers like “john-branch.” Instead, choose more descriptive names such as “change-linear-model-to-tree-model” or “encode-categorical-columns.”
Standardize Code Format for Easier Code Review
Consistent code formatting helps reviewers focus on the logic of the code rather than formatting inconsistencies.
In the example code snippet below, it is challenging for reviewers to pinpoint the addition of the print statement due to irregular indentation, spacing, and quotation marks.