Software Engineering
The most valuable skill in your PhD is going to be good software engineering etiquette. In this part of the handbook, we'll introduce the tools to help you strengthen your programming game.
Version Control
Version control is just a way to keep track of changes to your files over time. Git is the most popular version control software. With Git, you can put all of your code in a container called a repository (essentially a directory whose history Git tracks), and any change you make in that directory can be saved with an associated timestamp. Therefore, if you start changing code that was working, and somewhere along the way it stops working, you can always revert to the working version without any issue.
Keeping a copy of your history isn't the only value of version control software. In fact, its greater strength is its ability to enable concurrent collaboration on the same code. Two users can each have the code on their own computers, making separate changes, and Git provides a safe process for merging the two different versions together.
Without getting into all the weeds, the fundamental action that you can take in Git is to commit small edits to your code. If I change one or two lines in a big script, I can commit those changes to the repository. Often you'll make commits on your own branch of the repository. A branch is effectively a copy of the code that is typically worked on by one user. Therefore, I might have my "john-branch" to which I commit my changes, and my peer might have an "alison-branch" to which she commits her changes. All the while, there is something called the main branch, which is the official version of the code, and when we are done developing our respective changes, we'll want to merge our changes onto the main branch.
To do this, you should rebase your local branch onto the main branch. Rebasing just means applying each commit you've made on your branch, sequentially, on top of the latest main branch. Once you've done this, your branch contains everything on main plus all of the new edits that you've supplied, so main can be updated to match it, and your job is to push the updated main branch up to the cloud on a platform called GitHub. GitHub stores these shared repositories in accessible web space, so your colleagues can pull the latest changes down from the cloud to update their version, and apply their changes to the latest version of main.
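To make "applying each commit sequentially onto main" concrete, here is a toy model in Python. It is only a sketch: real Git tracks file contents and has to deal with conflicts, and the `Commit` class, the `rebase` function, and the commit messages are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A toy commit: a message plus the one-line change it introduces."""
    message: str
    change: str


def rebase(branch_commits, main_commits, fork_point):
    """Replay the branch's own commits, in order, on top of the tip of main.

    fork_point is how many commits main had when the branch was created,
    so branch_commits[fork_point:] are the commits unique to the branch.
    """
    own_commits = branch_commits[fork_point:]
    rebased = list(main_commits)  # start from the latest main
    for commit in own_commits:    # apply each commit sequentially
        rebased.append(commit)
    return rebased


# main had one commit when john-branch was created ...
base = [Commit("initial script", "add analysis.py")]
john_branch = base + [Commit("fix units", "convert km to m")]
# ... and Alison's work landed on main in the meantime.
main = base + [Commit("add plotting", "add plot.py")]

john_rebased = rebase(john_branch, main, fork_point=len(base))
print([c.message for c in john_rebased])
# → ['initial script', 'add plotting', 'fix units']
```

In Git itself, the corresponding command, run from your own branch, is `git rebase main`.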
Etiquette on Version Control
It's important to make sure that your commits never get too big, and that each one fixes or improves a single component of your software. This is because people will often review your commits, and want to understand your changes efficiently, so they know how those changes might impact their own development. If your commits contain 10 different fixes scattered throughout a big code base, then it becomes very difficult to understand what was happening from an external perspective, or from an internal perspective many weeks or months later. Therefore it's recommended to commit small, precise pieces of code.
That said, this isn't always easy to do. You might be trying out a new feature and making many mistakes but wanting to save your progress. You can still do that! You can commit to your heart's content, but before pushing those commits, it's your responsibility to clean them up through interactive rebasing. Interactive rebasing is the process of reordering, dropping, merging (squashing), or rewording commits.
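As a rough illustration of what that cleanup does, the toy function below applies a simplified plan to a list of commit messages. The action names `pick`, `squash`, and `drop` match the ones Git shows when you run `git rebase -i main`, but everything else (the `apply_todo` function, the messages, the way squashed messages are joined) is a made-up simplification.

```python
def apply_todo(todo):
    """Replay a simplified interactive-rebase plan.

    todo is a list of (action, message) pairs, in the order they should be
    replayed; reordering commits just means reordering this list.

    pick   keep the commit as it is
    squash fold the commit into the previously kept commit
    drop   discard the commit
    """
    history = []
    for action, message in todo:
        if action == "pick":
            history.append(message)
        elif action == "squash":
            if not history:
                raise ValueError("nothing to squash into")
            history[-1] = f"{history[-1]} (+ {message})"
        elif action == "drop":
            continue
        else:
            raise ValueError(f"unknown action: {action}")
    return history


messy = [
    ("pick", "add loader for survey data"),
    ("squash", "fix typo in loader"),
    ("drop", "temporary debug prints"),
    ("pick", "add unit test for loader"),
]
print(apply_todo(messy))
# → ['add loader for survey data (+ fix typo in loader)', 'add unit test for loader']
```

Four noisy commits become two that a reviewer can read in one pass.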
Debugging
Beyond keeping your code saved and organized for collaboration, it's also extremely important that you are using your development environment to its fullest potential. Integrated Development Environments, or IDEs, like VS Code offer an enormous amount of functionality for writing and debugging code. Among the most important tools to become familiar with is the debugger. Debuggers allow you to run your scripts line by line, inspecting or even modifying values on the fly. If you have some funky logic or syntax that never seems to work, you can just set a breakpoint before the error gets thrown, and try different versions of the command directly in the debug console until you find the right syntax to accomplish your goal. Exclusively using print statements to debug your code will no longer be tolerated!
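Here is a minimal sketch of where a breakpoint might go, using Python's built-in `breakpoint()`; the `normalize` function is a made-up example. In an IDE you would normally click in the margin next to the line instead of editing the code.

```python
def normalize(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    # Uncomment the next line to pause here and drop into the debugger.
    # At the prompt you can inspect `values` and `total`, or reassign them
    # and keep running to see what happens.
    # breakpoint()
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


print(normalize([1, 1, 2]))
# → [0.25, 0.25, 0.5]
```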
Linters and Formatters
One of the core goals when writing good code is to minimize cognitive load. Writing good code is hard enough as is, so we want to minimize as many distractions as possible. This includes making sure that our code is styled and formatted consistently. If some code is written in CamelCase and some in snake_case, some lines are long and others are short, and tabs and spacing are inconsistent, then you are inadvertently spending brain cycles trying to translate between these many different styles. Formatters and linters take most of that burden off you.
Formatters like Black have very strict rules about how your code should be styled. These formatting styles are specifically designed to minimize cognitive load, even getting down to a default line length of 88 characters (a little over the traditional 80, which Black's authors found keeps files shorter without hurting readability). Everyone who writes code with Black will always have the same style, so when I look at a big repository, my code won't look different from your code. Linters play a complementary role, but are a little less prescriptive than formatters. Rather than rewriting your code, linters flag problems and make recommendations on how to better construct functions, imports, etc. to increase legibility.
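To give a feel for the difference, the sketch below shows a made-up function before and after formatting. The "after" version is written by hand in the style a formatter such as Black aims for, so treat it as an approximation of the tool's output rather than the real thing.

```python
# Before formatting (valid Python, but inconsistent spacing, quoting, and layout):
#
#   def summarize( names,scores ):
#       return {'count':len(names),
#         'best' : max(scores) }

# After formatting: consistent spacing, double quotes, one short line.
def summarize(names, scores):
    return {"count": len(names), "best": max(scores)}


print(summarize(["ana", "bo"], [0.7, 0.9]))
# → {'count': 2, 'best': 0.9}
```

Both versions behave identically; only the second one reads the same no matter who wrote it.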
Remote Development
Another critical element of software engineering is remote development. Our little laptops can only do so much. Some of the machine learning tasks we need to run can require hundreds of cores and industrial-grade GPUs in order to complete in a reasonable amount of time. Remote development allows us to tunnel into other computers via secure shell, or SSH, and use their beefier resources instead. Long ago this used to be a very painful process, where everything had to be done via the terminal; however, modern tools and IDEs like VS Code make remote development nearly as straightforward as local development. In particular, VS Code enables you to SSH directly into the remote computer and still develop within the IDE, so you get far more powerful resources from the comfort of your laptop.
That said, remote development can carry extra complexity in that you now have files split between your own computer (the local machine) and the remote computer. To synchronize across these two systems, you can use tools like rsync, which transfer data between the two machines to keep them in the same state. In addition to file synchronization, remote development can also be difficult when you require visualization and access to graphical user interfaces, or GUIs, as the remote system may not have a monitor to display the visualization. There are two fixes to this problem: 1) save the visualization to a file, or 2) use X11, either by forwarding the display to an X server on your own machine or by running a virtual X server on the remote machine, which tricks that computer into thinking it has a monitor.
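For the file synchronization step, a small Python helper that builds an rsync command might look like the sketch below. The flags are standard rsync options, but the function name, the host `me@gpu-box`, and the paths are placeholders invented for this example. The helper only builds the command; it does not run it.

```python
import shlex


def rsync_command(local_dir, remote, remote_dir, dry_run=True):
    """Build an rsync command that copies local_dir to remote:remote_dir."""
    command = ["rsync", "--archive", "--compress", "--verbose"]
    if dry_run:
        command.append("--dry-run")  # only report what would be transferred
    # Trailing slash: copy the directory's contents, not the directory itself.
    command.append(local_dir.rstrip("/") + "/")
    command.append(f"{remote}:{remote_dir}")
    return command


# "me@gpu-box" and both paths are made-up placeholders.
command = rsync_command("project", "me@gpu-box", "project/")
print(shlex.join(command))
```

Once the dry run lists the files you expect, you could pass `dry_run=False` and hand the list to `subprocess.run(command, check=True)` to perform the transfer.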
X11 Server
Admittedly, I've struggled to get X11 to consistently work, so take this with a grain of salt.
High Performance Computing
Notably, the campus also has a high-performance computing (HPC) center called Zaratan. Development for HPC systems is more complex than remote development because there is a queue. Many groups across campus need to use the supercomputer, so you'll need to learn about SLURM (originally the Simple Linux Utility for Resource Management), the scheduler that manages that queue. You describe a job to SLURM with a bash script that has keywords at the top specifying the exact resources that you need (how many cores, for how long, whether certain jobs have to finish first, etc.), and once submitted, the compute job will be put into a queue and run in the future. Your priority in the queue is determined by a fairly complex formula that balances how much you've used the computer before, how big the expected job is, and the current and projected availability of the resources. In our group, you'll be doing most of your prototyping on a small HPC system specific to our group, but when the jobs grow large enough, you'll have to pump them over to Zaratan.
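As a sketch of what such a script looks like, the Python helper below generates the text of a small batch script. The `#SBATCH` options shown are standard SLURM options, but the helper itself, the job name, and `python train.py` are made up for illustration, and Zaratan may require additional settings (such as an account or partition) that are not shown here.

```python
def sbatch_script(job_name, cpus, hours, command, after_job_id=None):
    """Return the text of a bash batch script with #SBATCH keywords at the top."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",     # how many cores
        f"#SBATCH --time={hours:02d}:00:00",   # wall-clock limit, HH:MM:SS
    ]
    if after_job_id is not None:
        # Start only after the given job has finished successfully.
        lines.append(f"#SBATCH --dependency=afterok:{after_job_id}")
    lines += ["", command, ""]
    return "\n".join(lines)


# Hypothetical job: 16 cores for up to 4 hours.
script = sbatch_script("train-model", cpus=16, hours=4, command="python train.py")
print(script)
```

You would save the printed text to a file such as `train.sh` and submit it to the queue with `sbatch train.sh`.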
Virtual Environments
When developing across many different projects, each with its own dependencies, it's important not to accidentally override any packages or versions. In Python, this risk can be mitigated using virtual environments. Virtual environments effectively let you install many instances of Python, each with its own set of libraries and dependencies. You can toggle between these virtual environments, installing different versions of the same package (or even different versions of Python itself), without the risk of accidentally overriding a dependency for a different project. In general, I'd recommend making a new virtual environment for every project that you work on.
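A minimal sketch of creating an environment with Python's standard `venv` module is shown below. The temporary directory and the `.venv` folder name are assumptions for the example; in practice the environment usually lives inside the project folder.

```python
import tempfile
import venv
from pathlib import Path

# A throwaway location standing in for a real project folder.
project_dir = Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"

# with_pip=False keeps this sketch fast; use with_pip=True if you want to
# install packages into the environment with pip.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg is the file that marks the directory as a virtual environment.
print((env_dir / "pyvenv.cfg").exists())
```

From the command line, the equivalent is `python -m venv .venv`, followed by `source .venv/bin/activate` on Linux or macOS to switch into the environment.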
Containerization and Docker
In this same vein, sometimes projects are more complex than just which Python packages to install. Projects that require system-level libraries, or even specific operating systems, can be very difficult to port to new computers or share with collaborators. To eliminate this friction, Docker packages a very lightweight version of the operating system and the necessary dependencies into a container that can be run on your own computer. For example, if there is a complex piece of software that you want to use that only runs on Linux, but you have a Mac, then Docker allows you to simply download the container, and suddenly you can run the entire software as if your computer were running Linux. Conveniently, as with remote development, VS Code also supports development within Docker containers. The extension is very similar to that of remote development, and you can effectively think of a Docker container as a remote machine that just lives on your computer and that you're SSHing into.