---
title: "Git and GitHub for Beginners"
subtitle: "A Brief Walkthrough"
author: "Scott Claessens"
date: "`r Sys.Date()`"
format:
  html:
    embed-resources: true
toc: true
---

This document accompanies the workshop on "Git and GitHub for Beginners". We will walk through a basic Git workflow using RStudio.

This walkthrough assumes that students already have some experience with the R programming language and RStudio.

After this walkthrough, students will have experience with using Git for version control and pushing changes to GitHub.

# Getting started

## Installation

Before starting this walkthrough, you will need to ensure that you have **R**, **RStudio**, and **Git** installed on your machine. You will also need a **GitHub** account.

-   To install the latest version of R, see here: <https://cran.r-project.org/> (Windows users can go directly to <https://cran.r-project.org/bin/windows/base/>)

-   To install RStudio, see here: <https://posit.co/download/rstudio-desktop/>

-   To install Git, see here: <https://git-scm.com/downloads>

-   To sign up for a free GitHub account, see here: <https://github.com/signup>

If you are using a university computer, you may need to install some of this software from your university's list of approved applications (e.g., Software Center).

Once you have finished these installations, you should check to see whether Git is installed properly. Open your Command Prompt (Windows) or Terminal (Mac) and run the following command:

`git --version`

If Git is installed correctly, the output will tell you which version of Git you are using.

## Configuration

Once Git is correctly installed, you will need to tell it who you are, so that it can properly label the version history for your projects. To do this, open your Command Prompt (Windows) or Terminal (Mac) and run the following commands, replacing the name and email address with your own:

`git config --global user.name "Scott Claessens"`

`git config --global user.email "scott.claessens@gmail.com"`
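
To confirm that these settings were saved, you can ask Git to print them back. The following is a minimal sketch that only reads your configuration and changes nothing. The fallback text `<not set>` is an illustrative placeholder, not something Git itself prints.

```bash
# Read back the identity that Git will attach to your commits.
# `|| true` keeps the script going if a value has not been set yet.
name="$(git config --global --get user.name || true)"
email="$(git config --global --get user.email || true)"

echo "user.name:  ${name:-<not set>}"
echo "user.email: ${email:-<not set>}"
```

If either line shows `<not set>`, re-run the corresponding `git config` command above.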

# Introduction

Scientific workflows are increasingly computational. In a standard workflow, we collect our data, plot the data, analyse the data, and draw inferences from our work. For scientific credibility, it is important that this whole process is reproducible – in other words, that someone else could run the same analysis pipeline and get the same results. Unfortunately, many researchers fail to make their data and analysis code publicly available, hindering reproducibility.

This is where Git comes in. Git is a version control tool that allows researchers to track the history of their projects. GitHub is an online hosting service for Git repositories, where researchers can publicly share and document their data and code.

In this walkthrough, we are going to play the role of anthropologists studying the !Kung San people in the Kalahari desert. We have just returned from our fieldwork, where we conducted a census of 352 adults. We are interested in the relationship between the heights and weights of people in this population.

We are going to set about analysing these data, using Git to keep track of our project.

# Creating a new project in RStudio

First, we need to create a new project in RStudio. We will use RStudio as it has some handy capabilities for Git – doing everything in the command line can be daunting for new users!

To create a new project, open RStudio and go to:

*File -\> New Project...*

You will see the following dialogue box:

![](images/clipboard-581556262.png)

Select the "*New Directory*" option, and then choose "*New Project*". This will create a new directory (folder) on your machine. Name the directory "*my_first_project*" and choose somewhere sensible to save it. Finally, tick the box "*Create a Git repository*" and click "*Create Project*".

![](images/clipboard-1667386865.png)

Voilà! The new directory should now appear on your machine. Let's check it out.

![](images/clipboard-4140519357.png)

Several files and folders have been initialised in the directory. What are these?

-   **.git/** is the Git repository for your project. It is a hidden folder, because you're not supposed to edit anything inside it by hand. It contains information and metadata about the entire history of your project. RStudio created this by essentially running `git init` for you. Don't delete this folder, unless you want to lose the history of your project!

-   **.Rproj.user/** is another hidden folder that contains information and metadata for your R project. Again, you shouldn't edit this folder by hand.

-   **my_first_project.Rproj** is a shortcut file to your R project. Double-clicking this will open your project within RStudio.

-   **.gitignore** is a file that tells Git which files it should ignore when tracking changes and pushing to online repositories. You can open it in a text editor if you're curious. RStudio has created it with some defaults. We won't cover ignoring files in this walkthrough, but if you wanted to declare files for Git to ignore (e.g., very large data files, files with personally identifying information), you could list them here.
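
To give a feel for how a `.gitignore` file behaves, here is a minimal sketch that you can run in a terminal. It works in a throwaway directory, so your real project is untouched. The file names and patterns (`private_notes.txt`, `*.rds`, `analysis.R`) are hypothetical examples, not the defaults that RStudio writes.

```bash
# Work in a throwaway directory so your real project is untouched
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet

# Hypothetical patterns: ignore one private file and every .rds file
printf 'private_notes.txt\n*.rds\n' > .gitignore

# Create one file that should be ignored and one that should not
touch private_notes.txt analysis.R

# List untracked files in short format; ignored files are left out
git status --short

# Ask Git directly whether a file is ignored (prints the path if it is)
git check-ignore private_notes.txt
```

The status listing should mention `.gitignore` and `analysis.R` but not `private_notes.txt`, because the first pattern tells Git to leave that file alone.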

If you go back to RStudio, it will have switched to this new project. The "*Files*" pane shows you the existing files in the directory. Our setup is complete!

# Your first commit

We have created our new R project, which points to a directory on our local machine. But we haven't really started working with Git yet. Let's do that now.

In RStudio, click into the "*Git*" tab. You should see the following:

![](images/clipboard-5717948.png)

This tab gives us a list of files that have changed since our last commit – for us, since we have no commits yet, it lists all the files. In order to commit these files, we need to first stage them by ticking the "*Staged*" box for both files.

![](images/clipboard-79734866.png)

This is the equivalent of running the following in the command line:

`git add .gitignore`

`git add my_first_project.Rproj`

The green "A" icon indicates that these files have been successfully added to the staging area and are now ready to be committed.

Click the "Commit" button. You will see a dialogue box listing the staged files. The bottom of the dialogue box shows the "diff" – that is, the changes in the files since the last commit. Because we haven't made a commit yet, it highlights all the lines in these files.

The right-hand side of the dialogue box allows us to write a message for the commit. Usually, we would like to make these commit messages informative and descriptive. Since this is our first commit, we will just write "Initial commit":

![](images/clipboard-4115429810.png)

Click "Commit" – and then celebrate your first commit with Git! 🎉

Note that this process is the equivalent of running `git commit -m "Initial commit"` in the command line. RStudio just ran it for us!

To make sure that this commit worked as expected, you can click the "Diff" button in the Git tab. If you switch to the "History" tab in the dialogue box, you will see the commit that we just made:

![](images/clipboard-669716510.png)
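
For reference, the whole stage-commit-inspect cycle can also be done from the command line. The sketch below repeats it in a throwaway directory rather than in your real project. The empty files are stand-ins for the ones RStudio created, and the name and email address are hypothetical values that apply to this sandbox only.

```bash
# Throwaway repository standing in for my_first_project
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet

# Identity for this sandbox only (hypothetical values)
git config user.name "Example Student"
git config user.email "student@example.com"

# Empty stand-ins for the two files that RStudio created
touch .gitignore my_first_project.Rproj

# Stage both files, then commit them with a message
git add .gitignore my_first_project.Rproj
git commit --quiet -m "Initial commit"

# Show the project history, one line per commit
git log --oneline
```

The last command prints a short commit identifier followed by the commit message, which is the same information shown in RStudio's "History" tab.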

# Adding the data file

Now that we have finished our setup, we can start building our project. Recall that we are playing the role of anthropologists studying the heights and weights of the !Kung San people in the Kalahari desert. In our fieldwork, we collected these data for 352 adults.

You should download the `kung_san_data.csv` dataset from here: <https://github.com/ScottClaessens/github_workshop>

The dataset has four columns:

-   **height** - height in centimetres

-   **weight** - weight in kilograms

-   **age** - age in years

-   **male** - gender, female (0) or male (1)

Once you have downloaded the dataset, you should create a new directory (folder) called "*data*" within your project directory and copy the data file into it. Separate directories for our data, code, plots, etc. will help keep the project tidily organised, both for ourselves and for other users.

As an aside, it is worth noting that Git was not designed for large data files. This .csv file is okay, as it is only a few kilobytes in size. But if you are working with very large datasets, you might want to consider using software like Git LFS: <https://git-lfs.com/>

Back to our dataset. Since we are planning to eventually push this dataset up onto GitHub for the whole world to see, it is important that we clearly document the data file. If someone finds the dataset later on, they will need to know what the different columns represent. This can be done in a few ways, such as with metadata files and data dictionaries. In this case, we will create a README file in the "data" directory.

To do this, go to:

*File -\> New File -\> Text File*

Name the file `README.md` and save it in the "data" directory. This is a Markdown file that will render well on GitHub.

Copy the following text into the README file:

``` {style="background-color: #ededed;"}
# Information about the dataset

Data on 352 !Kung San adults.

Columns:
- **height** - height in centimetres
- **weight** - weight in kilograms
- **age** - age in years
- **male** - gender, female (0) or male (1)
```

If you would like, you can preview what this file will look like on GitHub by clicking the "Preview" button.

Once you save this file, you should have two files in your new *data* directory – the .csv dataset and the README file. Your whole directory should have the following structure:

``` {style="background-color: #ededed;"}
├── .gitignore
├── data
│   ├── kung_san_data.csv
│   └── README.md
└── my_first_project.Rproj
```

Following the same workflow as before, stage these files and commit them with the following message: "*Add data file*"

![](images/clipboard-1849699933.png)

You will see that another commit has been added to your project history.

![](images/clipboard-2048356255.png)

# Pushing to GitHub

We have added the dataset to our repository, but we haven't yet written any analysis code. Before we do that, though, we will connect our local repository up with GitHub.

The first step is to create a new repository on GitHub. To do this, go to your GitHub account main page and click on the + button in the top right-hand corner. Select "*New repository*".

![](images/clipboard-2563427591.png)

Name the new repository "*my_first_project*" and click "Create repository", leaving all the defaults as they are.

When the repository is created on GitHub, it will be empty and you will see a bunch of (confusing!) suggestions. Ignore them for now and head back to RStudio.

What we need to do now is to link up our local repository with the remote repository that we just created on GitHub, so that Git knows where to push the changes. To do that, first open the Terminal in RStudio by clicking on the gear button in the Git tab and selecting "*New Terminal*":

![](images/clipboard-3026623573.png)

This will open a fresh terminal (command prompt) in your RStudio console. Then type the following into the terminal, replacing the URL with that of your own remote repository:

`git remote add origin https://github.com/ScottClaessens/my_first_project.git`

`git push origin master`

These commands tell Git (1) where to link up to the remote repository, and (2) to push the master branch (our only branch) to the remote repository.
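
The push command assumes that your local branch is called `master`. Depending on how Git is configured on your machine, the branch may be called `main` instead, in which case the push will fail. The sketch below shows how to check the remote and the branch name, using a throwaway directory. The URL is a placeholder with a hypothetical username, and no connection to GitHub is made.

```bash
# Throwaway repository so your real project is untouched
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet

# Register a remote called "origin" (placeholder URL; nothing is contacted)
git remote add origin https://github.com/your-username/my_first_project.git

# List the configured remotes and their URLs
git remote -v

# Print the name of the current branch (requires Git 2.22 or later)
git branch --show-current
```

In your real project you only need the last two commands. If the branch is reported as `main` rather than `master`, use that name when pushing: `git push origin main`.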

Once those commands have finished, you can refresh the page for your GitHub repository. Lo and behold – your local changes will magically appear! ✨

![](images/clipboard-2234609560.png)

We can see all the files, and even click into the data folder to see the README that we created.

![](images/clipboard-2112177838.png)

This remote repository now exists alongside our local repository. At regular intervals, we should push our changes up to GitHub so that the remote repository stays up-to-date.

# Adding the analysis code

Now that we have added our data file and documented the data clearly online, we would like to plot and analyse these data to determine whether there is a relationship between height and weight in this sample of !Kung San adults.

Back in RStudio, create two new folders in your project repository – one called "*R*" and another called "*plots*". This is where our analysis code and exported plots will live.

Then create four R scripts and save them inside the new R folder:

-   `load_packages.R`
-   `load_data.R`
-   `fit_model.R`
-   `plot_results.R`

Note how we have laid out the steps of our research pipeline in separate R scripts. Splitting things up in this way might seem silly in our simple example. In general, though, this is a good habit to get into. Modularising code has huge benefits for readability and maintenance.

Copy the following R code into the corresponding R script:

```{r eval=FALSE}
# File: load_packages.R
# Purpose: Load R packages for the analysis

library(tidyverse)

```

```{r eval=FALSE}
# File: load_data.R
# Purpose: Load the !Kung San census data

data <-
  read_csv(
    file = "data/kung_san_data.csv",
    show_col_types = FALSE
    )

```

```{r eval=FALSE}
# File: fit_model.R
# Purpose: Fit linear regression model to the data

model <-
  lm(
    formula = height ~ scale(weight), # standardise weight predictor
    data = data
  )

print(summary(model))

```

```{r eval=FALSE}
# File: plot_results.R
# Purpose: Plot relationship between height and weight

plot <-
  ggplot(
    data = data,
    mapping = aes(
      x = weight,
      y = height
    )
  ) +
  geom_point() +
  # plot regression line
  stat_smooth(
    method = "lm",
    formula =  y ~ x
    ) +
  theme_classic()

print(plot)

# save plot to plots/ directory
ggsave(
  filename = "plots/results.pdf",
  plot = plot,
  height = 3,
  width = 3
)

```

Note how each R script is well-documented, with a header comment explaining its purpose and other comments throughout to explain the meaning of particular lines or functions.

Once you have created these files, create one last R file called `run.R` and save it in the root of your project directory. Copy in the following R code:

```{r eval=FALSE}
# run the pipeline
source("R/load_packages.R")
source("R/load_data.R")
source("R/fit_model.R")
source("R/plot_results.R")

```

This is the master file that will run the analysis scripts in the correct order. Notice that the file paths are not *absolute* paths to locations on my local machine, but *relative* paths from the project root. This is much better practice than using `setwd()` in your R scripts! If you were to hard-code absolute paths on your own machine instead, other users would not be able to run the code on theirs.

Once you have finished all of that, your project directory should have the following structure:

``` {style="background-color: #ededed;"}
├── .gitignore
├── data
│   ├── kung_san_data.csv
│   └── README.md
├── my_first_project.Rproj
├── plots
├── R
│   ├── fit_model.R
│   ├── load_data.R
│   ├── load_packages.R
│   └── plot_results.R
└── run.R
```

Phew – now all that is done, feel free to kick back and run the pipeline by sourcing `run.R`. You will see from the regression results and the exported scatterplot that there is a positive relationship between height and weight in this sample of !Kung San adults.

But we're not quite done yet – we need to commit these changes with Git and push them to GitHub. Since you're quickly becoming a pro at committing changes with RStudio's point-and-click interface, why don't we try this time with the command line?

Open up a fresh terminal within RStudio, as you have done before, and type the following commands:

`git add .`

`git commit -m "Add analysis code"`

`git push origin master`

The first command uses the full stop to stage *all* new and modified files in the current directory and its subdirectories, rather than declaring them one-by-one. The second command commits the staged changes with a descriptive message. The third command pushes the changes to GitHub.
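
Because `git add .` stages everything at once, it is worth checking what will be included before you commit. Here is a minimal sketch in a throwaway directory, using empty stand-ins for a few of the analysis files:

```bash
# Throwaway repository so your real project is untouched
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet

# Empty stand-ins for some of the new analysis files
mkdir R
touch R/load_data.R R/fit_model.R run.R

# Before staging: new files and folders are listed as untracked ("??")
git status --short

# Stage everything in the current directory and its subdirectories
git add .

# After staging: each file is marked as added ("A")
git status --short
```

In your real project, running `git status --short` before and after `git add .` gives you the same view as the file list in RStudio's Git tab.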

If you go back to your remote GitHub repository, you will find that it has been updated to reflect our new changes.

# Adding a project README

We are finally finished with our analyses! That's great.

Currently, though, our remote repository is not in a great state. Yes, we have commented our code and documented our dataset. But if someone stumbled upon this repository on the internet, they wouldn't necessarily know what the purpose of the project was or how they can reproduce the analyses themselves. To help with this, we will add an informative README to the landing page of our project repository.

To create this file, go to:

*File -\> New File -\> Text File*

Name the file `README.md` and save it at the root of your project directory. Copy the following text into the README file, replacing the name and contact information with your own:

```` {style="background-color: #ededed;"}
# My First Project with Git and GitHub

This repository contains the data and analysis code for the project on the
relationship between height and weight in the !Kung San.

## Getting started

### Installation

To run this code, you will need to [install R](https://www.r-project.org/) and
the following R packages:

```r
install.packages("tidyverse")
```

### Executing the code

To run the code, clone this repository to your local machine and set your
working directory to this repository. In the console, run:

```r
source("run.R")
```

The code will fit a regression model to the data and export a scatterplot of
heights and weights. The code should take no longer than a minute to run on a
standard laptop.

## Help

Any issues, please email scott.claessens@gmail.com.

## Authors

Scott Claessens, scott.claessens@gmail.com
````

Just like our data README file, this is a Markdown file that will render well on GitHub. Notice that this file:

1.  Gives the repository a clear title and describes the broad purpose of the project.
2.  Tells users what software they will need to install to be able to run the code.
3.  Tells users what the code does and how long (roughly) it will take to run.
4.  Gives clear information about who authored the code and who to contact if any issues arise.

This will be a much better landing page for our GitHub repository. Let's commit this file and push to GitHub.

Open the Terminal and run the following commands:

`git add README.md`

`git commit -m "Add project README"`

`git push origin master`

If you now go back to your remote GitHub repository, you will see that a nice-looking README has been added to the landing page. You might also want to add an informative description of the repository under the "About" heading on the right-hand side:

![](images/clipboard-2495465091.png)

# Further reading

Congratulations! You have successfully used Git and GitHub to build a project repository from the ground up. 🎉

This workshop has hopefully taught you the following skills:

-   How to use Git for version control, using both RStudio and the command line

-   How to push your Git repositories up to GitHub

-   How to clearly organise data analysis projects and modularise your code

-   How to clearly document your projects with README files and comments

This workshop has only scratched the surface of what is possible with Git. There are many topics we haven't covered, including ignoring files, reverting to previous states of the repository, creating branches, and collaborating with others.

If you're interested in learning more about Git and GitHub, I recommend the following resources:

-   Git Tutorial for Beginners: Learn Git in 1 Hour (<https://www.youtube.com/watch?v=8JJ101D3knE>)

-   Git & GitHub Tutorial for Scientists: It’s Not Only for Programmers (<https://gitbookdown.dallasdatascience.com/>)

I hope that you can apply these tools in your scientific workflow!