rm(list = ls()) # clean-up workspace

Announcement

Reproducible Research

Definition

In computational sciences this means:

the data and code used to make a finding are available and they are sufficient for an independent researcher to recreate the finding.

Christopher Gandrud

Why

  • Science is about accumulating knowledge and reproducibility is the foundation.

  • Greater research impact. Note: “a study can be reproducible and still be wrong.”

  • Better work habits

    • avoid thinking ‘no one was looking’

    • communicate with future-you

  • Better teamwork

    • communicate with another person. For example, your instructor with your HWs.

Tools for reproducible research

  • Version control: Git+GitHub.

  • Distribute method implementation, e.g., R/Python/Julia packages, on GitHub or bitbucket.

  • Dynamic document: RMarkdown for R or Jupyter notebook for Julia/Python/R.

  • Docker container for reproducing a computing environment.

  • Cloud computing tools.

  • We are going to practice reproducible research now. That is to make your homework reproducible using Git, GitHub, and RMarkdown.


Collaborative research

Data scientists and statisticians, as opposed to closet mathematicians, rarely do things in vacuum.

In every project you have at least one other collaborator, future-you. You don’t want future-you to curse past-you.

Hadley Wickham

Why version control?

  • A centralized repository helps coordinate multi-person projects.

  • Time machine. Keep track of all the changes and revert back easily (reproducible).

  • Storage efficiency.

  • Synchronize files across multiple computers and platforms.

  • GitHub is becoming a de facto central repository for open source development. Hadley Wickham also recommends Git/GitHub as the best practices for R package development.

  • Advertise yourself through GitHub (e.g., a tutorial).

  • AmStat article: What Is Data Science Specialization, and What Can It Do for You?

What should an employer look for when they see a certification on a résumé?

For our program, and likely data science in general, they should look at the applicant’s GitHub page. They should see interesting project and code contributions.

Git

We use Git in this course.

I’m an egotistical bastard, and I name all my projects after myself. First ‘Linux’, now ‘git’.

Linus Torvalds

What do I need to use Git?

  • A Git server enabling multi-person collaboration through a centralized repository.

    • github.com: unlimited public repositories, private repositories costs $, but unlimited private repositories for free from Student Developer Pack.

    • bitbucket.org: unlimited public repositories, unlimited private repositories for academic account (register for free using your edu email).

    • We use github.com in this course for developing and submitting homework.


  • A Git client on your own machine.

    • Linux: Git client program is shipped with many Linux distributions, e.g., Ubuntu and CentOS. If not, install using a package manager, e.g., yum install git on CentOS.

    • Mac: Most versions of MacOS already have Git installed. Follow instructions at https://www.atlassian.com/git/tutorials/install-git if you need to install it yourself.

    • Windows: Git for Windows (GUI) aka Git Bash.

  • RStudio supports Git.

  • Do not totally rely on GUI or IDE. Learn to use Git on command line, which is needed for cluster and cloud computing.

Git workflow

Git survival commands

  • Synchronize local Git directory with remote repository:

    git pull

    same as git fetch plus git merge.

  • Modify files in local working directory.

  • Add snapshots to staging area:

    git add FILES
  • Commit: store snapshots permanently to (local) Git repository

    git commit -m "MESSAGE"
  • Push commits to remote repository:

    git push

Git basic usage

  • Register for an account on a Git server, e.g., github.com.

  • (Optional) Upload your SSH public key to the server.

  • Identify yourself at local machine, e.g.,

    git config --global user.name "Your Name"
    git config --global user.email "your_email@tulane.edu"

    Name and email appear in each commit you make.

  • Initialize a project.

    • Create a repository, e.g., tulane-math-7360-2023 on GitHub.

    • Clone the repository to your local machine:

    git clone https://github.com/tulane-math-7360-2023/Math-7360-Xiang-Ji.git
  • Working with your local copy.

    • git pull: update local Git repository with remote repository (fetch + merge).

    • git log FILENAME: display the current status of working directory.

    • git diff: show differences (by default difference from the most recent commit).

    • git add file1 file2 ...: add file(s) to the staging area.

    • git commit: commit changes in staging area to Git directory.

    • git push: publish commits in local Git repository to remote repository.

    • git reset --soft HEAD~1: undo the last commit.

    • git checkout FILENAME: go back to the last commit, discarding all changes made.

    • git rm FILENAME: remove files from git control.

Branching in Git


  • For this course, you need to have two branches:

    • develop for your own development.

    • master for releases, i.e., homework submission.

    • Note master is the default branch when you initialize the project; create and switch to develop branch immediately after project initialization.


  • Commonly used commands:

    • git branch branchname: create a branch.

    • git branch: show all project branches.

    • git checkout branchname: switch to a branch.

    • git tag: show tags (major landmarks).

    • git tag tagname: create a tag.

Sample session | Getting started with homework

On GitHub.com:

  • Obtain Student Developer Pack. (Should be done by now)

  • Create a private repository Math-7360-FirstName-LastName on tulane-math-7360-2023.

On your local machine:

  • Clone the repository, create develop branch, where you work on solutions.

    # clone the project
    git clone https://github.com/tulane-math-7360-2023/tulane-math-7360-2023.github.io.git
    # enter project folder
    cd tulane-math-7360-2023
    # what branches are there?
    git branch
    # create develop branch
    git branch develop
    # switch to the develop branch
    git checkout develop
    # create folder for HW1
    mkdir hw1
    cd hw1
    # let's write solutions
    echo "sample solution" > hw1.Rmd
    echo "some bug" >> hw1.Rmd
    # commit the code
    git add hw1.Rmd
    git commit -m "start working on problem #1"
    # push to remote repo
    git push
  • Submit and tag HW1 solution to master branch.

    # which branch are we in?
    git branch
    # change to the master branch
    git checkout master
    # merge develop branch to master branch
    # git pull origin develop 
    git merge develop
    # push to the remote master branch
    git push
    # tag version hw1
    git tag hw1
    git push --tags
  • RStudio has good Git integration. But practice command line operations also.

Etiquettes of using Git

  • Be judicious what to put in repository.

    • Not too less: Make sure collaborators or yourself can reproduce everything on other machines.

    • Not too much: No need to put all intermediate files in repository. Make good use of the .gitignore file.

    • Strictly version control system is for source files only. E.g. only xxx.Rmd, xxx.bib, and figure files are necessary to produce a pdf file. Pdf file doesn’t need to be version controlled or, if version controlled, doesn’t need to be frequently committed.

    • However, for your homework, do commit an HTML file for me.

  • Commit early, commit often and don’t spare the horses.

  • Adding an informative message when you commit is not optional. Spending one minute on commit message saves hours later for your collaborators and yourself. Read the following sentence to yourself 3 times:

Write every commit message like the next person who reads it is an axe-wielding maniac who knows where you live.