Lecture Videos
Note: The code-through tutorials in this course demonstrate connecting GitHub Desktop with RStudio on a local computer. Some students in the past have inquired about options to use Posit/RStudio Cloud. While I am unable to provide support for this set-up, there are some tutorials available online that may be of assistance such as the following:
- RStudio Cloud & Github Repository Connection Tutorial:
  - Video: https://www.youtube.com/watch?v=w6fivjMGZVo
  - Accompanying Guide: https://rverse-tutorials.github.io/RWorkflow-NWFSC-2021/set-up.html
- RStudio Cloud & Github Repository Connection Additional Guide: https://warin.ca/dpr/git.html
Relative File Paths Alternative Method
As you will learn in the following segments of this lab, we will be using relative file paths extensively in this course to ensure that our code is reproducible. In the code-through video above and throughout the tutorials for the course, the best-practice scenario of working within your team’s RStudio Project will be presented.
However, if for any reason you find that you are unable to knit your files, access graphing/mapping features, etc. within the RStudio project, you can use a handy function from the here package called i_am() to automatically identify the location of your .RMD file outside of the RStudio project, so that you can use your local installation of R and still use relative file paths for the rest of the lab.
Watch the video below for details on how to utilize this handy feature:
A copy-enabled snippet of the syntax for here::i_am() displayed in the video.
Note that the relative file path starts after the name of your team’s project folder.
- here::i_am() function documentation: https://here.r-lib.org/articles/here.html#declare-the-location-of-the-current-script-1
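The idea can be sketched as follows. This is an illustrative example rather than the exact snippet from the video: the project folder name (team-project) is hypothetical, the lab file follows the labs/wk01 layout used in this course, and a throwaway copy of that layout is created in a temporary folder so the sketch can run on any machine with the here package installed.

```r
# Build a throwaway project folder so this sketch runs anywhere.
# "team-project" is a hypothetical stand-in for your team's repository folder.
project_dir <- file.path(tempdir(), "team-project")
dir.create(file.path(project_dir, "labs", "wk01"),
           recursive = TRUE, showWarnings = FALSE)
rmd_file <- file.path(project_dir, "labs", "wk01", "lab01_sparky.rmd")
file.create(rmd_file)

# Pretend R was started from the folder holding the .RMD file,
# i.e. outside of the RStudio Project session.
old_wd <- setwd(dirname(rmd_file))

# Declare where this file sits relative to the project root.
# Note that the path starts AFTER the project folder name.
here::i_am("labs/wk01/lab01_sparky.rmd")

# here::here() now resolves paths from the project root.
raw_dir <- here::here("data", "raw")
print(raw_dir)

# Restore the original working directory.
setwd(old_wd)
```

In your own lab file only the here::i_am() line is needed, placed near the top of the .RMD and pointing at that file's own location inside the project.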
Introduction
Welcome to PAF 515 Data Science III Project Management (formerly known as CPP 528)!
In this course we will be exploring tools and strategies that will allow you and your groupmates to successfully create and manage a reproducible data science project from start to finish. Specifically, you will be conducting an evaluation of whether two popular tax credit programs (Low Income Housing Tax Credit Program and New Markets Tax Credit Program) have been successful in reducing social vulnerability in the United States.
In order to accomplish this, you and your team members will be assigned to a shared GitHub repository where you will get to practice working in a shared data space and using Git for version control, maintaining a Kanban board for task management, and finally presenting a combined report on a GitHub Pages website.
Each of the labs throughout the course will contribute to an element of the final project GitHub Pages website which will have the following sections:
- Executive Summary of entire project (4-6 sentences for each section) (Lab 07)
  - Overview (Details on Tax Credit Programs, purpose of the evaluation, and hypothesis)
  - Data (Description of data sources used in project)
  - Methods (Description of statistical and data analytics methods used throughout the course)
  - Results (Findings on whether the tax credits were effective overall)
- Division-specific reports
  - Overview of specific social vulnerabilities in division (Lab 02)
  - Infographics and Maps of socially vulnerable areas in division (Lab 03)
  - Visualization of distribution of tax credits in division and correlation to social vulnerability (Lab 04)
  - Measuring intervention and evaluating effectiveness of tax credits with diff-in-diff models (Lab 05)
  - Summary of results/findings (Lab 06)
- Results and Conclusion (Lab 07)
  - Detailed summary of findings on the effectiveness of tax credit programs across divisions
  - Detailed summary of changes in social vulnerability over time across divisions
  - Conclusion/suggestions for future research
- Team About Us Page
  - Brief bios of each team member
- References
  - List of all articles/resources cited throughout the report. Can be in any citation style.
As outlined above, your team will collaboratively create an overall summary of the tax credit programs, and each individual team member will be responsible for contributing an analysis of a U.S. Census division of their choice from the following list:
- New England Division (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont)
- East North Central Division (Indiana, Illinois, Michigan, Ohio, Wisconsin)
- West North Central Division (Iowa, Nebraska, Kansas, North Dakota, Minnesota, South Dakota, Missouri)
- South Atlantic Division (Delaware, District of Columbia, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, West Virginia)
- East South Central Division (Alabama, Kentucky, Mississippi, Tennessee)
- West South Central Division (Arkansas, Louisiana, Oklahoma, Texas)
- Mountain Division (Arizona, Colorado, Idaho, New Mexico, Montana, Utah, Nevada, Wyoming)
- Pacific Division (Alaska, California, Hawaii, Oregon, Washington)
Note: The United States Census Bureau has defined a total of 9 divisions within the United States. However, since all the tutorials throughout the course will use the Middle Atlantic Division (Pennsylvania, New Jersey, New York) for examples, it has been excluded from the list and students cannot select it. All other divisions are available, with the limitation that each member of a group must select a unique division within their group. Each group will receive a link to a Google Sheet where students can declare their division selection to ensure there are no duplicates. If you have any questions, please contact the instructor. Thank you!
To further illustrate this process, let’s pretend that we have a PAF 515 Team with four members (Minnie, Mickey, Nala, and Sparky):
Minnie selects the South Atlantic Division
Mickey selects the Pacific Division
Nala selects the West South Central Division
Sparky selects the Mountain Division
Thus their report will contain the following:
- Executive Summary of entire project completed as a group (Lab 07)
  - Overview (Details on Tax Credit Programs, purpose of the evaluation, and hypothesis)
  - Data (Description of data sources used in project)
  - Methods (Description of statistical and data analytics methods used throughout the course)
  - Results (Findings on whether the tax credits were effective overall)
- Division-specific reports:
  - South Atlantic Division Report by Minnie
    - Overview of specific social vulnerabilities in division (Lab 02)
    - Infographics and Maps of socially vulnerable areas in division (Lab 03)
    - Visualization of distribution of tax credits in division and correlation to social vulnerability (Lab 04)
    - Measuring intervention and evaluating effectiveness of tax credits with diff-in-diff models (Lab 05)
    - Summary of results/findings (Lab 06)
  - Pacific Division Report by Mickey
    - Overview of specific social vulnerabilities in division (Lab 02)
    - Infographics and Maps of socially vulnerable areas in division (Lab 03)
    - Visualization of distribution of tax credits in division and correlation to social vulnerability (Lab 04)
    - Measuring intervention and evaluating effectiveness of tax credits with diff-in-diff models (Lab 05)
    - Summary of results/findings (Lab 06)
  - West South Central Division Report by Nala
    - Overview of specific social vulnerabilities in division (Lab 02)
    - Infographics and Maps of socially vulnerable areas in division (Lab 03)
    - Visualization of distribution of tax credits in division and correlation to social vulnerability (Lab 04)
    - Measuring intervention and evaluating effectiveness of tax credits with diff-in-diff models (Lab 05)
    - Summary of results/findings (Lab 06)
  - Mountain Division Report by Sparky
    - Overview of specific social vulnerabilities in division (Lab 02)
    - Infographics and Maps of socially vulnerable areas in division (Lab 03)
    - Visualization of distribution of tax credits in division and correlation to social vulnerability (Lab 04)
    - Measuring intervention and evaluating effectiveness of tax credits with diff-in-diff models (Lab 05)
    - Summary of results/findings (Lab 06)
- Results and Conclusion (Lab 07)
  - Detailed summary of findings on the effectiveness of tax credit programs across divisions
  - Detailed summary of changes in social vulnerability over time across divisions
  - Conclusion/suggestions for future research
- References (Lab 07)
The executive summary and results and conclusion sections summarizing the entire study will be completed by Lab 07 and the division-specific sections of the report will be completed in Labs 02-06 (and simply edited by Lab 07 for any necessary corrections).
Project Section Descriptions
1) Executive Summary:
The executive summary will be a written narrative where you describe the research question for this project (e.g., Do governmental programs (NMTC and LIHTC) have an impact on social vulnerability in a community?), the data used (US Census Data from the American Community Survey, the CDC’s SVI methodology guidelines, NMTC/LIHTC eligibility and credit distribution data), the methodologies used (correlation calculations, diff-in-diff regression models, etc.), and the results (Did the tax credits have an impact nationally or in all of your team’s divisions? Some of your team’s divisions? None of them? Did one program work better than the other?).
2) Division-Specific Reports:
In these sections you will present the labs you create each week and provide a detailed analysis of your specific division.
3) Results and Conclusion:
Similar to the executive summary, give an overarching written summary of the outcomes of your report. Compare and contrast your team’s divisions and give a conclusion of your findings. You can also provide suggestions for changes you would make in a future study or other factors you would like to explore.
Data Science Project Management Strategies
Before we dive into working on our project, for Lab 01 we will get our project environment set up and study the theory behind the data science project management tools/strategies that we will utilize over the next few weeks:
- Importance of Reproducibility for Data Science Project Management
- GitHub Desktop/Git and RStudio
- Kanban Boards
- Using .R files for importing reusable functions/variables
- Census API Pulls
By the end of this lab you will be able to describe reproducibility in a data science context, connect GitHub Desktop with RStudio, set up a Kanban Board to manage your project, import reusable variables from a .R file, and pull some preliminary data on your division of interest from the U.S. Census Bureau’s API.
Reproducibility
As declared in the Stanford Psychology Guide to Doing Open Science, “The primary goal of reproducible data analysis is to ensure computational reproducibility — that is, the ability of another researcher to use one’s code and data to independently obtain identical results.”
Throughout the PEDA program we’ve discussed several advantages of using open source analysis tools like R and Python as data scientists, analysts, and evaluators. Arguably the most important of these advantages is the ability to readily collaborate and contribute to others’ work in ways that are not possible in proprietary coding environments. However, this ability to collaborate/share data and information is only possible if we set up our data projects in a way that ensures that other analysts can readily run our exact code and reproduce our results with very minimal (or ideally zero) changes needed.
Thus, it is important that we have a good understanding of reproducibility and how to create a reproducible environment for our projects.
The following videos from the Duke Center for Data and Visualization Services provide a good overview of these concepts:
Throughout this course we will explore an entire toolbox of packages and strategies that assist us in implementing these Tenets of Reproducibility. One of the most vital of these tools/strategies is the here package in R, which allows us to easily implement relative file paths in our project and works seamlessly with importing data and functions from a .R file for reproducibility purposes.
This will allow you to format your final projects in a way that anyone who accesses your repository can reproduce your results and run your code on their machine without having to make any changes to your code.
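As a rough sketch of what this looks like in practice, the example below builds paths with here::here() instead of hard-coding a location on one person's computer. The file names are taken from the repository structure described later in this lab; whether those files are present on a given machine is an assumption, so the sketch only reads or sources them if they exist.

```r
library(here)

# Build paths from the project root rather than from one user's computer
# (e.g. avoid something like "C:/Users/minnie/Documents/...").
svi_2010_path <- here::here("data", "raw", "Census_Data_SVI", "svi_2010_trt10.rds")
steps_path    <- here::here("analysis", "project_data_steps_minnie.R")

# Only load the data and functions if the files are actually present,
# so this sketch does not fail outside of the team repository.
if (file.exists(svi_2010_path)) {
  svi_2010 <- readRDS(svi_2010_path)
}
if (file.exists(steps_path)) {
  source(steps_path)
}

print(svi_2010_path)
```

Because every path is expressed relative to the project root, a teammate who clones the repository to a different folder can run the same lines unchanged.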
As long as you implement these tools in your labs, you will not need to make any changes for the final report. In addition, as you work on your labs, it will also be beneficial to be mindful of what output from your code chunks should be displayed in the report and what output should be hidden.
For example, while it is important to display numbers that have been calculated, it is not good practice to include warnings and messages from loading packages or downloading data from an API in your report. You can control what you would like to include and exclude in your report by adjusting the settings of your R chunks in RStudio before you knit your .RMD file to an .MD and/or .HTML file.
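One common way to apply such settings, shown here as a minimal sketch, is to set default chunk options once in a setup chunk near the top of the .RMD file. The particular choices below (show code, hide messages and warnings) are illustrative assumptions, not course requirements.

```r
library(knitr)

# Default options applied to every chunk that follows in the document:
# echo = TRUE     keeps the code visible in the report
# message = FALSE hides package start-up and download messages
# warning = FALSE hides warnings from the knitted output
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Confirm the defaults that are now in effect.
print(knitr::opts_chunk$get(c("echo", "message", "warning")))
```

The same options can also be set on a single chunk in its header, for example {r, echo=FALSE}, when only one chunk needs different behavior.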
The resources below can assist you with implementing these concepts:
Resources
- RStudio Projects: https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects
- here() package: https://here.r-lib.org/
- File Paths: https://www.r4epi.com/file-paths.html
- Relative File Paths: https://ytakemon.github.io/2019-10-22-R-BCCRC/02-filedir/
- More on relative file paths: https://excelquick.com/r-programming/importing-data-absolute-and-relative-file-paths-in-r/
- Import package: https://cran.r-project.org/web/packages/import/vignettes/import.html
- How to import functions with source(): https://www.earthdatascience.org/courses/earth-analytics/multispectral-remote-sensing-data/source-function-in-R/
- Code Chunk Options: https://kbroman.org/knitr_knutshell/pages/Rmarkdown.html
- More on Code Chunk Options: https://yihui.org/knitr/options/#chunk_options
- Even More on Code Chunk Options: https://rmarkdown.rstudio.com/lesson-3.html
Git & GitHub Repository
As discussed in the videos from the Duke Center for Data and Visualization Services on reproducibility, knowing how to utilize version control and create repository structures for your data projects via Git and GitHub are essential skills.
Within this course’s Canvas site there is a survey asking you to submit your GitHub username.
Once you submit the survey, you will be sent an invite to join your team’s repository. You will use this shared space to store your data, create a Kanban board, and connect RStudio on your local computer to GitHub Desktop.
If you do not already have Git and/or GitHub Desktop installed, please install these tools using the following links:
- Git: https://git-scm.com/downloads
- GitHub Desktop: https://desktop.github.com/
The following video provides a great overview of GitHub Desktop/Git in action. However, please note that the programmer is using VSCode instead of RStudio so do not follow along with the tutorial. We will go through the process of connecting RStudio and GitHub in the lecture video, but the explanations in this video are also helpful:
Github Repo File Structure
As displayed in the video, GitHub Desktop will allow us to seamlessly connect the files on our local computer to our remote repository on GitHub where we will share our data/code with our teammates.
When you receive access to your team’s repository, you will find that there is already a general structure with folders, the raw data you will need, and simple README files in the folders.
However, by the end of your project you will want your GitHub repository to be similar to Minnie, Mickey, Nala, and Sparky’s below:
. (root/main/project directory, such as Watts-College/paf-515-spr-1946-group-01)
├── README.md
├── analysis
│   ├── README.md
│   ├── project_data_steps_minnie.R
│   ├── project_data_steps_mickey.R
│   ├── project_data_steps_nala.R
│   └── project_data_steps_sparky.R
├── assets
│   ├── README.md
│   ├── css
│   │   └── README.md
│   └── images
│       └── README.md
├── <RStudio Project Label>.Rproj
├── imgs
│   └── README.md
├── resources
│   └── README.md
├── data
│   ├── README.md
│   ├── raw
│   │   ├── Census_Data_SVI
│   │   │   ├── README.md
│   │   │   ├── Census_2010_Geography_Notes.pdf
│   │   │   ├── census_data_svi_2010_variables.txt
│   │   │   ├── census_data_svi_2020_variables.txt
│   │   │   ├── census_regions.xlsx
│   │   │   ├── svi_2010_trt10.rds
│   │   │   └── svi_2020_trt10.rds
│   │   └── NMTC_LIHTC_tracts
│   │       ├── README.md
│   │       ├── nmtc_2011-2015_lic_110217.xlsx
│   │       ├── NMTC_Public_Data_Release_includes_FY_2021_Data_final.xlsx
│   │       ├── qct_data_2010_2011_2012.xlsx
│   │       ├── lihtcpub.zip
│   │       └── lihtcpub
│   │           ├── LIHTCPUB.csv
│   │           └── LIHTC Data Dictionary 2021.pdf
│   ├── wrangling
│   │   ├── README.md
│   │   ├── South_Atlantic_Division_county_svi_flags10.rds
│   │   ├── South_Atlantic_Division_county_svi_flags20.rds
│   │   ├── South_Atlantic_Division_st_sf.rds
│   │   ├── South_Atlantic_Division_svi_divisional_lihtc.rds
│   │   ├── South_Atlantic_Division_svi_divisional_nmtc.rds
│   │   ├── South_Atlantic_Division_svi_national_lihtc.rds
│   │   ├── South_Atlantic_Division_svi_national_nmtc.rds
│   │   ├── Pacific_Division_county_svi_flags10.rds
│   │   ├── Pacific_Division_county_svi_flags20.rds
│   │   ├── Pacific_Division_st_sf.rds
│   │   ├── Pacific_Division_svi_divisional_lihtc.rds
│   │   ├── Pacific_Division_svi_divisional_nmtc.rds
│   │   ├── Pacific_Division_svi_national_lihtc.rds
│   │   ├── Pacific_Division_svi_national_nmtc.rds
│   │   ├── West_South_Central_Division_county_svi_flags10.rds
│   │   ├── West_South_Central_Division_county_svi_flags20.rds
│   │   ├── West_South_Central_Division_st_sf.rds
│   │   ├── West_South_Central_Division_svi_divisional_lihtc.rds
│   │   ├── West_South_Central_Division_svi_divisional_nmtc.rds
│   │   ├── West_South_Central_Division_svi_national_lihtc.rds
│   │   ├── West_South_Central_Division_svi_national_nmtc.rds
│   │   ├── Mountain_Division_county_svi_flags10.rds
│   │   ├── Mountain_Division_county_svi_flags20.rds
│   │   ├── Mountain_Division_st_sf.rds
│   │   ├── Mountain_Division_svi_divisional_lihtc.rds
│   │   ├── Mountain_Division_svi_divisional_nmtc.rds
│   │   ├── Mountain_Division_svi_national_lihtc.rds
│   │   └── Mountain_Division_svi_national_nmtc.rds
│   └── rodeo
│       ├── README.md
│       ├── South_Atlantic_Division_svi_divisional_lihtc.rds
│       ├── South_Atlantic_Division_svi_divisional_nmtc.rds
│       ├── South_Atlantic_Division_svi_national_lihtc.rds
│       ├── South_Atlantic_Division_svi_national_nmtc.rds
│       ├── Pacific_Division_svi_divisional_lihtc.rds
│       ├── Pacific_Division_svi_divisional_nmtc.rds
│       ├── Pacific_Division_svi_national_lihtc.rds
│       ├── Pacific_Division_svi_national_nmtc.rds
│       ├── West_South_Central_Division_svi_divisional_lihtc.rds
│       ├── West_South_Central_Division_svi_divisional_nmtc.rds
│       ├── West_South_Central_Division_svi_national_lihtc.rds
│       ├── West_South_Central_Division_svi_national_nmtc.rds
│       ├── Mountain_Division_svi_divisional_lihtc.rds
│       ├── Mountain_Division_svi_divisional_nmtc.rds
│       ├── Mountain_Division_svi_national_lihtc.rds
│       └── Mountain_Division_svi_national_nmtc.rds
└── labs
    ├── README.md
    ├── wk01
    │   ├── imgs
    │   │   └── any image files from the lab
    │   ├── README.md
    │   ├── lab01_sparky.rmd
    │   ├── lab01_sparky.md
    │   ├── lab01_sparky.html
    │   ├── lab01_mickey.rmd
    │   ├── lab01_mickey.md
    │   ├── lab01_mickey.html
    │   ├── lab01_minnie.rmd
    │   ├── lab01_minnie.md
    │   ├── lab01_minnie.html
    │   ├── lab01_nala.rmd
    │   ├── lab01_nala.md
    │   └── lab01_nala.html
    ├── wk02
    │   ├── imgs
    │   │   └── any image files from the lab
    │   ├── README.md
    │   ├── lab02_sparky.rmd
    │   ├── lab02_sparky.md
    │   ├── lab02_sparky.html
    │   ├── lab02_mickey.rmd
    │   ├── lab02_mickey.md
    │   ├── lab02_mickey.html
    │   ├── lab02_minnie.rmd
    │   ├── lab02_minnie.md
    │   ├── lab02_minnie.html
    │   ├── lab02_nala.rmd
    │   ├── lab02_nala.md
    │   └── lab02_nala.html
    ├── wk03
    │   ├── imgs
    │   │   └── any image files from the lab
    │   ├── README.md
    │   ├── lab03_sparky.rmd
    │   ├── lab03_sparky.md
    │   ├── lab03_sparky.html
    │   ├── lab03_mickey.rmd
    │   ├── lab03_mickey.md
    │   ├── lab03_mickey.html
    │   ├── lab03_minnie.rmd
    │   ├── lab03_minnie.md
    │   ├── lab03_minnie.html
    │   ├── lab03_nala.rmd
    │   ├── lab03_nala.md
    │   └── lab03_nala.html
    ├── wk04
    │   ├── imgs
    │   │   └── any image files from the lab
    │   ├── README.md
    │   ├── lab04_sparky.rmd
    │   ├── lab04_sparky.md
    │   ├── lab04_sparky.html
    │   ├── lab04_mickey.rmd
    │   ├── lab04_mickey.md
    │   ├── lab04_mickey.html
    │   ├── lab04_minnie.rmd
    │   ├── lab04_minnie.md
    │   ├── lab04_minnie.html
    │   ├── lab04_nala.rmd
    │   ├── lab04_nala.md
    │   └── lab04_nala.html
    ├── wk05
    │   ├── imgs
    │   │   └── any image files from the lab
    │   ├── README.md
    │   ├── lab05_sparky.rmd
    │   ├── lab05_sparky.md
    │   ├── lab05_sparky.html
    │   ├── lab05_mickey.rmd
    │   ├── lab05_mickey.md
    │   ├── lab05_mickey.html
    │   ├── lab05_minnie.rmd
    │   ├── lab05_minnie.md
    │   ├── lab05_minnie.html
    │   ├── lab05_nala.rmd
    │   ├── lab05_nala.md
    │   └── lab05_nala.html
    ├── wk06
    │   ├── imgs
    │   │   └── any image files from the lab
    │   ├── README.md
    │   ├── lab06_sparky.rmd
    │   ├── lab06_sparky.md
    │   ├── lab06_sparky.html
    │   ├── lab06_mickey.rmd
    │   ├── lab06_mickey.md
    │   ├── lab06_mickey.html
    │   ├── lab06_minnie.rmd
    │   ├── lab06_minnie.md
    │   ├── lab06_minnie.html
    │   ├── lab06_nala.rmd
    │   ├── lab06_nala.md
    │   └── lab06_nala.html
    └── wk07
        ├── README.md
        └── If any final revisions/files are needed they can be stored here
As you can see, they’ve labeled all of their labs with their names and all of their output data is labeled with the proper division. These naming conventions/labels are critical to ensure that you do not overwrite your teammates’ work while working in the same repository and that the data for your division is easily recognizable.
Always add your name/initials to the end of your personal .R, RMD, Markdown, and HTML files and double-check that your saved data files are labeled correctly. There will be reminders of this in every lab and code to help you with labeling data, but it’s also a good practice to remain mindful of this on your own so that you can apply these strategies to your future data science projects.
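One way to keep these labels consistent is to generate file names in code instead of typing them by hand. The helper functions below are a hypothetical sketch (make_lab_file() and make_division_file() are not part of the course code) that follows the naming pattern shown in the repository structure above.

```r
# Hypothetical helper: build a lab file name such as "lab01_sparky.rmd".
# The week number is zero-padded so files sort in the proper order.
make_lab_file <- function(week, name, ext = "rmd") {
  sprintf("lab%02d_%s.%s", week, tolower(name), ext)
}

# Hypothetical helper: build an output data file name labeled with the
# division, replacing spaces with underscores.
make_division_file <- function(division, suffix) {
  paste0(gsub(" ", "_", division), "_", suffix, ".rds")
}

lab_file  <- make_lab_file(1, "Sparky")
data_file <- make_division_file("South Atlantic Division", "svi_divisional_lihtc")

print(lab_file)   # "lab01_sparky.rmd"
print(data_file)  # "South_Atlantic_Division_svi_divisional_lihtc.rds"
```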
README Files
You may have noticed that each of the folders in the diagram above contains a README.md. So what is a README file? Make A README.com provides a fun, general overview: https://www.makeareadme.com/. However, more specific to our course purposes, your README files will serve as a summary of each segment of your project.
The main README (listed under the root of your project/“the main folder” of your repository) should give an overview of what the project was about, who worked on the project, and how the repository can be pulled/reproduced (provide a brief overview of the steps to fork and/or download the repo).
The sub-README files that are listed under the subfolders should provide a brief summary of what’s expected in each folder. For example, the labs/wk01/README.md file should explain that the folder contains the .RMD, .MD, and .html files used to analyze and create a report for Lab 01. The README should also include a brief summary of Lab 01 that includes the purpose and findings of the lab.
The READMEs that will come along with your website template are sufficient to describe the template. You do not need to create new READMEs for those folders unless you make edits to the template itself. In that case, edit the existing README to describe what changes you made to the code (CSS/HTML/SASS, etc.).
Data & Passwords
Almost all of the data needed for this project will be provided in the raw folder of the repo. The only additional step that you will need to take (if you have not already done it for past courses) is to request a Census API key: https://api.census.gov/data/key_signup.html.
You will want to keep this API key private (think of it as a password), so you will need to store it in a separate password.R file that you create and then import it into your .RMD file where you conduct the rest of your analyses.
To do this you will want to follow these steps:
- Create a password.R file in the root of the repository folder on your local computer and add one line:
census_api_key = "YOUR_API_KEY"
Save the password.R file and then import it into the .RMD file where you will be working with the Census data.
Code
# Load here package
library(here)
# Find relative file path to password.R file
my_api <- here::here("password.R")
# Load census_api_key from password.R file
source(my_api)
# Print loaded census_api_key variable to check that the key is correct
# (remember that you will not want to include this step in your final report
# as it will reveal the key, but it will allow you as the developer
# to check that you're getting your expected output)
print(census_api_key)
[1] "abcdefghijklmnopqrstuvwxyz123456789"
Note: As an example of the code chunk options mentioned in the [Portability and Reproducibility] section, the code chunk listed here is written as follows:
Important Note: In order for the here package to work properly, you need to always ensure that you are working within your RStudio Project. You can check this by seeing if the name of your repo is next to the blue box in the top right-hand corner of RStudio.
- Add the following to your .gitignore file:
# Ignore password file
password.R
Additional details on how to pull API keys from a separate file are available in the following tutorial, Accessing Web APIs: https://info201.github.io/apis.html.
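If you would like to confirm that the key loaded without printing it, a check along the following lines may help. This is only a sketch under stated assumptions: a stand-in password.R containing the placeholder key is written to a temporary folder so the example runs anywhere, and the secrets environment is a hypothetical name used to keep the key out of your global workspace.

```r
# Stand-in for password.R so the sketch runs anywhere.
# The key below is a placeholder, not a real Census API key.
password_file <- file.path(tempdir(), "password.R")
writeLines('census_api_key = "YOUR_API_KEY"', password_file)

# Source the file into its own environment instead of the global workspace.
secrets <- new.env()
source(password_file, local = secrets)

# Check that the key exists and is not empty WITHOUT revealing its value.
key_loaded <- exists("census_api_key", envir = secrets) &&
  nzchar(secrets$census_api_key)
key_length <- nchar(secrets$census_api_key)

message("Key loaded: ", key_loaded, " (", key_length, " characters)")
```

In a lab file, password_file would instead point to here::here("password.R"), and the writeLines() step would be dropped because the real file already exists on your computer.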
Once you have access to your API key and you have processed/uploaded the data for your individual labs, you will not need to make any additional changes for the final project.
Assignments
Kanban Board
In addition to managing version control and storing your data and code, GitHub also offers project boards known as ‘Kanban Boards’.
What is Kanban?
Kanban is a visual system for managing work as it moves through a process. Kanban visualizes both the process (the workflow) and the actual work passing through that process. The goal of Kanban is to identify potential bottlenecks in your process and fix them so work can flow through it cost-effectively at an optimal speed or throughput.
The Kanban Method is an evolutionary improvement process. It helps you adopt small changes and improve gradually at a pace and size that your team can handle easily. It encourages the use of the scientific method – you form a hypothesis, you test it and you make changes depending on the outcome of your test… Your key task is to evaluate your process constantly and improve continuously as needed and as possible.
Kanban Boards in GitHub
Just as GitHub Pages is a powerful feature available in each repository, GitHub also has project management tools built right in. For this course, you will be provided with a project in your team’s repo to practice utilizing Kanban boards.
As part of Lab 01, you and your teammates will create a Kanban Board.
The Kanban board should have four columns:
- Ideas
- To-Do
- Doing
- Done
Next, you and your team will want to break down the following project tasks into cards under To-Do:
- Creating Executive Summary (review the sections of the exec summary above)
- Tasks/sub-tasks for each member’s respective division
- Ensure all labs for your division are completed and error-free
- Ensure that code-chunk options have been set appropriately to hide warning/error messages
- Ensure that all functions are loaded from project_data_steps.R file and the code is not present in the output file
- Ensure that all code for visualizations is hidden from showing in the output file
- Ensure that all output data is limited and that there are not excessively long data sets in files
- Ensure that private information (such as API keys) is not visible anywhere in your code or directly in your R Markdown file. API keys must be imported from the password.R file or stored in the .Renviron file. Remember, it’s important to never store passwords on GitHub. If you don’t know whether you are about to send confidential information, or if you already have, reach out to the instructors immediately.
- Add password.R and any other unneeded file names and file types to the .gitignore file to prevent them from syncing to the team repository. This can include cache folders that can become quite large and do not need to be on the repo
- Use the R specific .gitignore file courtesy of GitHub
- Ensure that your final repository is up-to-date and that anyone can run the project code on their computer without having to change working directories or other settings, by using relative file paths via here::here() from the here package (this will be discussed in labs)
- Report the R environment and package versions used to create the analysis by utilizing the renv package (this will also be discussed in labs)
- Update README.md files with details of the contents of repo folders
- Select a license for your project. The MIT License is a popular one for this project, but feel free to choose one amongst your team. You will also want to be sure that you include all of your teammates’ names on the license. Instructions for how to do this are below:
- Ensure that all GitHub commits have useful names and clear descriptions
- Good example: ‘Create .rds file that stores original and final predictive model’
- Bad example: ‘Updated files’
- In general, good commits start with a present tense verb and summarize your work in 50 characters or fewer. If the instructors can’t tell exactly what change/edit was made, neither will you/your team six months from now.
- Review file naming and folder structure conventions for entire repo, below are some guidelines:
- Never capitalize unless necessary or helpful for emphasis
- Use an underscore (_) instead of a space in file and folder names since spaces can cause a lot of problems in paths and are usually replaced with arbitrary characters.
- Try to have short but meaningful file names that are memorable
- Add a README.md file inside of each folder with, at minimum, a one-sentence description of the directory.
- When order matters (for example, steps in an analysis or chapters of a report), name files so that the file order matches the order in which they should be run, such as step_01, step_02, etc.
- Use consistent naming throughout project, including rules for capitalization and dates
- Effective use of leading zeros to maintain proper file order (09, 10, 11, … )
- Create final project website that contains the following (there will be a tutorial to assist with this in Wk 06-07):
- An index.md file. You will create an index.md file to serve as the landing page for the report, and link to report chapters from that document. Individual chapters should be stored as separate RMD/MD files.
- Website must be active and live
- All links must work properly
- GitHub pages template must be clean and effective (Suggested templates/examples will be provided)
- If custom CSS is used, it must also be consistent for report style
- Landing page must include links to the following:
- A table of contents
- A link to files on the GitHub repo with description of content
- Replication instructions (software needed, how to access files, etc)
- License info for project (copy from repo License or link to repo license)
- About Us page
- Add any additional tasks you feel are useful after reviewing the final project rubric
It can take some practice to learn the best way to break complex operations down into discrete tasks. However, think about it like wedding planning - sending out invitations is one task, booking a venue another, etc. But you can break down a task like invitations much further:
- create invite list
- acquire addresses
- track RSVPs
- finalize attendance list
You can create one large task with a check-list of sub-tasks, or a set of distinct tasks. Both are viable ways to organize the work.
Note that in GitHub Kanban Boards a check-box is created like this:
* [ ] finalize attendance list
And to close it add an X to the box:
* [X] finalize attendance list
Once your tasks are created, assign immediate tasks to each person, add their names to their cards, and move the cards to Doing.
Finally, update the board each week with completed tasks, new assignments, and new tasks.
Resources
- Read Project Rubric Website https://watts-college.github.io/cpp-528-template/project/project_rubric.html
- Project Rubric Doc https://docs.google.com/document/d/1ZJh3p8x2mX96Fuarl-q6_Dy8DeANXeRV/edit?usp=sharing&ouid=116435783742528000787&rtpof=true&sd=true
- Final Project Guide https://r-class.github.io/cpp-528-example-repo/Final-Project-Guide.html
- Read GitHub guide on project boards: https://docs.github.com/en/issues/planning-and-tracking-with-projects/managing-items-in-your-project/adding-items-to-your-project
- Review how to work with labels with GitHub Labels Instructions: https://docs.github.com/en/issues/using-labels-and-milestones-to-track-work/managing-labels
- Watch lecture video above for step-by-step instructions
Project Data Steps File
In addition to creating your Kanban board this week, you will also want to create a .R file and name it project_data_steps_[your initials/name].R (for example, my file labeled with my initials would be project_data_steps_CS.R). Save the file to the analysis folder.
In this file, you can copy and paste the following code and update it with your information. Ensure that you have a variable called author that has your name assigned to it and a variable called census_division with your selected census division assigned to it.
Code
#
# Author: Courtney Stowers
# Date: December 11, 2023
# Purpose: Create custom functions to process data for SVI Tax Credit Project
#
# Library ----
library(here)
library(tidyverse)
library(stringi)
library(kableExtra)
library(tidycensus)
# Variables ----
author <- "Courtney Stowers"
census_division <- "Middle Atlantic Division"
You can check the census_regions.xlsx data file in the raw folder for the exact naming convention of your division. Note that we can call a function from a package without loading the package by prefixing the function with the package name and double colons, as in readxl::read_excel():
Code
# A tibble: 9 × 1
Division
<chr>
1 New England Division
2 Middle Atlantic Division
3 East North Central Division
4 West North Central Division
5 South Atlantic Division
6 East South Central Division
7 West South Central Division
8 Mountain Division
9 Pacific Division
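The call that produced the output above is not shown. A minimal sketch of what it might look like follows, assuming the workbook is stored at data/raw/census_regions.xlsx inside the project (the exact folder path is an assumption; only the file name, the raw folder, and the Division column appear in this lab).

```r
# Assumed location of the workbook; adjust to match your repo layout
regions_path <- here::here("data", "raw", "census_regions.xlsx")

# Guarded so the snippet does nothing if the file lives somewhere else
if (file.exists(regions_path)) {
  # Package-qualified call, so readxl does not need to be loaded with library()
  census_regions <- readxl::read_excel(regions_path)

  # Division is the column name shown in the output above
  print(unique(census_regions["Division"]))
}
```

Copy the division name exactly as it is printed, including the word "Division", since later filtering steps match on the full string.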
Next, create a password.R file following the instructions from the Data & Passwords section above and save it to the analysis folder alongside your project_data_steps file.
In your .RMD file, load your census_division variable using the import and here packages, and load your API key using the source() function:
Code
[1] "Middle Atlantic Division"
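The import call behind the output above is not shown. One way to write it is sketched below, assuming your file is named project_data_steps_CS.R (replace CS with your own initials) and sits in the analysis folder. This is an illustration of import::here() combined with here::here(), not necessarily the exact call used to produce the output.

```r
# Hypothetical file name: replace CS with your own initials
steps_file <- here::here("analysis", "project_data_steps_CS.R")

# Guarded so the snippet does nothing if the file is not there yet
if (file.exists(steps_file)) {
  # Import only census_division from the script into the current environment;
  # .character_only = TRUE lets us pass the object name and path as strings
  import::here("census_division", .from = steps_file, .character_only = TRUE)
  print(census_division)
}
```

Importing a single named object keeps the other variables and library() calls in the script from cluttering your .RMD environment.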
Code
# Load the API key and register it with tidycensus; do not print the key in your output
source(here::here("analysis/password.R"))
census_api_key(census_api_key)
Once we have our census API key, we can load data from the American Community Survey’s subject tables.
Code
census_variables <- load_variables(2020, "acs5/subject", cache = TRUE)
census_variables %>% head() %>% kbl() %>% kable_styling() %>% scroll_box(width = "100%")
name | label | concept |
---|---|---|
S0101_C01_001 | Estimate!!Total!!Total population | AGE AND SEX |
S0101_C01_002 | Estimate!!Total!!Total population!!AGE!!Under 5 years | AGE AND SEX |
S0101_C01_003 | Estimate!!Total!!Total population!!AGE!!5 to 9 years | AGE AND SEX |
S0101_C01_004 | Estimate!!Total!!Total population!!AGE!!10 to 14 years | AGE AND SEX |
S0101_C01_005 | Estimate!!Total!!Total population!!AGE!!15 to 19 years | AGE AND SEX |
S0101_C01_006 | Estimate!!Total!!Total population!!AGE!!20 to 24 years | AGE AND SEX |
We can then pull our data from the ACS to view the overall total population, male population, and female population, filter to our division, and join the ACS data with census variable details:
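The pull and filter steps that create acs_pull are not shown. A minimal sketch is given here, assuming the 2020 five-year ACS at the division geography and the three variable codes that appear in the result table. The function name pull_division_population is hypothetical, and the year, survey, and geography arguments are assumptions chosen to match load_variables(2020, "acs5/subject") above.

```r
# Hypothetical helper: pull total, male, and female population for one division
pull_division_population <- function(division, year = 2020) {
  acs <- tidycensus::get_acs(
    geography = "division",
    # Total, male, and female population from subject table S0101
    variables = c("S0101_C01_001", "S0101_C03_001", "S0101_C05_001"),
    year = year,
    survey = "acs5"
  )
  # Keep only the rows for the selected census division
  dplyr::filter(acs, NAME == division)
}

# Only run the pull when an API key has been registered in this session
# and census_division has been loaded
if (nzchar(Sys.getenv("CENSUS_API_KEY")) && exists("census_division")) {
  acs_pull <- pull_division_population(census_division)
}
```

Filtering on NAME relies on census_division matching the division name exactly as the Census API returns it, which is why the spelling from census_regions.xlsx matters.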
Code
# Join data set with census_variable df
left_join(acs_pull, census_variables, join_by("variable" == "name")) %>% kbl(format.args = list(big.mark = ",")) %>% kable_styling() %>% scroll_box(width = "100%")
GEOID | NAME | variable | estimate | moe | label | concept |
---|---|---|---|---|---|---|
2 | Middle Atlantic Division | S0101_C01_001 | 41,195,152 | NA | Estimate!!Total!!Total population | AGE AND SEX |
2 | Middle Atlantic Division | S0101_C03_001 | 20,084,449 | 1,631 | Estimate!!Male!!Total population | AGE AND SEX |
2 | Middle Atlantic Division | S0101_C05_001 | 21,110,703 | 1,632 | Estimate!!Female!!Total population | AGE AND SEX |
Lab Submission Instructions
Congratulations! You’ve reached the end of the Lab-01 Tutorial!
You are now ready to complete your lab and submit it on Canvas.
The following checklist will ensure that you’re on track: