Introductions

About the course

  • Science: systematic organization & creation of knowledge, in the form of testable explanations/predictions about the world

  • Data Science: field that uses scientific methods to extract knowledge & insights from data
    • The scientific method involves making conjectures and testing them using observation or experimentation.


  • In short, this course is about answering questions using data.

  • Unfortunately, we cannot answer every question using data.


  • But certain types of questions can be reasonably addressed with data
    • Estimation: What % of the population gets the flu?
    • Inference: Should I take drug X for the flu?
    • Prediction: Will I get the flu?

Data Science in Action

Course Goals

  • Be able to conduct a systematic investigation of a question/problem using data. In particular:
    • Organize & manipulate data
    • Explore & investigate data graphically
    • Deal with variability & estimation
    • Formulate & test statistical hypotheses
    • Discover patterns in data & use them for prediction
    • Develop strong statistical computing skills
    • Communicate statistical ideas & results effectively

Resources

Course Evaluation

Item        Weight  Notes
Worksheets  15%     semi-weekly, best 20/23
Project     20%     progress reports & final presentation
Midterm     25%     computer-based
Final       40%     computer-based

Course Project

  • Project Topic: Is university education worth it?

  • Purposefully open-ended question; answer various aspects using data
    • Collect and examine relevant data
    • Formulate interesting questions
    • Use data to address them convincingly

  • Hope you don't drop out based on your analysis :)

Lecture Goals

  • Getting started with R/RStudio
    • Using the RStudio IDE
    • Quick run-through of R
    • Writing RMarkdown reports
  • Readings

RStudio

RStudio Workflow

  • Create a folder and an associated RStudio project
    • Container for your data, code, and output files


  • Write code (.R/.Rmd scripts) in Editor
  • Run commands/scripts, interact with R in Console
  • Check available variables/data in Environment
  • Look at output/documentation/files in Viewer

R Philosophy

  • Information is contained in objects
    • E.g. data, variables, models, plots
  • Operations are performed by functions
    • E.g. sorting data, fitting models, plotting results
  • Carry out analysis by applying functions to objects
my_numbers = 1:3
sum(my_numbers)
## [1] 6
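
The same idea on a slightly larger scale, as a minimal sketch: it uses the built-in mtcars data set (chosen here only for illustration, it is not part of the course material) to show objects being passed to functions that sort data and fit a model.

```r
# Object: a built-in data frame that ships with R
cars_data = mtcars

# Function applied to an object: sort the rows by fuel efficiency (mpg)
sorted_cars = cars_data[order(cars_data$mpg), ]

# Function that returns a new object: a fitted linear model
fit = lm(mpg ~ wt, data = cars_data)

# Functions applied to the model object
coef(fit)    # estimated intercept and slope
class(fit)   # the type of object that lm() returned
```

Note that the result of one function (here, the model stored in fit) is itself an object that other functions can work on.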

R packages

  • R comes pre-loaded with basic functions; for extra functionality there are two options:
    • Create your own: R programming
    • Use someone else's: R Packages
  • R Packages are bundles of reusable functions, data, & documentation
    • Packages must be downloaded once w/ install.packages()
    • To use a package, you must load it into R session w/ library()
install.packages("package_name")
library(package_name)
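
Since installation is needed only once but loading is needed in every session, one possible pattern is to install a package only when it is missing. The helper name load_or_install below is hypothetical (it is not part of R), and the sketch assumes internet access whenever an install is actually triggered.

```r
# Hypothetical helper (not part of R): install a package only if it is missing,
# then load it into the current session
load_or_install = function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # runs only when the package is not yet installed
  }
  # character.only = TRUE lets us pass the package name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so nothing is downloaded in this example
load_or_install("stats")
```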

Getting Help

  • Get function documentation using help() or ?
    • See description, arguments, results & examples
help(fun_name)
?fun_name
  • For help on packages
help( package = "package_name" )

Input/Output

  • Read/write tabular data from/to a spreadsheet-like file (.csv, i.e. comma-separated values)
my_data = read.csv( file = "C:/Users/Sotiris/Documents/data.csv")
write.csv( my_data, file = "C:/Users/Sotiris/Documents/data.csv")
  • Read/write binary representation of selected R objects
save( obj1, obj2, file = "C:/Users/Sotiris/Documents/my_objects.Rdata")
load( file = "C:/Users/Sotiris/Documents/my_objects.Rdata")
  • Save all objects in the workspace to a .RData file
save.image()
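
A minimal round-trip sketch of the functions above. The data frame and file names are made up for illustration, and temporary files are used so that nothing on your machine is overwritten.

```r
# Hypothetical example data
my_data = data.frame(id = 1:3, score = c(90.5, 85, 77.25))

# Temporary files keep the sketch self-contained
csv_file = tempfile(fileext = ".csv")
rdata_file = tempfile(fileext = ".RData")

# row.names = FALSE stops write.csv() from adding an extra column of row names
write.csv(my_data, file = csv_file, row.names = FALSE)
my_data_csv = read.csv(file = csv_file)

# save() stores objects under their names; load() restores them under the same names
save(my_data, file = rdata_file)
rm(my_data)
load(file = rdata_file)

# Compare the restored object with the copy read back from the .csv file
all.equal(my_data, my_data_csv)
```

Unlike read.csv(), load() does not need an assignment: it recreates my_data in the workspace directly.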

Directories

  • Absolute file paths ('C:/Users/Sotiris/…') are BAD for reproducibility

  • Instead, use relative paths with respect to your working directory
    • Find/set working directory with getwd()/setwd()
    • Access files under the working directory directly with a relative path
getwd()
my_data = read.csv( file = 'data_folder/data.csv')
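
A small sketch of how a relative path is resolved against the working directory. The folder and file are created inside a temporary directory purely for illustration; in an RStudio project the working directory would normally already be the project folder.

```r
# Work in a temporary folder so the sketch does not touch real files
old_wd = setwd(tempdir())

# Hypothetical folder and file, created only so there is something to read
dir.create("data_folder", showWarnings = FALSE)
rel_path = file.path("data_folder", "data.csv")   # builds "data_folder/data.csv"
write.csv(data.frame(x = 1:3), file = rel_path, row.names = FALSE)

# The relative path is resolved against the current working directory
my_data = read.csv(file = rel_path)
full_path = normalizePath(rel_path)   # the absolute path R actually used

setwd(old_wd)   # restore the original working directory
```

The code itself only mentions data_folder/data.csv, so it keeps working when the whole project folder is moved or shared.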

R Markdown

  • R Markdown is a framework for creating reproducible reports

  • R Markdown scripts combine:
    • R code for performing data analysis
    • Markdown code for authoring documents
  • In fact, these very slides are written in R Markdown

  • Let's create your first report with R Markdown
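
As a rough sketch of what such a report looks like, the code below writes a minimal .Rmd file from R. The title, text, and file name are placeholders, and the commented-out last step assumes the rmarkdown package is installed.

```r
# Three backticks delimit a code chunk inside an R Markdown file
fence = strrep("`", 3)

rmd_lines = c(
  "---",                                  # YAML header: report settings
  'title: "My first report"',
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text describing the analysis.",
  "",
  paste0(fence, "{r}"),                   # start of an R code chunk
  "my_numbers = 1:3",
  "sum(my_numbers)",
  fence                                   # end of the chunk
)

# Write the report source to a temporary folder
rmd_file = file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_file)

# To knit the report (assumes the rmarkdown package is installed):
# rmarkdown::render(rmd_file)
```

In RStudio the usual route is File > New File > R Markdown followed by the Knit button, which produces the same kind of file and output.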