Introduction to Data Science

STAT 220

Bastola

Something about me

  • Third year at Carleton
  • Originally from Nepal
  • Research in Bayesian computation and machine learning
  • Avid learner and traveler


My website


What is data science?

Data science is the application of computational and statistical techniques to gain insight into real-world problems.

\[ \begin{align*} \text{Data Science} &= \text{scientific inquiry } +\\ & \quad \text{ data collection } +\\ & \quad \text{ data processing } +\\ & \quad \text{ visualization } +\\ & \quad \text{ statistics } +\\ & \quad \text{ machine learning } +\\ & \quad \text{ communication } \end{align*} \]

Data Science in a Nutshell

Image adapted from work of Joe Blitzstein, Hanspeter Pfister, and Hadley Wickham

Introduction to Data Science

Focus on the “soup to nuts” approach to problem solving

  • data wrangling
    • reshaping, cleaning, gathering
  • learning from data
    • EDA tools
    • statistical learning methods
  • communication
    • reproducibility
    • effective visualization

Gendered language in professor reviews

Rate My Professors reviews

Active Participation and Experimentation

  • Engage actively with the provided .Rmd documents during class
    • Use these for note-taking and running code live.
  • Ask Questions
    • Every question is valuable; we’re learning together.
  • Experiment
    • Trial and error is key; explore and experiment with the code.

Tell me something about yourself!

  • Your name?
  • Gender Pronouns?
  • Why are you interested in data science?
  • Your favorite data scientist/scientist?

Class Pipelines

https://stat220-spring24.netlify.app

  • Please bookmark this page and check it for tutorials and solutions to class activities.
  • Most of the course information and the schedule will be posted on Moodle.
  • Use Moodle/Gradescope to submit class activities and homework and to see your grades.
  • The GitHub class organization hosts all the files relevant to the course.

Necessary skills to be mastered

  • programming with data
  • statistical modeling
  • domain knowledge
  • communication

What will a typical day/week look like?

  • Before class:
    • Some reading/video to introduce some topics
    • Work on homework/projects, come with questions
  • During class:
    • Mini lectures
    • Class activities

What you need to do next …

R vs. Python for data science

“R is written by statisticians, for statisticians,” — Norm Matloff, Author of The Art of R Programming, Prof. of Computer Science, UC Davis


Advantages of R over Python:

  • A gentler learning curve than Python
  • R has many generic functions that work across object types, e.g. print(), plot(), summary()
  • The Comprehensive R Archive Network (CRAN) hosts many user-friendly packages
  • R’s basic help() and example() functions are much more informative than Python’s counterparts
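
As a small illustration of the generic functions mentioned above, the sketch below calls summary() on three different kinds of objects, and R picks a suitable method for each. The data values are made up for illustration only.

```r
# Generic functions choose a method based on the class of their argument.
x   <- c(2.5, 3.1, 4.7, 5.0, 6.2)           # numeric vector (made-up values)
f   <- factor(c("a", "b", "a", "c", "a"))   # categorical variable (made-up values)
fit <- lm(x ~ seq_along(x))                 # simple linear regression on the index

sx   <- summary(x)    # min, quartiles, median, mean, max
sf   <- summary(f)    # counts per factor level
sfit <- summary(fit)  # coefficient table, R-squared, and more

print(sx)
print(sf)
print(class(sfit))

# The built-in help system works the same way for any function.
# Try these interactively:
#   help("summary")
#   example("mean")
```

The same pattern applies to print() and plot(): one function name, with behavior that adapts to the object it receives.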

Using R Markdown for Data Science

  • In this class, we will use R Markdown for all our work, leveraging its comprehensive support for data science projects.
  • An R Markdown document (.Rmd) integrates:
    • R code for dynamic analysis and visualization.
    • Descriptive text
  • Compiling (rendering) a .Rmd file produces various output formats
    • Documents: PDF, HTML, Word.
    • Presentations: HTML-based slides, PDF slides via Beamer.
    • Interactive content: Web dashboards, interactive visualizations.
  • R Markdown is engineered with reproducibility as a core principle, facilitating transparent and repeatable research workflows.
  • The presentations for this class are made using Quarto Markdown
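
To make the rendering step concrete, here is a minimal sketch that writes a tiny .Rmd file and compiles it to HTML. The file name, title, and contents are hypothetical, and the render call is skipped if the rmarkdown package or pandoc is not available on your machine.

```r
# Build the chunk fence programmatically so this example stays easy to copy.
fence <- strrep("`", 3)

# A minimal R Markdown document: YAML header, descriptive text, one R chunk.
# (File name and contents are illustrative assumptions.)
rmd_lines <- c(
  "---",
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "Descriptive text goes here.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",   # cars is a dataset that ships with R
  fence
)

rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)

# Render only when the required tools are installed.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
  print(out_file)  # path of the generated HTML file
}
```

In RStudio, clicking the Knit button performs the same rendering step for the open .Rmd file.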

Version Control using Git and GitHub

  • User: A GitHub account for you (e.g., deepbas).
  • Organization: A GitHub account shared by one or more users (e.g., DataScienceSpring24).
  • Repository: A folder within the organization that includes files dedicated to a project.
  • Local GitHub: Copies of GitHub files located on your computer.
  • Remote GitHub: GitHub files located on the https://github.com website.
  • Clone: Process of making a local copy of a remote GitHub repository.
  • Pull: Copy changes from the remote GitHub repository to your local GitHub repository.
  • Push: Save local changes to the remote GitHub repository.

Using GitHub and Posit for data science

  • Integrate R Markdown with GitHub for version control:
    • Create a GitHub repository.
    • Clone the repo and set up an R Markdown project in RStudio.
    • Work with various files (.Rmd, .r, .csv, etc.).
    • Commit changes locally and push to GitHub.
    • Pull updates from others into your local workspace.
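
The clone, commit, push, and pull steps above can be tried without touching GitHub at all. The sketch below is one possible way to do that from R, assuming git is installed: a bare repository in a temporary folder stands in for the remote on github.com, and run_git() is a hypothetical helper defined here, not a function from any package. In practice you would more likely use the Git pane in RStudio or a terminal.

```r
# Hypothetical helper: run one git command inside `dir`, stopping on failure.
# A throwaway identity is supplied so commits work on any machine.
run_git <- function(dir, ...) {
  args <- c("-C", dir,
            "-c", "user.name=Student",
            "-c", "user.email=student@example.com",
            ...)
  status <- system2("git", shQuote(args), stdout = FALSE, stderr = FALSE)
  if (status != 0) stop("git command failed: ", paste(c(...), collapse = " "))
  invisible(status)
}

base   <- tempfile("git-demo-")
dir.create(base)
remote <- file.path(base, "remote.git")     # stand-in for the GitHub repository
mine   <- file.path(base, "my-copy")        # your local copy
theirs <- file.path(base, "teammate-copy")  # a collaborator's local copy

run_git(base, "init", "--bare", remote)     # create the (empty) remote

# Clone: make a local copy of the remote repository.
run_git(base, "clone", remote, mine)

# Work on a file, commit locally, then push to the remote.
writeLines("# Analysis notes", file.path(mine, "analysis.Rmd"))
run_git(mine, "add", "analysis.Rmd")
run_git(mine, "commit", "-m", "Add analysis file")
run_git(mine, "push", "origin", "HEAD")

# A teammate clones the same remote and sees your file.
run_git(base, "clone", remote, theirs)

# You push another change; the teammate pulls it into their copy.
cat("More notes\n", file = file.path(mine, "analysis.Rmd"), append = TRUE)
run_git(mine, "commit", "-am", "Extend analysis")
run_git(mine, "push", "origin", "HEAD")
run_git(theirs, "pull", "--ff-only")

pulled <- readLines(file.path(theirs, "analysis.Rmd"))
print(pulled)
```

With a real GitHub repository, the only difference is that the remote path is replaced by the repository URL.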

R Markdown enhances the workflow by seamlessly integrating executable code with narrative text, making your data science projects reproducible and collaborative.

Glimpse into the course

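library(babynames)  # US baby-name data by year and sex
library(dplyr)      # provides %>% and filter()
library(ggplot2)    # provides ggplot() and the geoms

your_name <- "Dee"
# Keep only the rows for the chosen name
your_name_data <- babynames %>% filter(name == your_name)

# Plot the proportion of babies given this name over time, one line per sex
ggplot(data = your_name_data, aes(x = year, y = prop)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_line(aes(colour = sex), linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_color_brewer(palette = "Set1") +
  labs(x = "Year",
       y = stringr::str_c("Prop. of Babies Named ", your_name),
       title = stringr::str_c("Trends in Names: ", your_name))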

Group Activity 1

  • Make a course folder called ‘stat220’ either on your Maize account or on your local computer
  • Please download the Class-Activity-1 template from Moodle and go to the class helper web page
