Introduction to Data Science

STAT 220

Bastola

Something about me

Third year at Carleton
Originally from Nepal
Research in Bayesian computation and machine learning
Avid learner and traveler

My website

What is data science?

Data science is the application of computational and statistical techniques to gain insight into some problem in the real world

\[ \begin{aligned} \text{Data Science} &= \text{scientific inquiry } +\\ & \quad \text{ data collection } +\\ & \quad \text{ data processing } +\\ & \quad \text{ visualization } +\\ & \quad \text{ statistics } +\\ & \quad \text{ machine learning } +\\ & \quad \text{ communication } \end{aligned} \]

Data Science in a Nutshell

Image adapted from work of Joe Blitzstein, Hanspeter Pfister, and Hadley Wickham

Introduction to Data Science

Focus on the “soup to nuts” approach to problem solving

data wrangling
- reshaping, cleaning, gathering
learning from data
- EDA tools
- statistical learning methods
communication
- reproducibility
- effective visualization

Gendered language in professor reviews

Rate my professor reviews

Active Participation and Experimentation

Engage actively with the provided .Rmd documents during class
- Use these for note-taking and running code live.
Ask Questions
- Every question is valuable; we’re learning together.
Experiment
- Trial and error is key; explore and experiment with the code.

Tell me something about yourself!

Your name?
Gender Pronouns?
Why are you interested in data science?
Your favorite data scientist/scientist?

Class Pipelines

https://stat220-spring24.netlify.app

Please bookmark this page and check it for tutorials and solutions to class activities.
Most of the course information and the schedule will be posted on Moodle.
Use Moodle/Gradescope to submit class activities and homework and to see your grades.
The GitHub class organization hosts all the files relevant to the course.

Necessary skills to be mastered

programming with data
statistical modeling
domain knowledge
communication

What will a typical day/week look like?

Before class:
- Some reading/video to introduce some topics
- Work on homework/projects, come with questions
During class:
- Mini lectures
- Class activities

What you need to do next…

read the Posit for Stat220 page
read the GitHub for Stat220 page
read the Software for Stat220 page
read the Homework Guidelines for Stat220 page

R vs. Python for data science

“R is written by statisticians, for statisticians,” — Norm Matloff, Author of The Art of R Programming, Prof. of Computer Science, UC Davis

Advantages of R over Python:

A gentler learning curve than Python
R has many generic functions that are universal, e.g. print(), plot(), summary()
The Comprehensive R Archive Network (CRAN) offers many user-friendly packages
R’s basic help() and example() functions are much more informative than Python’s counterparts
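
To make the point about generic functions concrete, here is a minimal sketch using only base R and the built-in `cars` dataset. The object names (`x`, `f`, `fit`, and the `*_summary` variables) are illustrative choices, not part of the course materials. The same call, `summary()`, adapts to whatever kind of object it receives.

```r
# Three different kinds of objects, all from base R
x   <- c(2.5, 3.1, 4.8, 5.0)              # numeric vector
f   <- factor(c("a", "b", "a", "c"))      # factor (categorical variable)
fit <- lm(dist ~ speed, data = cars)      # fitted linear model

# summary() is generic: it picks a method based on the class of its argument
num_summary <- summary(x)    # five-number summary plus the mean
fac_summary <- summary(f)    # counts per level
fit_summary <- summary(fit)  # coefficient table, R-squared, and more

# print() is generic too, so each result displays in its own format
print(num_summary)
print(fac_summary)
print(fit_summary)

# Look up documentation for a generic from the console
# help(summary)
```

`plot()` follows the same pattern: `plot(x)`, `plot(f)`, and `plot(fit)` each draw a different kind of figure.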

Using R Markdown for Data Science

In this class, we will use R Markdown for all our work, leveraging its comprehensive support for data science projects.
An R Markdown document (.Rmd) integrates:
- R code for dynamic analysis and visualization.
- Descriptive text
Compiling (rendering) a .Rmd file produces various output formats:
- Documents: PDF, HTML, Word.
- Presentations: HTML-based slides, PDF slides via Beamer.
- Interactive content: Web dashboards, interactive visualizations.
R Markdown is engineered with reproducibility as a core principle, facilitating transparent and repeatable research workflows.
The presentations for this class are made using Quarto Markdown
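
As a rough illustration of how an .Rmd file mixes text with R code and is then rendered, the sketch below writes a tiny document to a temporary folder and compiles it to HTML. It assumes the `rmarkdown` package and pandoc are installed (both come with RStudio). The file name `demo-report.Rmd` and its contents are made up for this example. In class you would normally just click Knit in RStudio.

```r
# Three backticks, built programmatically so they can sit inside this example
fence <- strrep("`", 3)

# Contents of a minimal, hypothetical .Rmd file: YAML header, text, one code chunk
rmd_lines <- c(
  "---",
  "title: \"Hypothetical demo report\"",
  "output: html_document",
  "---",
  "",
  "Descriptive text: the cars data has `r nrow(cars)` rows.",
  "",
  paste0(fence, "{r speed-plot}"),
  "plot(cars)",
  fence
)

rmd_file <- file.path(tempdir(), "demo-report.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only if the required tools are available on this machine
out_file <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
}
out_file  # path to the HTML report, or NULL if rendering was skipped
```

Changing `output_format` to `"pdf_document"` or `"word_document"` would target the other document formats listed above, provided the corresponding tools (such as a LaTeX installation for PDF) are present.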

Version Control using Git and GitHub

User: A GitHub account for you (e.g., deepbas).
Organization: A GitHub account shared by one or more users (e.g., DataScienceSpring24).
Repository: A folder within the organization that includes files dedicated to a project.
Local GitHub: Copies of GitHub files located on your computer.
Remote GitHub: GitHub files located on the https://github.com website.
Clone: Process of making a local copy of a remote GitHub repository.
Pull: Copy changes from the remote GitHub repository to your local GitHub repository.
Push: Save local changes to the remote GitHub repository.

Using GitHub and Posit for data science

Integrate R Markdown with GitHub for version control:
- Create a GitHub repository.
- Clone the repo and set up an R Markdown project in RStudio.
- Work with various files (.Rmd, .r, .csv, etc.).
- Commit changes locally and push to GitHub.
- Pull updates from others into your local workspace.
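
The clone, commit, push, and pull steps above can be tried safely without touching GitHub. The sketch below is an assumption-laden stand-in: a bare repository in a temporary folder plays the role of the remote, and two clones play the roles of you and a collaborator. It assumes `git` is installed and on your PATH. The helper `run_git()`, the folder names, and the demo identity are hypothetical. In practice you would use RStudio's Git pane or type the same git commands in a terminal.

```r
# Hypothetical helper: run `git <args>` inside `dir`; stop if git reports an error
run_git <- function(dir, args) {
  status <- system2("git", c("-C", shQuote(dir), args), stdout = FALSE, stderr = FALSE)
  if (status != 0) stop("git ", paste(args, collapse = " "), " failed")
  invisible(status)
}

# Demo identity so commits work even on a machine with no git configuration
ident <- c("-c", "user.name=Demo", "-c", "user.email=demo@example.com")

root   <- tempfile("git-demo-")
remote <- file.path(root, "remote.git")  # stands in for the repository on github.com
mine   <- file.path(root, "mine")        # your local copy
theirs <- file.path(root, "theirs")      # a collaborator's local copy
dir.create(remote, recursive = TRUE)
run_git(remote, c("init", "--bare", "--quiet"))

# Clone: make a local copy of the remote repository
run_git(root, c("clone", "--quiet", shQuote(remote), shQuote(mine)))

# Commit locally, then push the commit to the remote
writeLines("x <- 1", file.path(mine, "analysis.R"))
run_git(mine, c("add", "analysis.R"))
run_git(mine, c(ident, "commit", "--quiet", "-m", shQuote("Add analysis script")))
run_git(mine, c("push", "--quiet", "origin", "HEAD"))

# The collaborator clones the remote and receives the first version
run_git(root, c("clone", "--quiet", shQuote(remote), shQuote(theirs)))

# You make and push a second change
writeLines(c("x <- 1", "y <- x + 1"), file.path(mine, "analysis.R"))
run_git(mine, c(ident, "commit", "--quiet", "-am", shQuote("Add y")))
run_git(mine, c("push", "--quiet", "origin", "HEAD"))

# Pull: the collaborator copies the new change into their local repository
run_git(theirs, c("pull", "--quiet", "--ff-only"))
pulled <- readLines(file.path(theirs, "analysis.R"))
print(pulled)
```

After the pull, the collaborator's copy of `analysis.R` matches the version you pushed.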

R Markdown enhances the workflow by seamlessly integrating executable code with narrative text, making your data science projects reproducible and collaborative.

Glimpse into the course


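library(babynames)  # US baby-name counts by year and sex
library(dplyr)      # provides %>% and filter()
library(ggplot2)    # plotting

your_name <- "Dee"
# Keep only the rows for the chosen name
your_name_data <- babynames %>% filter(name == your_name)

# Plot the proportion of babies given this name each year, one line per sex
# (linewidth replaces size for lines in ggplot2 >= 3.4.0)
ggplot(data = your_name_data, aes(x = year, y = prop)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_line(aes(colour = sex), linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "Year",
       y = stringr::str_c("Prop. of Babies Named ", your_name),
       title = stringr::str_c("Trends in Names: ", your_name))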

Make a course folder called ‘stat220’ either on your Maize account or on your local computer.
Please download the Class-Activity-1 template from Moodle and go to the class helper web page.
