ECON 413
Introduction to Data Science
Erol Taymaz
Department of Economics
Middle East Technical University
Topics
- What is “Data Science”?
- Why is data science important?
- The data science process
- What is R?
What is “Data Science”?
Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured …
Data science is a “concept to unify statistics, data analysis and their related methods” in order to “understand and analyze actual phenomena” with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization. (Wikipedia/Data Science)
What is “data”?
- Everything we can get information from
- Everything that can be stored in a computer memory!
What is “data”?
- Penn World Tables [tabular data, relational data]
- 5-Year Development Plan
- www.sahibinden.com
- Map of Turkey
Why is data science important?
An explosion of data and data sources
Everybody and everything generates data all the time
- Walking around with a mobile phone
- Using your credit card
- Browsing the Internet
- Buying a certain product
Easy access to the data (open data sources, Wikipedia!)
Ability to extract value from the data - not just our own data, but all of the available data
Ability to use the tools necessary to collect, analyze and present the data
Why is data science important?
The 50 Best Jobs in America in 2021
- Java Developer
- Data Scientist
- Product Manager
- Enterprise Architect
- DevOps Engineer
- Information Security Engineer
- Business Development Manager
- Mobile Engineer
- Software Engineer
- Dentist
Source: Glassdoor/50 Best Jobs in America
Data Science Process
What is ECON 413?
- An introduction to data science
- An introduction to the main tools and ideas in the data scientist’s toolbox.
- How to collect, retrieve, scrape, check, clean, analyze, understand, visualize, and present data
- An introduction to R programming
Textbooks
- Venables, W. N., Smith, D. M. and the R Core Team (2015), An Introduction to R, R Core Team.
- Grolemund, Garrett, and Wickham, Hadley (2017), R for Data Science, O’Reilly.
- Wickham, Hadley (2014), Advanced R, Chapman & Hall/CRC.
- Grolemund, Garrett (2014), Hands-On Programming with R, O’Reilly.
- Peng, Roger D. (2015), R Programming for Data Science, Leanpub.
- Hanck, C., Arnold, M., Gerber, A. and Schmelzer, M. (2020), Introduction to Econometrics with R.
- Wilke, Claus O. (2021), Fundamentals of Data Visualization.
- Healy, Kieran (2018), Data Visualization: A Practical Introduction.
Topics
Part 1. Basics
Introduction
Data types and data objects
Algorithms, loops, functions
Basic functions
Data manipulation with data.table
Data visualization and ggplot2
Factors, lists, functionals
Part 2. Applications
Reproducible and interactive research (Rmarkdown and Shiny packages)
Web scraping and text analysis
Regression analysis
Maps
Animations and simulations
R best practice
Review
Lectures
- Monday (09:40-11:30), online
- Friday (09:40-11:30), computer lab
- Group 1: A-HA
- Group 2: HE-Z
Please review the presentation slides and try the examples before each lecture.
Grading
The course consists of lectures, quizzes, homework assignments and a project.
Course grades will be based on 5 quizzes (8 pts each), 5 homework assignments (5 pts each), 1 project (25 pts), and classroom participation (10 pts). There will be no make-ups.
Quizzes will be given sequentially. To take the next quiz, you have to pass the previous one (at least 6 points out of 8). A given quiz can be taken at most 4 times, and you can take at most 10 quizzes (including re-attempts) in total. If you pass a quiz on the second or a later attempt, you will get 6 points for that quiz.
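As a quick check of the arithmetic, the point scheme above can be written out in R. This is only a sketch: the names max_points, total_points and quiz_credit are made up for illustration, and the rules do not state what a failed quiz earns, so the helper covers passed quizzes only.

```r
# Maximum points per component, as stated in the grading rules
max_points <- c(
  quizzes       = 5 * 8,   # 5 quizzes, 8 points each
  homework      = 5 * 5,   # 5 homework assignments, 5 points each
  project       = 25,
  participation = 10
)
total_points <- sum(max_points)
print(total_points)

# Hypothetical helper: credit for a PASSED quiz (score between 6 and 8).
# A pass on the first attempt keeps its score; a later pass is worth 6.
quiz_credit <- function(score, attempt) {
  stopifnot(score >= 6, score <= 8, attempt >= 1, attempt <= 4)
  if (attempt == 1) score else 6
}

quiz_credit(8, attempt = 1)   # first-attempt pass keeps the full score
quiz_credit(8, attempt = 3)   # later pass is capped at 6
```

The four components add up to 100 points in total.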
Classroom participation will be based on your responses to i) questions during lectures, ii) “Short Questions” at ODTUClass, and iii) other students’ questions on the “Forum” page.
Each project team will consist of 4 students. Projects must be submitted by midnight on February 9, 2022, and will be presented on February 10, 2022.
DataCamp
“This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.”
I will register your name at DataCamp. If you do not want to use it, please inform me by e-mail.
What is R?
- A programming language (interpreted)
- A statistics package
- An environment for statistical computing and graphics
- Developed by a community
- Extended with ‘packages’ that contain data, code, and documentation
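A short session illustrates these roles. It is a minimal sketch that uses only base R and the built-in cars data set; the object names x and fit are arbitrary.

```r
# R as a programming language: create a vector and compute with it
x <- c(2, 4, 6, 8)
mean(x)

# R as a statistics package: fit a linear regression on a built-in data set
# (cars ships with base R in the datasets package)
fit <- lm(dist ~ speed, data = cars)
coef(fit)

# R as a graphics environment: scatter plot with the fitted line
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
abline(fit)
```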
What is R?
- R is a flavor of the S computer language
- S was developed by John Chambers at Bell Labs in the late 1970s
- “[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.” (John Chambers)
- 1991 R is created by Ross Ihaka and Robert Gentleman
- 1993 R is made public
- 1995 R becomes Open Source (GNU General Public License)
- 1997 R Core Group is formed
- 2000 Version 1.0.0 ships
Why R?
Why not Stata, or SPSS, or …?
- A flexible programming language
- Open source and free (for philosophical or practical reasons)
  - Free to study the code
  - Free to redistribute it
  - Free to modify it
  - Free to redistribute the modified version
  - Free of charge, too
- Platform independent (Linux, Mac, Windows, desktops, servers)
- Easy to share with others
- Suitable for almost all scientific disciplines
- Extensive, diverse and growing community
- Ever-increasing number of packages
- Easy interactions with other programs (web, presentation, databases, big data, etc.)
- Part of the open source toolchain of research (from data analysis to reporting, for the web or a thesis)
Why R?
- Advantages
  - Free of charge, easy to install
  - Strong community support
  - Up-to-date
  - Can complement / can be complemented by other programs
  - Requires user knowledge: users have to think about what they are doing
- Drawbacks
  - Steep learning curve
  - Not user-friendly
  - All objects are stored in the computer memory
  - Slower than compiled languages
  - Easy to make mistakes, difficult to find the sources of mistakes
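The memory drawback can be made concrete with base R's object.size(). This is a small sketch; the object name x and the vector length are arbitrary choices.

```r
# A numeric vector of one million values lives entirely in memory
x <- numeric(1e6)                         # doubles take 8 bytes each
size_bytes <- as.numeric(object.size(x))  # size of the object in bytes
print(object.size(x), units = "MB")

# Removing an object lets R reclaim the memory it occupied
rm(x)
```

Data sets that do not fit in memory therefore need a different strategy, such as reading only the columns or rows that are needed.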
Why R?
- Data Science and Big Data
“The R statistical programming language has shown consistent growth, as has pandas, a popular library for data science in Python. The closed source MATLAB language was growing for most of the lifetime of the site, but has more recently leveled off and may be shrinking. TensorFlow, Google’s open-source machine learning framework, was introduced only in late 2015, but it’s been growing at an extraordinary pace.”
Source: David Robinson, Introducing Stack Overflow Trends, May 9, 2017
Why R?
- Number of R packages available on its main distribution site (CRAN, the Comprehensive R Archive Network)
1-year forecasts for $/TL
Assume that we need to forecast the $/TL exchange rate and report the results every week
- Download the data from the CBRT web site
- Prepare the data file
- Analyze the data
- Prepare the chart
- Make a presentation file
- Write the report
- Reproduce it
Data science process - the usual one
Assume that we need to forecast the $/TL exchange rate and report the results every week
- Download the data from the CBRT web site [web]
- Prepare the data file [Excel]
- Analyze the data [EViews, Stata, etc.]
- Prepare the chart [Excel]
- Make a presentation [PowerPoint]
- Write the report [Word]
- Reproduce it [do it again]
Data science process with R
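Only 8 lines of code
library(CBRT)      # access to the CBRT data service
library(forecast)  # auto.arima(), forecast(), autoplot()
library(ggplot2)   # theme_bw() is not exported by forecast
# Download the weekly USD exchange rate series from 2015 onwards
myData <- getDataSeries("TP.DK.USD.A.EF", start = "2015-01-01", freq = 3)
# Weekly time series; the start is set to match the start of the download
usd <- ts(myData$TP.DK.USD.A.EF, frequency = 52, start = c(2015, 1))
musd <- auto.arima(usd)          # select and fit an ARIMA model
fusd <- forecast(musd, h = 52)   # forecast 52 weeks (one year) ahead
autoplot(fusd) + theme_bw()      # plot the forecast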
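The code above needs access to the CBRT web service. To try the same fit, forecast and plot steps without it, the sketch below uses a simulated weekly series and base R only. Everything in it is an assumption made for illustration: the series fake_rate is random and is not real exchange rate data, and a fixed ARIMA(1,1,0) model from stats::arima() stands in for the automatic model selection done by auto.arima().

```r
set.seed(413)

# Simulated weekly series standing in for the exchange rate (NOT CBRT data):
# a random walk with a small upward drift, three years of 52 weeks
fake_rate <- ts(7 + cumsum(rnorm(156, mean = 0.01, sd = 0.05)),
                frequency = 52, start = c(2019, 1))

# Fit a fixed ARIMA(1,1,0) model (auto.arima would choose the order itself)
fit <- arima(fake_rate, order = c(1, 1, 0))

# Forecast 52 weeks ahead; predict() returns point forecasts and standard errors
fc <- predict(fit, n.ahead = 52)
upper <- fc$pred + 1.96 * fc$se   # approximate 95% band
lower <- fc$pred - 1.96 * fc$se

# Plot the history, the forecast and the band
plot(fake_rate, xlim = c(2019, 2023), ylim = range(fake_rate, upper, lower),
     ylab = "Simulated rate")
lines(fc$pred, col = "blue")
lines(upper, lty = 2)
lines(lower, lty = 2)
```

Because the whole workflow is a script, reproducing the weekly report means running the same file again on updated data.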
To do this week
- Install R
- Install RStudio
- Register at CBRT’s Electronic Data Delivery System in order to get access to the CBRT web service. Note that registration is free and open to the public.
- Install the following packages:
- CBRT, data.table, ggplot2, ggthemes, pwt9, rmarkdown, WDI, DT, forecast, foreign, GGally, haven, leaflet, leaflet.extras, mice, networkD3, plm, plotly, r2d3, rgdal, rvest, sf, shiny, shinyWidgets, stargazer, tesseract, tidyverse
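One way to install the packages in a single step is sketched below. It assumes that all of them are available from the repository configured in your R installation, which may not hold for every package at every point in time. The names course_packages, missing_packages and to_install are made up for this example.

```r
# Packages from the list above (names are case-sensitive, e.g. GGally)
course_packages <- c(
  "CBRT", "data.table", "ggplot2", "ggthemes", "pwt9", "rmarkdown", "WDI",
  "DT", "forecast", "foreign", "GGally", "haven", "leaflet", "leaflet.extras",
  "mice", "networkD3", "plm", "plotly", "r2d3", "rgdal", "rvest", "sf",
  "shiny", "shinyWidgets", "stargazer", "tesseract", "tidyverse"
)

# Hypothetical helper: which of the wanted packages are not installed yet?
missing_packages <- function(wanted,
                             installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

to_install <- missing_packages(course_packages)
print(to_install)

# Uncomment to install whatever is still missing
# if (length(to_install) > 0) install.packages(to_install)
```

Running the script a second time installs nothing new, so it is safe to rerun after a failed download.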
File organization
- Organize files in directories
  - /econ413
  - /econ413/R files
  - /econ413/Data
  - /econ413/Raw data
  - /econ413/img
- Use consistent file names
  - 01 Project web data collect.R
  - 02 Project WDI data collect.R
  - 10 Project descriptives.R
  - 20 Project regressions.R
- R code (R scripts) will be saved in /econ413/R files
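The directory layout above can be created from R itself. In this sketch the project is placed under a temporary directory so that it can be run safely anywhere; that location, and the names project_root and sub_dirs, are assumptions for the example. Replace tempdir() with the folder where you keep your course work.

```r
# Root of the project (assumption: a temporary directory for the demo)
project_root <- file.path(tempdir(), "econ413")

# Sub-directories from the layout above
sub_dirs <- c("R files", "Data", "Raw data", "img")

# Create each directory; recursive = TRUE also creates the root if needed
for (d in sub_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```

Building paths with file.path() instead of pasting strings keeps the scripts portable across Linux, Mac and Windows.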
Using RStudio
Make errors!
You will make errors frequently when you start using R. Do not get frustrated by error messages: making errors is an essential part of the learning process. Try to fix them by yourself first.
Whenever you get an error message in R, check the code first. Most errors are due to missing parentheses and commas.
If you cannot solve the problem in a reasonable time, submit a question on the Forum page of ODTUClass. When you submit your question, please include the error message and enough information to reproduce the error.
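The sketch below shows a typical error caused by a missing parenthesis, and how to capture the message and the session details that are useful in a forum post. The object names bad_code, msg and fixed are made up for the example; the code is kept in a string only so that the script as a whole still runs.

```r
# A line with a missing closing parenthesis, stored as text
bad_code <- "mean(c(1, 2, 3)"

# Try to run it and capture the error message instead of stopping
msg <- tryCatch(
  {
    eval(parse(text = bad_code))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)   # copy this message into your question

# The corrected line runs without an error
fixed <- eval(parse(text = "mean(c(1, 2, 3))"))
print(fixed)

# R version, platform and loaded packages help others reproduce the error
print(sessionInfo())
```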
Try to solve the problems/errors posted on the Forum page, and share your solutions with others. This is one of the best methods to learn R programming.
Search for the error on Google (just copy the error message into the search bar) and try to find an answer. Among the search results, check Stack Overflow first.