ECON 413
Introduction to Data Science

Erol Taymaz
Department of Economics
Middle East Technical University

Topics

What is “Data Science”?

Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured …

Data science is a “concept to unify statistics, data analysis and their related methods” in order to “understand and analyze actual phenomena” with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization. (Wikipedia/Data Science)

What is “data”?

Why is data science important?

The 50 Best Jobs in America in 2021

  1. Java Developer
  2. Data Scientist
  3. Product Manager
  4. Enterprise Architect
  5. DevOps Engineer
  6. Information Security Engineer
  7. Business Development Manager
  8. Mobile Engineer
  9. Software Engineer
  10. Dentist

Source: Glassdoor/50 Best Jobs in America

Data Science Process

What is ECON 413?

Textbooks

Topics

Part 1. Basics

  1. Introduction

  2. Data types and data objects

  3. Algorithms, loops, functions

  4. Basic functions

  5. Data manipulation with data.table

  6. Data visualization and ggplot2

  7. Factors, lists, functionals

Part 2. Applications

  1. Reproducible and interactive research (Rmarkdown and Shiny packages)

  2. Web scraping and text analysis

  3. Regression analysis

  4. Maps

  5. Animations and simulations

  6. R best practice

  7. Review

Lectures

Please review the presentation slides and try the examples before each lecture.

Grading

The course consists of lectures, quizzes, homework assignments and a project.

Course grades will be based on 5 quizzes (8 pts each), 5 homework assignments (5 pts each), 1 project (25 pts), and classroom participation (10 pts), for a total of 100 points. There will be no make-ups.

Quizzes will be given sequentially. To take the next quiz, you have to pass the previous one (at least 6 points out of 8). A given quiz can be taken at most 4 times, and you can take at most 10 quizzes in total (including re-attempts). If you pass a quiz on the second or a later attempt, you will get 6 points for that quiz.
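The quiz scoring rule above can be sketched in R as follows. This is only an illustration: the function name `quiz_points` and its arguments are hypothetical, and because the rules do not state what a quiz that is never passed earns, the sketch returns `NA` in that case.

```r
# Hypothetical helper: points earned for one quiz, given the scores
# (out of 8) of the successive attempts at that quiz.
quiz_points <- function(attempts, pass_mark = 6, max_attempts = 4) {
  # A given quiz can be taken at most 4 times
  stopifnot(length(attempts) >= 1, length(attempts) <= max_attempts)
  passed <- which(attempts >= pass_mark)
  # Not specified in the course rules: a quiz that is never passed
  if (length(passed) == 0) return(NA_real_)
  # Passing on the first attempt keeps the actual score;
  # passing on a later attempt is worth 6 points
  if (passed[1] == 1) attempts[1] else pass_mark
}

# Maximum course total: quizzes + homework + project + participation
max_total <- 5 * 8 + 5 * 5 + 25 + 10

quiz_points(8)        # passed on the first attempt
quiz_points(c(4, 7))  # passed on the second attempt
max_total
```

With these inputs, the first call gives 8, the second gives 6, and `max_total` is 100.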

Classroom participation will be based on your responses to i) questions during lectures, ii) “Short Questions” at ODTUClass, and iii) other students’ questions on the “Forum” page.

Each project team will consist of 4 students. Projects must be submitted by midnight on February 9, 2022, and will be presented on February 10, 2022.

DataCamp

“This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.”

I will register your name at DataCamp. If you do not want to use it, please inform me by e-mail.

What is R?

Why R?

Why not Stata, or SPSS, or …?

1-year forecasts for USD/TL

Assume that we need to forecast the $/TL exchange rate and report the results every week.

Data science process - the usual one

Assume that we need to forecast the $/TL exchange rate and report the results every week.

Data science process with R

Only 7 lines of code

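# Note: downloading data with the CBRT package may require a personal API key
library(CBRT)
library(forecast)

# Download the USD exchange rate series from 2015 onwards
# (freq = 3 is assumed to request weekly data, matching frequency = 52 below)
myData <- getDataSeries("TP.DK.USD.A.EF", start = "2015-01-01", freq = 3)
# Convert to a weekly time series; the start year matches the requested start date
usd <- ts(myData$TP.DK.USD.A.EF, frequency = 52, start = c(2015, 1))
# Select and fit an ARIMA model automatically
musd <- auto.arima(usd)
# Forecast 52 weeks (1 year) ahead
fusd <- forecast(musd, h = 52)
# Plot the forecast; theme_bw() comes from ggplot2, which forecast does not attach
autoplot(fusd) + ggplot2::theme_bw()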
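The same steps (build a weekly time series, fit a model, forecast 52 weeks ahead) can be tried without downloading anything. The sketch below is an illustration only: it uses a simulated series, not real exchange rate data, and it replaces `auto.arima` with a fixed ARIMA(0,1,1) model fitted by the base R function `arima`, so its numbers have no economic meaning.

```r
set.seed(413)

# Simulated weekly series (random walk with drift); NOT real $/TL data
n_weeks <- 104
fake_rate <- ts(10 + cumsum(rnorm(n_weeks, mean = 0.02, sd = 0.1)),
                frequency = 52, start = c(2020, 1))

# Fit a fixed ARIMA(0,1,1) model with base R
# (auto.arima from the forecast package would choose the order itself)
fit <- arima(fake_rate, order = c(0, 1, 1))

# Forecast 52 weeks ahead; predict() returns point forecasts ($pred)
# and their standard errors ($se)
fc <- predict(fit, n.ahead = 52)

length(fc$pred)   # number of forecasted weeks
head(fc$pred, 3)  # first three point forecasts
```

Because only base R is used, this version runs on a fresh installation, which makes it a convenient way to experiment before setting up the packages used on the slide.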

To do this week

File organization

Using RStudio

Make errors!

You will make errors frequently when you start using R. Do not get frustrated by error messages: they are an essential part of the learning process. Try to fix the errors yourself first.

Whenever you get an error message in R, check your code first. Most of these errors will be due to missing parentheses or commas.
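As a small illustration of such errors, the sketch below checks whether a piece of code is syntactically valid by passing it to `parse()`. The helper name `parses_ok` is hypothetical and introduced here only for the example.

```r
bad_paren <- "mean(c(1, 2, 3)"    # closing parenthesis is missing
bad_comma <- "c(1 2)"             # comma is missing
good_code <- "mean(c(1, 2, 3))"   # correct version

# Hypothetical helper: TRUE if the text is valid R syntax, FALSE otherwise
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses_ok(bad_paren)
parses_ok(bad_comma)
parses_ok(good_code)

# Once the syntax is fixed, the code can be evaluated
eval(parse(text = good_code))
```

The first two calls return `FALSE`, the third returns `TRUE`, and the final line evaluates to 2.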

If you cannot solve the problem in a reasonable time, post a question on the Forum page of ODTUClass. Include the error message and enough information to reproduce the error.
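One possible way to collect the information for such a post is sketched below. The data frame `df` and the failing call are made up for illustration; replace them with the smallest piece of your own code and data that still produces the error.

```r
# Small made-up data set that triggers the problem
df <- data.frame(x = c("1", "2", "three"), stringsAsFactors = FALSE)

# Capture the error message instead of stopping,
# so that it can be pasted into the forum post
msg <- tryCatch(
  sum(df$x),   # fails because x is a character column
  error = function(e) conditionMessage(e)
)
msg

# Print R code that recreates df exactly, so others can copy it
dput(df)

# Report the R version in use
R.version.string
```

Posting the output of `dput()` together with the failing line and the error message lets other students run exactly the same code on their own computers.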

Try to solve the problems/errors posted on the Forum page, and share your solutions with others. This is one of the best methods to learn R programming.

Search for the error on Google (just copy the error message into the search bar) and look for an answer. Among the search results, check the Stack Overflow pages first.





Good luck!