Data Science

Data Science is a HUGE field, and it has been called one of the highest-paying jobs for the coming years. It’s a long journey, but here’s a way to get started with it!

Please Read First
An Introduction
Technologies
General
Getting Started
People To Follow
FAQ
Learning Materials
References

Please Read First

This guide presents Data Science as a full-stack field: not limited to building models, but also exploring market possibilities and the tools that companies, from the Fortune 500 to small businesses, use to make business decisions in line with the agile motto of getting to market quickly. Use this guide not only to get good at building models for data, but also to think deeply about how you handle data in any case, and about the different ways you can incorporate data science to make some easy predictions!

An Introduction

Data Science refers to the statistical analysis of data for patterns and trends, both to make business decisions and to teach computers to recognize that data and give definitive answers. It’s a huge field that can be used to explore questions and interesting ideas like:

How do I teach a computer to recognize an image?
How can I train a computer bot to talk like a human?
How can I train a computer to accurately distinguish between two datasets?

Of course, it’s not immediately obvious how we can do all this, but what we can do is break this huge field into several components and see how each of them helps.

Machine Learning, a field that builds statistical models to answer questions with some measurable accuracy, such as predicting a value ( regression ) or classifying data into yes or no ( classification ).
Deep Learning, popularized by Neural Networks, a way to identify and extract features from given data and use them to provide more accurate insights into problems of Speech ( Recurrent Neural Networks, NLP ) or Images ( OCR, Convolutional Neural Networks ).
Data Analysis, a sub-field of Data Science that provides a visual understanding of the data at hand and shows the trends in it.
Data Engineering, a way to develop data pipelines, maintain proper data sets and schemas, and understand who has access to what data and how. It is responsible for keeping a check on the inflow and outflow of data, and for providing insights into the data streaming in and out.
Artificial Intelligence, the grass roots that motivated Data Science to move away from Statistics And Probability into a complete field of its own. It’s a way of thinking that presents the main theoretical models that inspired the idea of teaching computers to think and act like humans.
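To make the regression versus classification distinction from the Machine Learning item concrete, here is a minimal sketch in plain Python. The data (hours studied against exam score) and the pass mark of 60 are invented for illustration, and the "classifier" simply thresholds the regression output; a real project would more likely fit a dedicated classifier such as logistic regression.

```python
# Toy data, invented for illustration: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 61.0, 68.0, 70.0]


def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_score(x, slope, intercept):
    """Regression: predict a continuous value."""
    return slope * x + intercept


def predict_pass(x, slope, intercept, pass_mark=60.0):
    """Classification: turn the prediction into a yes/no answer.

    The pass mark is an assumption made up for this example.
    """
    return predict_score(x, slope, intercept) >= pass_mark


slope, intercept = fit_line(hours, scores)
print(predict_score(6.0, slope, intercept))  # a number (regression)
print(predict_pass(2.0, slope, intercept))   # a yes/no label (classification)
```

The same fitted line answers both kinds of question; only the shape of the answer changes.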

Technologies

There are a lot of technologies for each and every sub-field of data science. Language-wise, you can use:

SQL
R
Python

These let you extract patterns and the required data. Libraries? R has a ton of them for statistical measures. Python has them, too. Some common ones from Python are:

TensorFlow
scikit-learn
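As a small illustration of how SQL and Python can work together to extract the required data, the sketch below uses Python's built-in sqlite3 module. The sales table and its rows are made up for the example; a real project would connect to an existing database instead.

```python
import sqlite3

# In-memory database with made-up sales rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)],
)

# SQL does the extraction and aggregation...
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

# ...and Python takes over for any further analysis.
totals = dict(rows)
best_region = max(totals, key=totals.get)
print(totals, best_region)
```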

General

Getting Started

Domingos, Pedro. “A few useful things to know about machine learning.” Communications of the ACM 55, no. 10 (2012): 78-87
Shewchuk, Jonathan Richard. “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.” 1994
To understand cost functions better: An Introduction To Understanding Cost Functions
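For a feel of what a cost function does before reading further, here is a minimal sketch of one common choice, mean squared error, minimized with plain gradient descent. The one-parameter model y = w * x, the toy data, the learning rate, and the step count are all assumptions chosen to keep the example short.

```python
def mse_cost(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the MSE cost with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, steps=200):
    """Repeatedly step against the gradient to reduce the cost."""
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


# Toy data generated from y = 2 * x, so the best w is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = gradient_descent(xs, ys)
print(w, mse_cost(w, xs, ys))
```

The cost measures how wrong the current w is, and the gradient tells the optimizer which direction reduces that error.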

People To Follow

The following people are big names in Machine Learning and Deep Learning. You can follow them on Twitter and Quora. It gives a peek into their lives and how they think, and sometimes their discussions are a great source of learning too.

Geoffrey Hinton
Yoshua Bengio
Yann LeCun
Andrew Ng
Christopher Manning
Andrej Karpathy
Ian Goodfellow
Fei-Fei Li
Nando de Freitas
Jeremy Howard
Rachel Thomas
Pieter Abbeel
Richard Socher
Francois Chollet
Soumith Chintala

List courtesy of FAST’s leading AI society, AIMLC, under ACM

FAQ

So what’s the difference between Data Science and Data Engineering?
Skills or certificates?
What else should I look into? Is Data Science a complete field in and of itself?
Who should I follow for Data Science?
What’s Data Mining?!
Should I be really good at Maths to get started with anything at all?
I’m not a very good coder. What can I do?
I’m not a research kind of person. What can I do?
I’m just a student! What can I do?

Learning Materials

Books To Look Into

Building Machine Learning Pipelines by Hannes Hapke & Catherine Nelson - August 2020
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Hands On Machine Learning With Scikit Learn, Keras, And Tensorflow – O’Reilly, Aurelien Geron
Pattern Recognition And Machine Learning, Christopher M. Bishop
An Introduction to Statistical Learning – Trevor Hastie
Machine Learning: A Probabilistic Perspective – Kevin P. Murphy
Elements of Statistical Learning – Trevor Hastie

Misc Books

Automate The Boring Stuff With Python, Practical Programming For Total Beginners, Al Sweigart

Competitions

Kaggle is love! <3

Blogging

If you like writing blog posts about your data science findings, try out these:

With FastPages ( Jupyter Notebook )
With Github Pages, see this, and here
Medium

Research Purposes

Arxiv Sanity
Made With ML
Papers With Code
Kaggle / Dataquest / Google / Open Images / Big Bad NLP / Awesome Data Labelling Datasets
IEEE papers

Artificial Intelligence Concept List

CS50’s Artificial Intelligence With Python
- Graph search, Heuristic search, Adversarial search, Alpha-Beta Pruning
- Knowledge Representation, Propositional Logic, Inference, Resolution, First-Order Logic
- Probability, Random Variables, Probabilistic Inference, Bayesian Networks, Markov Networks
- Local Search, Hill Climbing, Constraint Satisfaction, Backtracking Search
- Classification, Regression, Support Vector Machines, Reinforcement Learning, Clustering
- Feed-Forward Networks, Back-Propagation, Convolutional Networks, Recurrent Networks
- Context-Free Grammar, N-Gram Models, Naïve Bayes, tf-idf, word2vec
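As a taste of one concept from this list, adversarial search with Alpha-Beta Pruning, here is a minimal sketch. The game tree is a hypothetical one represented as nested Python lists, where numbers are leaf scores; this is an illustration of the idea, not code taken from the CS50 course.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax value of a game tree, skipping branches that cannot matter.

    A node is either a number (leaf score) or a list of child nodes.
    """
    if not isinstance(node, list):
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the minimizing player will never allow this branch
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return value


# Hypothetical two-level tree: the maximizer picks a branch, the minimizer picks a leaf.
tree = [[3, 5], [6, 9], [1, 2]]
best = alphabeta(tree, maximizing=True)
print(best)
```

In the last branch the search stops after seeing the leaf 1, because the maximizer already has a guaranteed 6 from the middle branch.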

Full Stack Data Science

Machine Learning + Deep Learning Topics

Methods
- Dimensionality Reduction
- Ensemble Learning
- Meta Learning
- Reinforcement Learning
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Deep Learning
- Anomaly Detection
- PAC Learning
- Regression
- Statistical Learning
- Structured Prediction
- Feature Learning
- Feature Engineering
- Bias-Variance Dilemma
- Association Rules
Pretrained Models ( for transfer learning )
- Tensorflow Hub
- PyTorch Hub
- HuggingFace Transformers
- Detectron 2
Experiment Tracking
- TensorBoard, track and visualize metrics, view model graphs, look at images, text and audio data, integrated with TensorFlow and PyTorch
- Dashboard by Weights & Biases
- Neptune.ai
Data And Model Tracking
- Artifacts by Weights & Biases
- DVC ( Data Version Control )
Cloud Computing Services
- AWS – Sagemaker
- GCP – AI Platform
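Of the methods listed above, Unsupervised Learning is one of the quickest to sketch. Below is a minimal k-means (Lloyd's algorithm) on one-dimensional data in plain Python. The points and the starting centroids are invented for the example, and a library implementation would normally be used in practice.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids.

    Returns the final centroids and the cluster assignments
    computed in the last iteration.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up data with two obvious groups and no labels.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are given; the grouping comes only from the structure of the data.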

( P.S, Maybe invest in hardware? )

Machine Learning frameworks and libraries in C++

This section is courtesy of @Andrew Ng!

mlpack: a scalable C++ machine learning library.
SHARK: a fast, modular, feature-rich open-source C++ machine learning library.
Dlib-ml: A Machine Learning Toolkit.
Waffles: A collection of command-line tools for researchers in machine learning, data mining, and related fields. All of the functionality is also provided in a clean C++ class library.
MLC++: a library of C++ classes for supervised machine learning.

Datasets

Mathematical Background

Linear Algebra
- Essence of Linear Algebra – 3blue1brown (YouTube)
- MIT 18.06 Linear Algebra – Gilbert Strang
- Coding the matrix – Brown University
Probability and Statistics
- MIT 6.041 Probabilistic Systems Analysis and Applied Probability – John Tsitsiklis
- Statistics 110 – Harvard University
- Statistics with R – Duke University (Coursera)
- Discovering Statistics using SPSS – Andy Field (Book)
Calculus
- Essence of Calculus – 3blue1brown (YouTube)
- MIT Highlights of Calculus – Gilbert Strang
- MIT 18.01 Single Variable Calculus – MIT OCW
- MIT 18.02 Multivariable Calculus – MIT OCW

Reinforcement Learning

Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
Dennybritz - Reinforcement Learning, https://github.com/dennybritz/reinforcement-learning
Incomplete Ideas – Reinforcement Learning, http://www.incompleteideas.net/book/RLbook2018.pdf
RL Course by David Silver, Lecture 1, https://www.youtube.com/watch?v=2pWv7GOvuf0
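Before diving into the resources above, it may help to see how small a reinforcement learning loop can be. The sketch below runs tabular Q-learning on a made-up five-state corridor where the only reward is for reaching the right-hand end. The environment, the hyperparameters, and the purely random behaviour policy are assumptions chosen for brevity, not something prescribed by any of the linked courses.

```python
import random

N_STATES = 5      # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Learn a table of action values Q[state][action]."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Q-learning is off-policy, so random exploration is enough here.
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
# Greedy policy for the non-goal states: pick the action with the higher value.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Even though the agent only ever acts randomly during training, the learned values favour moving right in every state, because the update bootstraps from the best next action rather than the action actually taken.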

Data Engineering / Analysis

Technology Stack
- Query Languages: SQL
- Programming Languages: Python, Go, R
- Data Warehousing: Amazon S3, Hive, HDFS, HBase
- Data Store: DynamoDB, Hadoop, MongoDB, PostgreSQL
- Data Infrastructure: AWS, GCP
Data Handling Frameworks
- Apache Spark + Apache Flink ( Stream Processing )
- Apache Beam + Apache Kafka ( Batch, Parallel Processing )
Linux: Cron Jobs, ETL
Data Visualization: Tableau
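To show what the ETL item above means in practice, here is a minimal extract-transform-load sketch using only Python's standard library. The CSV content, the column names, and the payments table are hypothetical; in a real pipeline the input would come from a file or API, the output would go to one of the data stores listed above, and a cron job could run the script on a schedule.

```python
import csv
import io
import sqlite3

# Stand-in for a raw export; the columns and values are made up.
RAW_CSV = """user_id,country,amount
1,pk,10.50
2,US,
3,us,7.25
"""


def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, then fix types and casing."""
    return [
        (int(r["user_id"]), r["country"].upper(), float(r["amount"]))
        for r in rows
        if r["amount"]
    ]


def load(rows, conn):
    """Load: write the cleaned rows into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one easy to test and swap out as the pipeline grows.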

Courses

Introduction To Machine Learning, Coursera, Andrew Ng
Deeplearning.Ai Specialization, Coursera, Andrew Ng
TensorFlow in Practice, Coursera, Laurence Moroney
Data Engineering With Google cloud Certificate, Coursera
The UNIX workbench, Coursera
Data Science: Statistics And Machine Learning Specialization, Coursera
Data Science: Foundations Using R Specialization, Coursera
DevOps Culture And Mindset
Machine Learning With Tensorflow On Google Cloud Platform
FAST AI MOOC, Part I and II
CS231n, a Stanford open course specifically for Computer Vision
CS224n, a Stanford open course specifically for Natural Language Processing
Mathematics for Machine Learning Specialization
Depth First Learning
CS224w ML with Graphs
CS229 Machine Learning - Stanford - This is the Stanford CS course on Machine Learning that Prof Ng has taught for a number of years. The material parallels the Coursera course, but covers some additional topics and goes into much more depth on the mathematics.
Cornell Virtual Workshop
Dive Into Machine Learning

Interesting Applications

Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. “Information credibility on Twitter.” In Proceedings of the 20th international conference on World wide web, pp. 675-684. ACM, 2011.
Norman, Kenneth A., Sean M. Polyn, Greg J. Detre, and James V. Haxby. “Beyond mind-reading: multi-voxel pattern analysis of fMRI data.” Trends in cognitive sciences 10, no. 9 (2006): 424-430.
Pereira, Francisco, Tom Mitchell, and Matthew Botvinick. “Machine learning classifiers and fMRI: a tutorial overview.” Neuroimage 45, no. 1 Suppl (2009): S199.

Youtube Channels

For Causal Inference, I’d highly recommend @mattmasten’s Causal Inference bootcamp. Over 100 videos to understand ideas like counterfactuals, instrumental variables, differences-in-differences, regression discontinuity… (from an econ/ss perspective). Mod•U: Powerful Concepts in Social Science

Courtesy of @sannykimchi

MISC

Information that doesn’t really fall into a concrete place here,

Are you ever confused about cluster computing, containers or scaling experiments? Then Stanford’s @stats285 might be a great way to better understand cloud computing, distributed tools and research infrastructure. Massive Computational Experiments, Painlessly

Courtesy of @sannykimchi

References

A mind map of the things listed above, thanks to Daniel Bourke, you can check out his YT channel here
Data Engineering, comprehensive list to get started
For Airbnb / Airflow related articles, see A Beginner’s Guide To Data Engineering and Using Machine Learning To Predict Value Of Homes On Airbnb
Machine Learning Crash Course, With TensorFlow APIs