Data Science
Data Science is a HUGE field, and it has been called one of the highest-paying jobs for the coming years. It’s a long journey, but here’s a way to get started with it!
Table Of Contents
 Please Read First
 An Introduction
 Technologies
 General
 Getting Started
 People To Follow
 FAQ
 Learning Materials
 Books To Look Into
 Misc Books
 Competitions
 Blogging
 Research Purposes
 Artificial Intelligence Concept List
 Full Stack Data Science
 Machine Learning + Deep Learning Topics
 Machine Learning frameworks and libraries in C++
 Datasets
 Mathematical Background
 Reinforcement Learning
 Data Engineering / Analysis
 Courses
 Interesting Applications
 Youtube Channels
 MISC
 References
Please Read First
This guide is meant to present Data Science as a full stack field: not limited to just building models, but also exploring the market possibilities and the tools that companies, from the Fortune 500 to small-scale businesses, use to make business decisions under the agile motto of “quick to market”. Use this guide not only to get good at building models for data, but also to think deeply about how you handle data in any case, and about the different ways you can incorporate data science to make some easy predictions!
An Introduction
Data Science refers to the statistical analysis of data for patterns and trends, both to make business decisions and to teach computers to recognize those patterns and give useful answers. It’s a huge field that can be used to explore questions or interesting ideas like:
 How do I teach a computer to recognize an image?
 How can I train a computer bot to talk like a human?
 How can I train a computer to accurately distinguish between two datasets?
Of course, it’s not immediately obvious how we can do all this, but what we can do is break this huge field into several components and see how they all help.
 Machine Learning, a field that builds statistical models from data to answer questions, such as predicting some value ( regression ) or classifying data into “yes” or “no” ( classification ).
 Deep Learning, popularized by Neural Networks, a way to identify and extract features from given data, which allows more accurate insights into problems involving speech and text ( Recurrent Neural Networks, NLP ) or images ( OCR, Convolutional Neural Networks ).
 Data Analysis, a sub-field of Data Science that provides a visual understanding of the data at hand and shows the trends in it.
 Data Engineering, a way to develop data pipelines, maintain proper data sets and schemas, and understand who has access to what data and how. It is responsible for keeping a check on the inflow and outflow of data, and for providing insights into the data streaming in and out.
 Artificial Intelligence, the grass root that motivated Data Science to move away from Statistics and Probability into a complete field of its own. It’s a way of thinking that presents the main theoretical models that have inspired the idea of teaching computers to think and act like humans.
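To make the regression / classification distinction above concrete, here is a minimal sketch in plain Python. The data, the helper names `fit_line` and `nearest_centroid`, and the choice of a nearest-centroid rule are all assumptions made for illustration; in practice you would likely reach for a library instead.

```python
# Regression predicts a number; classification predicts a label.
# All data below is made up for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def nearest_centroid(train, point):
    """Classify `point` with the label whose mean feature value is closest."""
    centroids = {}
    for label in {lbl for _, lbl in train}:
        values = [x for x, lbl in train if lbl == label]
        centroids[label] = sum(values) / len(values)
    return min(centroids, key=lambda lbl: abs(centroids[lbl] - point))


# Regression: hypothetical hours studied -> exam score
hours = [1, 2, 3, 4]
scores = [52, 54, 56, 58]
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5 + intercept  # predict the score for 5 hours

# Classification: hypothetical message length -> is it spam, "yes" or "no"?
messages = [(10, "no"), (12, "no"), (90, "yes"), (110, "yes")]
predicted_label = nearest_centroid(messages, 80)

print(predicted_score, predicted_label)
```

The regression half returns a continuous value, while the classification half can only ever return one of the labels it saw during training.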
Technologies
There are a lot of technologies for each and every sub-field of data science. Language-wise, you can use:
 SQL
 R
 Python
These let you extract patterns and the required data. Libraries? R has a ton of them for statistical measures, and Python has plenty, too. Some common ones from Python are:
 TensorFlow
 scikit-learn
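As a rough illustration of how SQL and Python can work together to extract data, the sketch below uses only Python's standard library (`sqlite3` and `statistics`). The `sales` table and its rows are hypothetical and exist only for this example.

```python
import sqlite3
from statistics import mean

# An in-memory database with a made-up table, just for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0), ("south", 100.0)],
)

# SQL handles the extraction and aggregation...
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)

# ...and Python takes over for any further analysis
amounts = [row[0] for row in conn.execute("SELECT amount FROM sales")]
average_sale = mean(amounts)
conn.close()

print(totals, average_sale)
```

The same split tends to hold at larger scale: let the database do the filtering and grouping, and pull only the reduced result into Python.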
General
Getting Started
 Domingos, Pedro. “A few useful things to know about machine learning.” Communications of the ACM 55, no. 10 (2012): 78-87
 Shewchuk, Jonathan Richard. “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.” 1994
 To understand cost functions better: An Introduction To Understanding Cost Functions
People To Follow
The following people are the big names in Machine Learning and Deep Learning. You can follow them on Twitter and Quora, which gives a peek into their lives and how they think. Sometimes their discussions are a great source of learning, too.
 Geoffrey Hinton
 Yoshua Bengio
 Yann LeCun
 Andrew Ng
 Christopher Manning
 Andrej Karpathy
 Ian Goodfellow
 Fei-Fei Li
 Nando de Freitas
 Jeremy Howard
 Rachel Thomas
 Pieter Abbeel
 Richard Socher
 Francois Chollet
 Soumith Chintala
List courtesy of FAST’s leading AI society, AIMLC, under ACM
FAQ

So what’s the difference between Data Science and Data Engineering?

Skills or certificates?

What else should I look into? Is Data Science a complete field in and of itself?

Who should I follow for Data Science?

What’s Data Mining?!

Should I be really good at Maths to get started with anything at all?

I’m not a very good coder. What can I do?

I’m not a research kind of person. What can I do?

I’m just a student! What can I do?
Learning Materials
Books To Look Into
 Building Machine Learning Pipelines by Hannes Hapke & Catherine Nelson, August 2020
 Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
 Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – O’Reilly, Aurélien Géron
 Pattern Recognition And Machine Learning, Christopher M. Bishop
 An Introduction to Statistical Learning – Trevor Hastie
 Machine Learning: A Probabilistic Perspective – Kevin P. Murphy
 Elements of Statistical Learning – Trevor Hastie
Misc Books
 Automate The Boring Stuff With Python, Practical Programming For Total Beginners, Al Sweigart
Competitions
Blogging
If you like writing blog posts about your data science findings, try out these:
Research Purposes
 Arxiv Sanity
 Made With ML
 Papers With Code
 Kaggle / Dataquest / Google / Open Images / Big Bad NLP / Awesome Data Labelling Datasets
 IEEE papers
Artificial Intelligence Concept List
 CS50’s Artificial Intelligence With Python
 Graph search, Heuristic search, Adversarial search, Alpha-Beta Pruning
 Knowledge Representation, Propositional Logic, Inference, Resolution, First-Order Logic
 Probability, Random Variables, Probabilistic Inference, Bayesian Networks, Markov Networks
 Local Search, Hill Climbing, Constraint Satisfaction, Backtracking Search
 Classification, Regression, Support Vector Machines, Reinforcement Learning, Clustering
 Feed-Forward Networks, Backpropagation, Convolutional Networks, Recurrent Networks
 Context-Free Grammar, N-Gram Models, Naïve Bayes, tf-idf, word2vec
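To give a feel for one item from the list above, here is a minimal sketch of tf-idf in plain Python. It assumes one common variant (term frequency as count divided by document length, inverse document frequency as `log(N / df)`); libraries often use smoothed or normalized versions, so their numbers will differ. The helper name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes tf = count / document length and idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document (here, "the") gets a weight of zero, while a word unique to one document (here, "sat") scores highest in it, which is the intuition behind the technique.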
Full Stack Data Science
 Setting Up Machine Learning Projects
 Infrastructure And Tooling
 Data Management
 Machine Learning Teams
 Training And Debugging
 Testing And Deployment
 Research Areas
Machine Learning + Deep Learning Topics
 Methods
 Dimensionality Reduction
 Ensemble Learning
 Meta Learning
 Reinforcement Learning
 Supervised Learning
 Unsupervised Learning
 Semi-supervised Learning
 Deep Learning
 Anomaly Detection
 PAC Learning
 Regression
 Statistical Learning
 Structured Prediction
 Feature Learning
 Feature Engineering
 Bias-Variance Dilemma
 Association Rules
 Pretrained Models ( for transfer learning )
 TensorFlow Hub
 PyTorch Hub
 HuggingFace Transformers
 Detectron 2
 Experiment Tracking
 TensorBoard, track and visualize metrics, view model graphs, look at images, text and audio data, integrated with TensorFlow and PyTorch
 Dashboard by Weights & Biases
 Neptune.ai
 Data And Model Tracking
 Artifacts by Weights & Biases
 DVC ( Data Version Control )
 Cloud Computing Services
 AWS – Sagemaker
 GCP – AI Platform
( P.S, Maybe invest in hardware? )
Machine Learning frameworks and libraries in C++
This section is courtesy of @Andrew Ng!
 mlpack: a scalable C++ machine learning library.
 SHARK: a fast, modular, feature-rich open-source C++ machine learning library.
 Dlib-ml: A Machine Learning Toolkit.
 Waffles: A collection of command-line tools for researchers in machine learning, data mining, and related fields. All of the functionality is also provided in a clean C++ class library.
 MLC++: a library of C++ classes for supervised machine learning.
Datasets
Mathematical Background

Linear Algebra
 a. Essence of Linear Algebra – 3blue1brown (YouTube)
 b. MIT 18.06 Linear Algebra – Gilbert Strang
 c. Coding the Matrix – Brown University

Probability and Statistics
 a. MIT 6.041 Probabilistic Systems Analysis and Applied Probability – John Tsitsiklis
 b. Statistics 110 – Harvard University
 c. Statistics with R – Duke University (Coursera)
 d. Discovering Statistics using SPSS – Andy Field (Book)

Calculus
 a. Essence of Calculus – 3blue1brown (YouTube)
 b. MIT Highlights of Calculus – Gilbert Strang
 c. MIT 18.01 Single Variable Calculus – MIT OCW
 d. MIT 18.02 Multivariable Calculus – MIT OCW
Reinforcement Learning
 Deep RL Bootcamp, https://sites.google.com/view/deeprlbootcamp/lectures
 Dennybritz – Reinforcement Learning, https://github.com/dennybritz/reinforcement-learning
 Incomplete Ideas – Reinforcement Learning, http://www.incompleteideas.net/book/RLbook2018.pdf
 RL Course by David Silver, Lecture 1, https://www.youtube.com/watch?v=2pWv7GOvuf0
Data Engineering / Analysis
 Technology Stack
 Query Languages: SQL
 Programming Languages: Python, Go, R
 Data Warehousing: Amazon S3, Hive, HDFS, HBase
 Data Store: DynamoDB, Hadoop, MongoDB, PostgreSQL
 Data Infrastructure: AWS, GCP
 Data Handling Frameworks
 Apache Spark + Apache Flink ( Stream Processing )
 Apache Beam + Apache Kafka ( Batch, Parallel Processing )
 Linux: Cron Jobs, ETL
 Data Visualization: Tableau
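The ETL item in the stack above can be sketched with nothing but Python's standard library. This is only an illustrative toy, not a production pipeline: the CSV content, the `payments` table, and the `extract` / `transform` / `load` function names are all hypothetical, and a real setup would read from actual sources and be scheduled (for example by a cron job).

```python
import csv
import io
import sqlite3

# Made-up raw input; in practice this would come from a file or an API
RAW_CSV = """user_id,country,amount
1,pk,10.50
2,US,
3,us,7.25
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount, normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), row["country"].upper(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions makes each one easy to test on its own, which matters once the pipeline grows beyond a toy.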
Courses
 Introduction To Machine Learning, Coursera, Andrew Ng
 Deeplearning.Ai Specialization, Coursera, Andrew Ng
 TensorFlow in Practice, Coursera, Laurence Moroney
 Data Engineering With Google Cloud Certificate, Coursera
 The Unix Workbench, Coursera
 Data Science: Statistics And Machine Learning Specialization, Coursera
 Data Science: Foundations Using R Specialization, Coursera
 DevOps Culture And Mindset
 Machine Learning With Tensorflow On Google Cloud Platform
 FAST AI MOOC, Part I and II
 CS231n, a Stanford course specifically for Computer Vision
 CS224n, a Stanford course specifically for Natural Language Processing
 Mathematics for Machine Learning Specialization
 Depth First Learning
 CS224w ML with Graphs
 CS229 Machine Learning, Stanford: This is the Stanford CS course on Machine Learning that Prof. Ng has taught for a number of years. The material parallels the Coursera course, but covers some additional topics and goes into much more depth on the mathematics.
 Cornell Virtual Workshop
 Dive Into Machine Learning
Interesting Applications
 Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. “Information credibility on Twitter.” In Proceedings of the 20th international conference on World wide web, pp. 675-684. ACM, 2011.
 Norman, Kenneth A., Sean M. Polyn, Greg J. Detre, and James V. Haxby. “Beyond mind-reading: multi-voxel pattern analysis of fMRI data.” Trends in cognitive sciences 10, no. 9 (2006): 424-430.
 Pereira, Francisco, Tom Mitchell, and Matthew Botvinick. “Machine learning classifiers and fMRI: a tutorial overview.” Neuroimage 45, no. 1 Suppl (2009): S199.
Youtube Channels
 For Causal Inference, I’d highly recommend @mattmasten’s Causal Inference bootcamp. Over 100 videos to understand ideas like counterfactuals, instrumental variables, differences-in-differences, regression discontinuity… (from an econ/ss perspective): Mod•U: Powerful Concepts in Social Science
Courtesy of @sannykimchi
MISC
Information that doesn’t really fall into a concrete place here:
 Are you ever confused about cluster computing, containers or scaling experiments? Then Stanford’s @stats285 might be a great way to better understand cloud computing, distributed tools and research infrastructure: Massive Computational Experiments, Painlessly
Courtesy of @sannykimchi
References
 A mind map of the things listed above, thanks to Daniel Bourke; you can check out his YT channel here
 Data Engineering, comprehensive list to get started
 For Airbnb / Airflow related articles, see A Beginner’s Guide To Data Engineering and Using Machine Learning To Predict Value Of Homes On Airbnb