Data Science
Data Science is a HUGE field, and it has been called one of the highest-paying career paths of the coming years. It’s a long journey, but here’s a way to get started with it!
Table Of Contents
- Please Read First
- An Introduction
- Technologies
- General
- Getting Started
- People To Follow
- FAQ
- Learning Materials
- Books To Look Into
- Misc Books
- Competitions
- Blogging
- Research Purposes
- Artificial Intelligence Concept List
- Full Stack Data Science
- Machine Learning + Deep Learning Topics
- Machine Learning frameworks and libraries in C++
- Datasets
- Mathematical Background
- Reinforcement Learning
- Data Engineering / Analysis
- Courses
- Interesting Applications
- Youtube Channels
- MISC
- References
Please Read First
This guide is meant to present Data Science as a full stack field: not limited to just building models, but also exploring the market possibilities and the tools that companies, from the Fortune 500 to small businesses, use to make business decisions under the agile motto of “quick to market”. Use this guide not only to get good at building models for data, but also to think deeply about how you handle data in any situation, and about the different ways you can incorporate data science to make some easy predictions!
An Introduction
Data Science refers to the statistical analysis of data for patterns and trends, both to inform business decisions and to teach computers to recognize those patterns and give answers. It’s a huge field that can be used to explore questions or interesting ideas like,
- How do I teach a computer to recognize an image?
- How can I train a computer bot to talk like a human?
- How can I train a computer to accurately distinguish between two datasets?
Of course, it’s not immediately obvious how we can do all this, but what we can do is break this huge field into several components and see how each one helps.
- Machine Learning, a field that builds statistical models to answer questions with some measurable accuracy, such as predicting a value ( regression ) or sorting data into categories like “yes” or “no” ( classification ).
- Deep Learning, popularized by Neural Networks, a way to identify and extract features from data and use them to tackle problems in Speech and language ( Recurrent Neural Networks, NLP ) or Images ( OCR, Convolutional Neural Networks ) more accurately.
- Data Analysis, a sub field of Data Science that provides a visual understanding of the data at hand and shows the trends in it.
- Data Engineering, a way to develop data pipelines, maintain proper data sets and schemas, and understand who has access to what data and how. It is responsible for keeping a check on the inflow and outflow of data, and for providing insights into the data streaming in and out.
- Artificial Intelligence, the grass root that motivated Data Science to move away from Statistics And Probability into a complete field of its own. It’s a way of thinking that presents the main theoretical models that have inspired the idea of teaching computers to think and act like humans.
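To make the regression versus classification distinction above concrete, here is a minimal sketch in plain Python. The data points, the labels, and the helper names `fit_line` and `nearest_centroid` are made up for illustration; real projects would more likely reach for a library such as scikit-learn.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (regression)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def nearest_centroid(train, point):
    """Return the label whose mean feature value is closest to `point`."""
    centroids = {
        label: mean(x for x, lab in train if lab == label)
        for label in {lab for _, lab in train}
    }
    return min(centroids, key=lambda label: abs(centroids[label] - point))


# Regression: predict a number. Toy data generated from y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept

# Classification: predict "yes" or "no". Toy 1-D feature, made-up labels.
train = [(1.0, "no"), (1.5, "no"), (4.0, "yes"), (5.0, "yes")]
label = nearest_centroid(train, 4.2)

print(prediction, label)
```

The regression model returns a continuous value, while the classifier returns one of a fixed set of labels. That difference in output type is the main thing separating the two problem families.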
Technologies
There are a lot of technologies for each and every sub field of data science. Language-wise, you can use any of the following to extract patterns and the required data,
- SQL
- R
- Python
Libraries? R has a ton of them for statistical measures. Python has them, too. Some common ones from Python are,
- TensorFlow
- scikit-learn
General
Getting Started
- Domingos, Pedro. “A few useful things to know about machine learning.” Communications of the ACM 55, no. 10 (2012): 78-87
- Shewchuk, Jonathan Richard. “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.” 1994
- To understand cost functions better, see An Introduction To Understanding Cost Functions
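As a companion to the cost function reading above, the sketch below shows one common cost function, mean squared error, being minimized with gradient descent for a straight-line model. The data, the learning rate, the step count, and the function names are assumptions chosen for illustration, not taken from the linked material.

```python
def mse_cost(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w, b, xs, ys, lr):
    """Move w and b one step against the gradient of the MSE cost."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return w - lr * dw, b - lr * db


# Toy data generated from y = 2x + 1 (made up for illustration).
xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
start_cost = mse_cost(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, lr=0.05)
end_cost = mse_cost(w, b, xs, ys)

print(start_cost, end_cost, w, b)
```

The cost measures how wrong the current parameters are, and each step nudges them in the direction that lowers it. On this toy data the parameters approach w = 2 and b = 1.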
People To Follow
The following people are leading figures in Machine Learning and Deep Learning. You can follow them on Twitter and Quora for a peek into their lives and how they think. Sometimes their discussions are a great source of learning, too.
- Geoffrey Hinton
- Yoshua Bengio
- Yann LeCun
- Andrew Ng
- Christopher Manning
- Andrej Karpathy
- Ian Goodfellow
- Fei-Fei Li
- Nando de Freitas
- Jeremy Howard
- Rachel Thomas
- Pieter Abbeel
- Richard Socher
- Francois Chollet
- Soumith Chintala
List courtesy of FAST’s leading AI society, AIMLC, under ACM
FAQ
- So what’s the difference between Data Science and Data Engineering?
- Skills or certificates?
- What else should I look into? Is Data Science a complete field in and of itself?
- Who should I follow for Data Science?
- What’s Data Mining?!
- Do I need to be really good at Maths to get started with anything at all?
- I’m not a very good coder. What can I do?
- I’m not a research kind of person. What can I do?
- I’m just a student! What can I do?
Learning Materials
Books To Look Into
- Building Machine Learning Pipelines by Hannes Hapke & Catherine Nelson - August 2020
- Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- Hands On Machine Learning With Scikit Learn, Keras, And Tensorflow – O’Reilly, Aurelien Geron
- Pattern Recognition And Machine Learning, Christopher M. Bishop
- An Introduction to Statistical Learning – Trevor Hastie
- Machine Learning: A Probabilistic Perspective – Kevin P. Murphy
- Elements of Statistical Learning – Trevor Hastie
Misc Books
- Automate The Boring Stuff With Python, Practical Programming For Total Beginners, Al Sweigart
Competitions
Blogging
If you like writing blog posts about your data science findings, try out these
Research Purposes
- Arxiv Sanity
- Made With ML
- Papers With Code
- Kaggle / Dataquest / Google / Open Images / Big Bad NLP / Awesome Data Labelling Datasets
- IEEE papers
Artificial Intelligence Concept List
- CS50’s Artificial Intelligence With Python
- Graph search, Heuristic search, Adversarial search, Alpha-Beta Pruning
- Knowledge Representation, Propositional Logic, Inference, Resolution, First-Order Logic
- Probability, Random Variables, Probabilistic Inference, Bayesian Networks, Markov Networks
- Local Search, Hill Climbing, Constraint Satisfaction, Backtracking Search
- Classification, Regression, Support Vector Machines, Reinforcement Learning, Clustering
- Feed-Forward Networks, Back-Propagation, Convolutional Networks, Recurrent Networks
- Context-Free Grammar, N-Gram Models, Naïve Bayes, tf-idf, word2vec
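Of the concepts listed above, tf-idf is small enough to sketch in a few lines. The version below is one common variant (term frequency as a share of the document, inverse document frequency as `log(N / df)`); libraries often use smoothed formulas instead. The function name `tf_idf` and the three sample sentences are made up for illustration, and documents are assumed to be non-empty lists of tokens.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Uses tf = count / len(doc) and idf = log(N / df), one common variant.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)

# Terms of the first document, highest weight first.
print(sorted(scores[0].items(), key=lambda item: -item[1]))
```

A word that appears in every document, like "the" here, gets a weight of zero, while a word unique to one document scores highest. That is the intuition behind using tf-idf to find the terms that characterize a document.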
Full Stack Data Science
- Setting Up Machine Learning Projects
- Infrastructure And Tooling
- Data Management
- Machine Learning Teams
- Training And Debugging
- Testing And Deployment
- Research Areas
Machine Learning + Deep Learning Topics
- Methods
- Dimensionality Reduction
- Ensemble Learning
- Meta Learning
- Reinforcement Learning
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Deep Learning
- Anomaly Detection
- PAC Learning
- Regression
- Statistical Learning
- Structural Prediction
- Feature Learning
- Feature Engineering
- Unsupervised Learning
- Bias-Variance Dilemma
- Association Rules
- Pretrained Models ( for transfer learning )
- Tensorflow Hub
- PyTorch Hub
- HuggingFace Transformers
- Detectron 2
- Experiment Tracking
- TensorBoard, track and visualize metrics, view model graphs, look at images, text and audio data, integrated with TensorFlow and PyTorch
- Dashboard by Weights & Biases
- Neptune.ai
- Data And Model Tracking
- Artifacts by Weights & Biases
- DVC ( Data Version Control )
- Cloud Computing Services
- AWS – Sagemaker
- GCP – AI Platform
( P.S, Maybe invest in hardware? )
Machine Learning frameworks and libraries in C++
This section is courtesy of @Andrew Ng!
- mlpack: a scalable C++ machine learning library.
- SHARK: a fast, modular, feature-rich open-source C++ machine learning library.
- Dlib-ml: A Machine Learning Toolkit.
- Waffles: A collection of command-line tools for researchers in machine learning, data mining, and related fields. All of the functionality is also provided in a clean C++ class library.
- MLC++: a library of C++ classes for supervised machine learning.
Datasets
Mathematical Background
- Linear Algebra
  a. Essence of Linear Algebra - 3blue1brown (YouTube)
  b. MIT 18.06 Linear Algebra – Gilbert Strang
  c. Coding the Matrix – Brown University
- Probability and Statistics
  a. MIT 6.041 Probabilistic Systems Analysis and Applied Probability – John Tsitsiklis
  b. Statistics 110 – Harvard University
  c. Statistics with R – Duke University (Coursera)
  d. Discovering Statistics using SPSS – Andy Field (Book)
- Calculus
  a. Essence of Calculus – 3blue1brown (YouTube)
  b. MIT Highlights of Calculus – Gilbert Strang
  c. MIT 18.01 Single Variable Calculus – MIT OCW
  d. MIT 18.02 Multivariable Calculus – MIT OCW
Reinforcement Learning
- Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
- Dennybritz - Reinforcement Learning, https://github.com/dennybritz/reinforcement-learning
- Incomplete Ideas – Reinforcement Learning, http://www.incompleteideas.net/book/RLbook2018.pdf
- RL Course by David Silver, Lecture 1, https://www.youtube.com/watch?v=2pWv7GOvuf0
Data Engineering / Analysis
- Technology Stack
- Query Languages: SQL
- Programming Languages: Python, Go, R
- Data Warehousing: Amazon S3, Hive, HDFS, HBase
- Data Store: DynamoDB, Hadoop, MongoDB, PostgreSQL
- Data Infrastructure: AWS, GCP
- Data Handling Frameworks
- Apache Spark + Apache Flink ( Stream Processing )
- Apache Beam + Apache Kafka ( Batch, Parallel Processing )
- Linux: Cron Jobs, ETL
- Data Visualization: Tableau
Courses
- Introduction To Machine Learning, Coursera, Andrew Ng
- Deeplearning.Ai Specialization, Coursera, Andrew Ng
- TensorFlow in Practice, Coursera, George
- Data Engineering With Google Cloud Certificate, Coursera
- The UNIX Workbench, Coursera
- Data Science: Statistics And Machine Learning Specialization, Coursera
- Data Science: Foundations Using R Specialization, Coursera
- DevOps Culture And Mindset
- Machine Learning With Tensorflow On Google Cloud Platform
- FAST AI MOOC, Part I and II
- CS231n, a Stanford open course specifically for Computer Vision
- CS224n, a Stanford open course specifically for Natural Language Processing
- Mathematics for Machine Learning Specialization
- Depth First Learning
- CS224w ML with Graphs
- CS229 Machine Learning - Stanford - This is the Stanford CS course on Machine Learning that Prof Ng has taught for a number of years. The material parallels the Coursera course, but covers some additional topics and goes into much more depth on the mathematics.
- Cornell Virtual Workshop
- Dive Into Machine Learning
Interesting Applications
- Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. “Information credibility on Twitter.” In Proceedings of the 20th international conference on World wide web, pp. 675-684. ACM, 2011.
- Norman, Kenneth A., Sean M. Polyn, Greg J. Detre, and James V. Haxby. “Beyond mind-reading: multi-voxel pattern analysis of fMRI data.” Trends in cognitive sciences 10, no. 9 (2006): 424-430.
- Pereira, Francisco, Tom Mitchell, and Matthew Botvinick. “Machine learning classifiers and fMRI: a tutorial overview.” Neuroimage 45, no. 1 Suppl (2009): S199.
Youtube Channels
- For Causal Inference, I’d highly recommend @mattmasten’s Causal Inference bootcamp. Over 100 videos to understand ideas like counterfactuals, instrumental variables, differences-in-differences, regression discontinuity… (from an econ/ss perspective): Mod•U: Powerful Concepts in Social Science
Courtesy of @sannykimchi
MISC
Information that doesn’t really fall into a concrete place here,
- Are you ever confused about cluster computing, containers or scaling experiments? Then Stanford’s @stats285 might be a great way to better understand cloud computing, distributed tools and research infrastructure: Massive Computational Experiments, Painlessly
Courtesy of @sannykimchi
References
- A mind map of the things listed above, thanks to Daniel Bourke, you can check out his YT channel here
- Data Engineering, comprehensive list to get started
- For Airbnb / Airflow related articles, see A Beginner’s Guide To Data Engineering and Using Machine Learning To Predict Value Of Homes On Airbnb
- Machine Learning Crash Course, With TensorFlow APIs