Data Science

Data Science is a HUGE field, and it has been called one of the highest-paying jobs for the coming years. It’s a long journey, but here’s a way to get started with it!

Table Of Contents

Please Read First

This guide is meant to present Data Science as a full stack field. It is not limited to building models, but also explores the market possibilities and the tools that companies, from the Fortune 500 down to small businesses, use to make business decisions under the agile motto of quick to market. Use this guide not only to get good at building models for data, but also to dive deep into how you think about handling data in any case, and into the different ways you can incorporate data science to make some easy predictions!

An Introduction

Data Science refers to the statistical analysis of data for patterns and trends, both to make business decisions and to teach computers to recognize that data and give definitive answers. It’s a huge field that can be used to answer questions or explore interesting ideas like:

  1. How do I teach a computer to recognize an image?
  2. How can I train a computer bot to talk like a human?
  3. How can I train a computer to accurately distinguish between two datasets?

Of course, it’s not immediately obvious how we can do all this, but what we can do is break this huge field into several components and see how each of them helps.

  • Machine Learning, a field that builds statistical models to answer questions with some measurable accuracy, such as predicting a value ( regression ) or sorting data into categories like yes or no ( classification ).
  • Deep Learning, popularized by Neural Networks, a way to identify and extract features from the given data and use them to provide more accurate insights into problems of Speech ( Recurrent Neural Networks, NLP ) or Images ( OCR, Convolutional Neural Networks ).
  • Data Analysis, a subfield of Data Science that provides a visual understanding of the data at hand and shows the trends in it.
  • Data Engineering, a way to develop data pipelines, maintain proper data sets and schemas, and understand who has access to what data and how. It is responsible for keeping a check on the inflow and outflow of data, and for providing insights into the data streaming in and out.
  • Artificial Intelligence, the grassroots idea that motivated Data Science to move away from Statistics and Probability into a complete field of its own. It’s a way of thinking that presents the main theoretical models that have inspired the idea of teaching computers to think and act like humans.
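
To make the regression versus classification distinction from the Machine Learning bullet concrete, here is a minimal sketch in plain Python. The toy data and the helper names (`fit_line`, `fit_centroids`, `classify`) are made up for illustration, and a nearest-centroid rule is just one simple way to classify. In practice you would reach for a library such as scikit-learn.

```python
# Toy data and helper names are hypothetical, chosen only for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_centroids(points, labels):
    """Nearest-centroid classifier: store the mean point of each class."""
    sums = {}
    counts = {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(centroids, point):
    """Return the label whose centroid is closest to the point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], point))


# Regression: predict a number (hours studied -> exam score).
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5.0 + intercept

# Classification: predict a yes/no label for a new point.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels = ["no", "no", "yes", "yes"]
centroids = fit_centroids(points, labels)
predicted_label = classify(centroids, (7.5, 9.0))

print(predicted_score, predicted_label)
```

The regression half returns a number on a continuous scale, while the classification half returns one of a fixed set of labels. That is the whole difference between the two problem types.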

Technologies

There are a lot of technologies for each and every subfield of data science. Language-wise, you can use:

  • SQL
  • R
  • Python

These are used to extract patterns and the required data. Libraries? R has a ton of them for statistical measures. Python has them, too. Some common ones from Python are:

  • TensorFlow
  • scikit-learn

General

Getting Started

People To Follow

The following people are the big names in Machine Learning and Deep Learning. You can follow them on Twitter and Quora. It gives you a peek into their lives and how they think, and sometimes their discussions are a great source of learning too.

  • Geoffrey Hinton
  • Yoshua Bengio
  • Yann LeCun
  • Andrew Ng
  • Christopher Manning
  • Andrej Karpathy
  • Ian Goodfellow
  • Fei-Fei Li
  • Nando de Freitas
  • Jeremy Howard
  • Rachel Thomas
  • Pieter Abbeel
  • Richard Socher
  • François Chollet
  • Soumith Chintala

List courtesy of FAST’s leading AI society, AIMLC, under ACM

FAQ

  1. So what’s the difference between Data Science and Data Engineering?

  2. Skills or certificates?

  3. What else should I look into? Is Data Science a complete field in and of itself?

  4. Who should I follow for Data Science?

  5. What’s Data Mining?!

  6. Should I be really good at Maths to get started with anything at all?

  7. I’m not a very good coder. What can I do?

  8. I’m not a research kind of person. What can I do?

  9. I’m just a student! What can I do?

Learning Materials

Books To Look Into

  • Building Machine Learning Pipelines by Hannes Hapke & Catherine Nelson - August 2020
  • Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – O’Reilly, Aurélien Géron
  • Pattern Recognition And Machine Learning, Christopher M. Bishop
  • An Introduction to Statistical Learning – Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
  • Machine Learning: A Probabilistic Perspective – Kevin P. Murphy
  • The Elements of Statistical Learning – Trevor Hastie, Robert Tibshirani, Jerome Friedman

Misc Books

  • Automate The Boring Stuff With Python, Practical Programming For Total Beginners, Al Sweigart

Competitions

Blogging

If you like writing blog posts about your data science findings, try out these:

Research Purposes

  • Arxiv Sanity
  • Made With ML
  • Papers With Code
  • Kaggle / Dataquest / Google / Open Images / Big Bad NLP / Awesome Data Labelling Datasets
  • IEEE papers

Artificial Intelligence Concept List

  • CS50’s Artificial Intelligence With Python
    • Graph search, Heuristic search, Adversarial search, Alpha-Beta Pruning
    • Knowledge Representation, Propositional Logic, Inference, Resolution, First-Order Logic
    • Probability, Random Variables, Probabilistic Inference, Bayesian Networks, Markov Networks
    • Local Search, Hill Climbing, Constraint Satisfaction, Backtracking Search
    • Classification, Regression, Support Vector Machines, Reinforcement Learning, Clustering
    • Feed-Forward Networks, Back-Propagation, Convolutional Networks, Recurrent Networks
    • Context-Free Grammar, N-Gram Models, Naïve Bayes, tf-idf, word2vec
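
As a small taste of one concept from the list above, here is a hedged sketch of tf-idf in plain Python. It assumes whitespace tokenization and the weighting tf = count / document length, idf = log(N / document frequency). This is only one of several common variants, and the function name `tf_idf` and the sample sentences are illustrative rather than taken from the course.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumes tf = count / document length and idf = log(N / document frequency),
    which is one common variant among several.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(docs)

for doc, doc_weights in zip(docs, weights):
    print(doc, doc_weights)
```

A word that appears in every document (here "the") gets a weight of zero, while a word unique to one document (here "dog") gets the highest weight. That is the intuition behind using tf-idf to find the terms that characterize a document.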

Full Stack Data Science

Machine Learning + Deep Learning Topics

  • Methods
    • Dimensionality Reduction
    • Ensemble Learning
    • Meta Learning
    • Reinforcement Learning
    • Supervised Learning
    • Unsupervised Learning
    • Semi-supervised Learning
    • Deep Learning
    • Anomaly Detection
    • PAC Learning
    • Regression
    • Statistical Learning
    • Structured Prediction
    • Feature Learning
    • Feature Engineering
    • Bias-Variance Dilemma
    • Association Rules
  • Pretrained Models ( for transfer learning )
    • TensorFlow Hub
    • PyTorch Hub
    • HuggingFace Transformers
    • Detectron 2
  • Experiment Tracking
    • TensorBoard, track and visualize metrics, view model graphs, look at images, text and audio data, integrated with TensorFlow and PyTorch
    • Dashboard by Weights & Biases
    • Neptune.ai
  • Data And Model Tracking
    • Artifacts by Weights & Biases
    • DVC ( Data Version Control )
  • Cloud Computing Services
    • AWS – SageMaker
    • GCP – AI Platform

( P.S. Maybe invest in hardware? )
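
Going back to the Methods list above, here is a minimal sketch of one of the simpler entries, Anomaly Detection, using a z-score rule in plain Python. The threshold of 2.0, the function name `find_anomalies`, and the sample readings are assumptions for illustration. Real projects usually need more robust methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 25.0]
anomalies = find_anomalies(readings)
print(anomalies)
```

Note that a single large outlier also inflates the mean and standard deviation it is measured against, which is one reason median-based variants are often preferred on small samples.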

Machine Learning frameworks and libraries in C++

This section is courtesy of @Andrew Ng!

  • mlpack: a scalable C++ machine learning library.
  • SHARK: a fast, modular, feature-rich open-source C++ machine learning library.
  • Dlib-ml: A Machine Learning Toolkit.
  • Waffles: A collection of command-line tools for researchers in machine learning, data mining, and related fields. All of the functionality is also provided in a clean C++ class library.
  • MLC++: a library of C++ classes for supervised machine learning.

Datasets

Mathematical Background

  • Linear Algebra
    a. Essence of Linear Algebra – 3blue1brown (YouTube)
    b. MIT 18.06 Linear Algebra – Gilbert Strang
    c. Coding the Matrix – Brown University

  • Probability and Statistics
    a. MIT 6.041 Probabilistic Systems Analysis and Applied Probability – John Tsitsiklis
    b. Statistics 110 – Harvard University
    c. Statistics with R – Duke University (Coursera)
    d. Discovering Statistics using SPSS – Andy Field (Book)

  • Calculus
    a. Essence of Calculus – 3blue1brown (YouTube)
    b. MIT Highlights of Calculus – Gilbert Strang
    c. MIT 18.01 Single Variable Calculus – MIT OCW
    d. MIT 18.02 Multivariable Calculus – MIT OCW

Reinforcement Learning

  • Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
  • Dennybritz - Reinforcement Learning, https://github.com/dennybritz/reinforcement-learning
  • Incomplete Ideas – Reinforcement Learning, http://www.incompleteideas.net/book/RLbook2018.pdf
  • RL Course by David Silver, Lecture 1, https://www.youtube.com/watch?v=2pWv7GOvuf0

Data Engineering / Analysis

  • Technology Stack
    • Query Languages: SQL
    • Programming Languages: Python, Go, R
    • Data Warehousing: Amazon S3, Hive, HDFS, HBase
    • Data Store: DynamoDB, Hadoop, MongoDB, PostgreSQL
    • Data Infrastructure: AWS, GCP
  • Data Handling Frameworks
    • Apache Spark + Apache Flink ( Stream Processing )
    • Apache Beam + Apache Kafka ( Batch, Parallel Processing )
  • Linux: Cron Jobs, ETL
  • Data Visualization: Tableau

Courses

Interesting Applications

Youtube Channels

  • For Causal Inference, I’d highly recommend @mattmasten’s Causal Inference bootcamp. Over 100 videos to understand ideas like counterfactuals, instrumental variables, differences-in-differences, regression discontinuity… (from an econ/ss perspective): Mod•U: Powerful Concepts in Social Science

    Courtesy of @sannykimchi

MISC

Information that doesn’t really fall into a concrete place here,

  • Are you ever confused about cluster computing, containers or scaling experiments? Then Stanford’s @stats285 might be a great way to better understand cloud computing, distributed tools and research infrastructure: Massive Computational Experiments, Painlessly

    Courtesy of @sannykimchi

References