Notes on Practical Statistics for Data Scientists
These are my personal notes on the book Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python by Peter Bruce, Andrew Bruce, and Peter Gedeck. I will update this post as I study and digest the contents of the book.
Chapter 1 - Exploratory Data Analysis
John W. Tukey established the field of exploratory data analysis with his 1977 book Exploratory Data Analysis, in which he introduced methods for exploring a dataset using plots and summary statistics (mean, median, etc.). In 2015, David Donoho, a former undergraduate student of Tukey's, published the article 50 Years of Data Science, which traces the genesis and development of data science as a field.
- Before analyzing the data, it is important to identify the type of data to be studied.
- The type of data influences which analysis methods can be used to explore it.
The taxonomy of data types is as follows.
- Numeric
  - Continuous: data that can take any value in an interval
  - Discrete: data that can take only integer values
- Categorical
  - Binary: data with just two categories
  - Ordinal: categorical data that is explicitly ordered
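As a minimal sketch of how these types might be represented in practice, assuming pandas is available (the column names and values below are made up for illustration and do not come from the book):

```python
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
df = pd.DataFrame({
    "temperature_c": [21.5, 19.0, 23.25],    # numeric, continuous
    "num_visits": [3, 0, 7],                 # numeric, discrete
    "city": ["Berlin", "Paris", "Berlin"],   # categorical (no order)
    "is_member": [True, False, True],        # binary
    "size": ["small", "large", "medium"],    # ordinal
})

# Mark unordered and ordered categoricals explicitly.
df["city"] = df["city"].astype("category")
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)

print(df.dtypes)
print(df["temperature_c"].mean())   # a mean makes sense for numeric data
print(df["city"].value_counts())    # counts make sense for categorical data
print(df["size"].min())             # min/max only work because "size" is ordered
```

The point of the sketch is that the declared type decides which summaries are meaningful: a mean for numeric columns, counts for categories, and comparisons only for ordered categories.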
Ordinal data
Of the types above, ordinal data is the interesting one because the order of the categories matters. Here is an example that encodes categories as integer codes using sklearn's OrdinalEncoder.
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],[1., 0.]])
Explanation of enc.transform
First sample: ['Female', 3]
Gender column: 'Female' is the first category in enc.categories_[0], so it is encoded as 0
Number column: 3 is the third category in enc.categories_[1], so it is encoded as 2
Result: [0., 2.]
Second sample: ['Male', 1]
Gender column: 'Male' is the second category in enc.categories_[0], so it is encoded as 1
Number column: 1 is the first category in enc.categories_[1], so it is encoded as 0
Result: [1., 0.]
enc.inverse_transform([[1, 1], [0, 2]])
array([['Male', 2],['Female', 3]], dtype=object)
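One caveat worth sketching: in the example above the categories come out in sorted order ('Female' before 'Male'), which is not necessarily a meaningful ranking. For a truly ordinal feature, the intended order can be passed through the `categories` parameter. The shirt-size data below is a hypothetical example, not taken from the book:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: sizes with a natural order small < medium < large.
sizes = [["medium"], ["small"], ["large"], ["small"]]

# Default behaviour: categories are inferred from the data and sorted,
# which for strings means alphabetical order (large, medium, small).
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes)

# Explicit order: one list of categories per column, in the intended ranking.
ordered_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered_enc.fit_transform(sizes)

print(default_enc.categories_)
print(default_codes.ravel())   # codes follow alphabetical order
print(ordered_codes.ravel())   # codes follow small < medium < large
```

With the explicit order, a larger code always means a larger size, which is what a model or a rank-based statistic would need.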
Rectangular data
Analysis in data science typically focuses on rectangular data: a two-dimensional matrix with records as rows and features (variables) as columns. Rectangular data is usually the result of preprocessing unstructured data. The key terms in rectangular data are as follows.
- Data frame: the basic data structure used in statistical and machine learning models
- Feature: a column within a table
- Outcome: the dependent variable, an output that depends on one or more features; also called target, response, or output
- Record: a row within a table; also called a case, scenario, observation, pattern, or sample
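These terms can be illustrated with a small pandas data frame. This is only a sketch: the housing columns and values are invented, and treating `price_eur` as the outcome is an assumption made for the example.

```python
import pandas as pd

# Hypothetical housing data; column names and values are invented.
houses = pd.DataFrame({
    "area_sqm": [54.0, 120.5, 80.0, 65.0],
    "rooms": [2, 5, 3, 3],
    "has_garden": [False, True, True, False],
    "price_eur": [210_000, 480_000, 330_000, 255_000],
})

# Features (predictors) are the columns used to predict the outcome.
X = houses.drop(columns="price_eur")
# Outcome (target / response) is the column to be predicted.
y = houses["price_eur"]

n_records, n_features = X.shape
print(n_records, n_features)   # number of rows, number of feature columns
print(houses.iloc[0])          # a single record (one row)
```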
Non-rectangular data structures
Time series data records sequential measurements of the same variable over time. This kind of data is used to build statistical forecasting models. An example is an IoT sensor that captures temperature every 2 minutes. Each record must include the time at which it was captured.
Spatial data is used for location-based analytics. The object under observation can be, for example, a house or a point of interest on a map, together with its spatial coordinates.
Graph (or network) data represents abstract relationships between the objects under observation. An example is a person's social network, showing how many contacts or friends the person has and how often they interact. This type of data is useful in recommender systems and optimization problems.
All three can be combined in a single use case. For example, Google Maps can store a person's spatial data as a time series and add graph/network data on how the user interacts with other spatial objects (shops, landmarks) when they travel 60 km away from home.
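The three structures above can be sketched in Python as follows. Everything here is illustrative: the sensor readings, place coordinates, and contact names are invented, and the haversine formula is just one common way to get a distance from latitude/longitude pairs.

```python
import math
import pandas as pd

# Time series: hypothetical sensor readings taken every 2 minutes.
timestamps = pd.date_range("2024-01-01 00:00", periods=5, freq="2min")
temperature = pd.Series([20.1, 20.3, 20.2, 20.6, 20.8], index=timestamps)

# Spatial data: hypothetical places stored as (latitude, longitude) in degrees.
places = {"home": (0.0, 0.0), "shop": (1.0, 0.0)}

def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

# Graph data: hypothetical social network as an adjacency dict,
# where each edge weight is the number of interactions.
contacts = {
    "ana": {"ben": 12, "chen": 3},
    "ben": {"ana": 12},
    "chen": {"ana": 3},
}

print(temperature.index[1] - temperature.index[0])       # spacing between records
print(haversine_km(places["home"], places["shop"]))      # distance in km
print(len(contacts["ana"]), sum(contacts["ana"].values()))  # contacts, interactions
```

Note that none of these fits naturally into a single table of independent rows: the time series depends on record order, the spatial data on geometry, and the graph on links between records.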
Difference in terminologies
Statisticians use predictor variables to predict a response or dependent variable, while data scientists use features to predict a target.
To a computer scientist, a sample is a single row, while to a statistician it is a collection of rows.
In statistics, a graph can mean a plot or visualization, not just the connections between entities as in computer science or information technology.
Related literature
John W. Tukey: Exploratory Data Analysis
David Donoho: 50 Years of Data Science
Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for data scientists: 50+ essential concepts using R and Python. O'Reilly Media.