mindmap
  root((Quick Data Exploration))
    1 - How big is the data?
    2 - What does the data look like?
    3 - What datatypes are found in the data?
    4 - Are there missing data in this dataset?
    5 - How does the data look mathematically?
    6 - Are there duplicate rows in this data?
    7 - What is the correlation between the columns in this data?
What is in my data?
According to the Campusx YouTube channel, these are some of the suggested questions to answer when a new dataset is obtained.
Importing modules
import duckdb
import polars as pl
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
Reading data
# ts is in epoch time format so converting it to timestamp
# rounding values for temperature and humidity
# converting temperature from fahrenheit to celsius
input_data_path = "../data/iot/iot_telemetry_data.parquet"

df_raw = duckdb.sql(
    f"SELECT ts, to_timestamp(ts) AS timestamp, device, temp, "
    f"ROUND((temp - 32) * 5.0 / 9, 4) AS temp_c, ROUND(humidity, 4) AS humidity, "
    f"lpg, smoke, light FROM '{input_data_path}'"
)
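The query above converts temp with the standard Fahrenheit-to-Celsius formula, °C = (°F − 32) × 5 / 9. As a quick sanity check of that arithmetic, here is a small standalone sketch; the helper fahrenheit_to_celsius and the sample values are only for illustration.

# Hypothetical sanity check of the conversion applied in the SQL query above
def fahrenheit_to_celsius(temp_f: float) -> float:
    return round((temp_f - 32) * 5.0 / 9, 4)

print(fahrenheit_to_celsius(32.0))   # 0.0     (freezing point)
print(fahrenheit_to_celsius(22.7))   # -5.1667 (matches the first temp_c value shown below)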
Exploring the data
The seven questions to get insight into the data
How big is the data?
# Converting to polars for easier statistics and exploration
df_pl = df_raw.pl()
df_pl.shape
(405184, 9)
What does the data look like?
df_pl.head()
ts | timestamp | device | temp | temp_c | humidity | lpg | smoke | light |
---|---|---|---|---|---|---|---|---|
f64 | datetime[μs, Europe/Oslo] | str | f64 | f64 | f64 | f64 | f64 | bool |
1.5945e9 | 2020-07-12 02:01:34.385975 CEST | "b8:27:eb:bf:9d:51" | 22.7 | -5.1667 | 51.0 | 0.007651 | 0.020411 | false |
1.5945e9 | 2020-07-12 02:01:34.735568 CEST | "00:0f:00:70:91:0a" | 19.700001 | -6.8333 | 76.0 | 0.005114 | 0.013275 | false |
1.5945e9 | 2020-07-12 02:01:38.073573 CEST | "b8:27:eb:bf:9d:51" | 22.6 | -5.2222 | 50.9 | 0.007673 | 0.020475 | false |
1.5945e9 | 2020-07-12 02:01:39.589146 CEST | "1c:bf:ce:15:ec:4d" | 27.0 | -2.7778 | 76.8 | 0.007023 | 0.018628 | true |
1.5945e9 | 2020-07-12 02:01:41.761235 CEST | "b8:27:eb:bf:9d:51" | 22.6 | -5.2222 | 50.9 | 0.007664 | 0.020448 | false |
What datatypes are found in the data?
"SUMMARIZE df_raw;") duckdb.sql(
┌─────────────┬──────────────────────────┬───────────────────────────────┬───────────────────────────────┬───────────────┬───────────────────────────────┬───────────────────────┬───────────────────────────────┬───────────────────────────────┬───────────────────────────────┬────────┬─────────────────┐
│ column_name │ column_type │ min │ max │ approx_unique │ avg │ std │ q25 │ q50 │ q75 │ count │ null_percentage │
│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │ decimal(9,2) │
├─────────────┼──────────────────────────┼───────────────────────────────┼───────────────────────────────┼───────────────┼───────────────────────────────┼───────────────────────┼───────────────────────────────┼───────────────────────────────┼───────────────────────────────┼────────┼─────────────────┤
│ ts │ DOUBLE │ 1594512094.3859746 │ 1595203417.2643125 │ 491304 │ 1594858017.2968097 │ 199498.39927628823 │ 1594686010.292624 │ 1594857972.1364734 │ 1595030547.7465725 │ 405184 │ 0.00 │
│ timestamp │ TIMESTAMP WITH TIME ZONE │ 2020-07-12 02:01:34.385975+02 │ 2020-07-20 02:03:37.264312+02 │ 326811 │ 2020-07-16 02:06:57.296824+02 │ NULL │ 2020-07-14 02:20:10.292624+02 │ 2020-07-16 02:06:12.136474+02 │ 2020-07-18 02:02:27.746572+02 │ 405184 │ 0.00 │
│ device │ VARCHAR │ 00:0f:00:70:91:0a │ b8:27:eb:bf:9d:51 │ 3 │ NULL │ NULL │ NULL │ NULL │ NULL │ 405184 │ 0.00 │
│ temp │ DOUBLE │ 0.0 │ 30.600000381469727 │ 287 │ 22.453987345644748 │ 2.698346951263289 │ 19.888040295855575 │ 22.243208182088022 │ 23.516330668892394 │ 405184 │ 0.00 │
│ temp_c │ DOUBLE │ -17.7778 │ -0.7778 │ 235 │ -5.303340377703994 │ 1.4990816816561081 │ -6.728858088985643 │ -5.417193168388082 │ -4.713166284527654 │ 405184 │ 0.00 │
│ humidity │ DOUBLE │ 1.1 │ 99.9 │ 626 │ 60.51169394645508 │ 11.366489374741377 │ 51.025250291852586 │ 54.94414996066462 │ 74.2764440390179 │ 405184 │ 0.00 │
│ lpg │ DOUBLE │ 0.002693478622661808 │ 0.016567377162503137 │ 8188 │ 0.007237125655057899 │ 0.001444115678768443 │ 0.006458547423805879 │ 0.007470350246685729 │ 0.00814041627569619 │ 405184 │ 0.00 │
│ smoke │ DOUBLE │ 0.006692096317386558 │ 0.04659011562630793 │ 6997 │ 0.019263611784848263 │ 0.0040861300601221845 │ 0.017032583919261386 │ 0.019897415643395536 │ 0.021809493412627078 │ 405184 │ 0.00 │
│ light │ BOOLEAN │ false │ true │ 2 │ NULL │ NULL │ NULL │ NULL │ NULL │ 405184 │ 0.00 │
└─────────────┴──────────────────────────┴───────────────────────────────┴───────────────────────────────┴───────────────┴───────────────────────────────┴───────────────────────┴───────────────────────────────┴───────────────────────────────┴───────────────────────────────┴────────┴─────────────────┘
Are there missing data in this dataset?
This dataset is quite clean; there is no missing data in any feature. Docs Reference
df_pl.null_count()
ts | timestamp | device | temp | temp_c | humidity | lpg | smoke | light |
---|---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
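Since every column reports zero nulls, nothing needs to be done here. If missing values had shown up, this is a minimal sketch of how they could be handled in polars; the strategies below are illustrative, not something this dataset requires.

# Illustrative only: this dataset has no nulls, so these calls change nothing
df_no_nulls = df_pl.drop_nulls()  # drop rows containing any null
df_filled = df_pl.with_columns(pl.col("humidity").fill_null(strategy="forward"))  # forward-fill one column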
How does the data look mathematically?
df_pl.describe()
statistic | ts | timestamp | device | temp | temp_c | humidity | lpg | smoke | light |
---|---|---|---|---|---|---|---|---|---|
str | f64 | str | str | f64 | f64 | f64 | f64 | f64 | f64 |
"count" | 405184.0 | "405184" | "405184" | 405184.0 | 405184.0 | 405184.0 | 405184.0 | 405184.0 | 405184.0 |
"null_count" | 0.0 | "0" | "0" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"mean" | 1.5949e9 | "2020-07-16 02:06:57.296824+02:… | null | 22.453987 | -5.30334 | 60.511694 | 0.007237 | 0.019264 | 0.277718 |
"std" | 199498.399262 | null | null | 2.698347 | 1.499082 | 11.366489 | 0.001444 | 0.004086 | null |
"min" | 1.5945e9 | "2020-07-12 02:01:34.385975+02:… | "00:0f:00:70:91:0a" | 0.0 | -17.7778 | 1.1 | 0.002693 | 0.006692 | 0.0 |
"25%" | 1.5947e9 | "2020-07-14 02:20:00.478589+02:… | null | 19.9 | -6.7222 | 51.0 | 0.006456 | 0.017024 | null |
"50%" | 1.5949e9 | "2020-07-16 02:06:28.940644+02:… | null | 22.2 | -5.4444 | 54.9 | 0.007489 | 0.01995 | null |
"75%" | 1.5950e9 | "2020-07-18 02:02:56.634962+02:… | null | 23.6 | -4.6667 | 74.3 | 0.00815 | 0.021838 | null |
"max" | 1.5952e9 | "2020-07-20 02:03:37.264312+02:… | "b8:27:eb:bf:9d:51" | 30.6 | -0.7778 | 99.9 | 0.016567 | 0.04659 | 1.0 |
Are there duplicate rows in this data?
Although the script below shows there are duplicates, polars and pandas work differently here, so it is important to check both. Polars' is_duplicated() flags every occurrence of a duplicated row, while pandas' duplicated() by default marks only the repeated occurrences (excluding the first). Hence polars shows the shape to be (26,8) while pandas returns 13 duplicated rows in this dataset.
dup_mask = df_pl.is_duplicated()
duplicates = df_pl.filter(dup_mask)  # Filter to show duplicated rows only

df_pd = df_pl.to_pandas().duplicated().sum()  # from pandas

df_pd        # from pandas
duplicates   # from polars
np.int64(13)
ts | timestamp | device | temp | temp_c | humidity | lpg | smoke | light |
---|---|---|---|---|---|---|---|---|
f64 | datetime[μs, Europe/Oslo] | str | f64 | f64 | f64 | f64 | f64 | bool |
1.5945e9 | 2020-07-12 10:32:27.272352 CEST | "1c:bf:ce:15:ec:4d" | 24.700001 | -4.0556 | 74.2 | 0.006644 | 0.017556 | true |
1.5945e9 | 2020-07-12 10:32:27.272352 CEST | "1c:bf:ce:15:ec:4d" | 24.700001 | -4.0556 | 74.2 | 0.006644 | 0.017556 | true |
1.5947e9 | 2020-07-13 17:02:04.678796 CEST | "1c:bf:ce:15:ec:4d" | 23.700001 | -4.6111 | 61.7 | 0.006916 | 0.018325 | true |
1.5947e9 | 2020-07-13 17:02:04.678796 CEST | "1c:bf:ce:15:ec:4d" | 23.700001 | -4.6111 | 61.7 | 0.006916 | 0.018325 | true |
1.5948e9 | 2020-07-15 00:14:33.272537 CEST | "b8:27:eb:bf:9d:51" | 22.6 | -5.2222 | 48.6 | 0.008071 | 0.02161 | false |
… | … | … | … | … | … | … | … | … |
1.5951e9 | 2020-07-18 16:50:31.409988 CEST | "b8:27:eb:bf:9d:51" | 21.5 | -5.8333 | 51.9 | 0.008912 | 0.024024 | false |
1.5951e9 | 2020-07-19 02:01:15.810328 CEST | "b8:27:eb:bf:9d:51" | 22.6 | -5.2222 | 52.1 | 0.008822 | 0.023765 | false |
1.5951e9 | 2020-07-19 02:01:15.810328 CEST | "b8:27:eb:bf:9d:51" | 22.6 | -5.2222 | 52.1 | 0.008822 | 0.023765 | false |
1.5951e9 | 2020-07-19 05:54:33.955044 CEST | "b8:27:eb:bf:9d:51" | 22.4 | -5.3333 | 50.8 | 0.008671 | 0.02333 | false |
1.5951e9 | 2020-07-19 05:54:33.955044 CEST | "b8:27:eb:bf:9d:51" | 22.4 | -5.3333 | 50.8 | 0.008671 | 0.02333 | false |
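The two counts can be reconciled: with keep=False pandas also flags every occurrence, and polars can drop the extra copies with unique(). A minimal sketch, assuming the default comparison across all columns:

# pandas: count every occurrence of a duplicated row, like polars' is_duplicated()
df_pl.to_pandas().duplicated(keep=False).sum()   # 26

# polars: keep only one copy of each duplicated row
df_dedup = df_pl.unique(keep="first")
df_dedup.shape   # 13 fewer rows than df_pl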
What is the correlation between the columns in this data?
We use pandas here, as the .corr() function in pandas provides a more readable table for inspecting correlations.
# Select only numeric columns for the correlation matrix
numeric_df = df_pl.select(pl.col(pl.Float64, pl.Int64))
numeric_df.to_pandas().corr(method='pearson')
numeric_df.to_pandas().corr(method='spearman')
numeric_df.to_pandas().corr(method='kendall')

The three tables below show the Pearson, Spearman and Kendall correlations, in that order.
ts | temp | temp_c | humidity | lpg | smoke | |
---|---|---|---|---|---|---|
ts | 1.000000 | 0.074443 | 0.074442 | 0.017752 | 0.014178 | 0.016349 |
temp | 0.074443 | 1.000000 | 1.000000 | -0.410427 | 0.136396 | 0.131891 |
temp_c | 0.074442 | 1.000000 | 1.000000 | -0.410427 | 0.136397 | 0.131891 |
humidity | 0.017752 | -0.410427 | -0.410427 | 1.000000 | -0.672113 | -0.669863 |
lpg | 0.014178 | 0.136396 | 0.136397 | -0.672113 | 1.000000 | 0.999916 |
smoke | 0.016349 | 0.131891 | 0.131891 | -0.669863 | 0.999916 | 1.000000 |
ts | temp | temp_c | humidity | lpg | smoke | |
---|---|---|---|---|---|---|
ts | 1.000000 | 0.055377 | 0.055408 | 0.051560 | 0.077576 | 0.077576 |
temp | 0.055377 | 1.000000 | 0.999994 | -0.334051 | 0.121469 | 0.121469 |
temp_c | 0.055408 | 0.999994 | 1.000000 | -0.334268 | 0.121609 | 0.121609 |
humidity | 0.051560 | -0.334051 | -0.334268 | 1.000000 | -0.764612 | -0.764612 |
lpg | 0.077576 | 0.121469 | 0.121609 | -0.764612 | 1.000000 | 1.000000 |
smoke | 0.077576 | 0.121469 | 0.121609 | -0.764612 | 1.000000 | 1.000000 |
ts | temp | temp_c | humidity | lpg | smoke | |
---|---|---|---|---|---|---|
ts | 1.000000 | 0.035450 | 0.035481 | 0.030211 | 0.046115 | 0.046115 |
temp | 0.035450 | 1.000000 | 0.999815 | -0.196422 | 0.037934 | 0.037934 |
temp_c | 0.035481 | 0.999815 | 1.000000 | -0.196596 | 0.038080 | 0.038080 |
humidity | 0.030211 | -0.196422 | -0.196596 | 1.000000 | -0.559969 | -0.559969 |
lpg | 0.046115 | 0.037934 | 0.038080 | -0.559969 | 1.000000 | 1.000000 |
smoke | 0.046115 | 0.037934 | 0.038080 | -0.559969 | 1.000000 | 1.000000 |
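To make the same matrix easier to scan visually, the pandas correlation table can also be passed straight to a seaborn heatmap. A minimal sketch using the Pearson matrix computed above:

# Visualize the Pearson correlation matrix as a heatmap
corr = numeric_df.to_pandas().corr(method='pearson')
sn.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()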
Exploratory data analysis (EDA)
mindmap
  root((EDA))
    **Univariate data analysis**<br>Independent analysis of individual columns of the dataset.
    **Bivariate data analysis**<br>Pairwise analysis of two columns of the dataset.
    **Multivariate data analysis**<br>Simultaneous analysis of two or more columns.
Univariate data analysis
Handling categorical data
Countplot
Count plots provide the unique values of the column and their frequency of occurrence. This gives an understanding of how the data is spread across the classes / categories of the column.
sn.countplot(df_pl.select(pl.col('device')), x='device')
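The same frequencies can be read off directly without a plot; a small sketch using polars' value_counts:

# Frequency table behind the count plot
df_pl['device'].value_counts()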
PieChart
A PieChart can be customized to show both the count of categories and their percentages in the dataset.
vc = df_pl.to_pandas()['device'].value_counts()
vc.plot(kind='pie', autopct=lambda pct: f'{int(round(pct * vc.sum() / 100))}\n({pct:.1f}%)')
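The autopct lambda above turns each wedge's percentage back into an absolute count, so every slice is labelled with both. A tiny worked example with a hypothetical wedge percentage:

# Hypothetical illustration of what the autopct lambda computes
total = 405_184                         # total rows in this dataset
pct = 45.0                              # hypothetical wedge percentage
count = int(round(pct * total / 100))   # back out the absolute count
print(f"{count}\n({pct:.1f}%)")         # the label text drawn on the slice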
Handling numerical data
Histogram
A histogram is a way to visualize a frequency table. The datapoints can be binned into n bins.
sn.histplot(df_pl, x='humidity', bins=60)
sn.histplot(df_pl, x='temp_c', bins=60)
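Under the hood a histogram is just a binned frequency table; a minimal sketch of that table itself, using numpy (imported here only for this illustration):

import numpy as np

# Frequency table behind the humidity histogram: 60 bins and their counts
counts, bin_edges = np.histogram(df_pl['humidity'].to_numpy(), bins=60)
counts[:5], bin_edges[:5]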
Distplot
Density plots show the distribution of the data values as a continuous line. A density plot is a smoothed version of a histogram and is calculated via a kernel density estimate (page 24, Practical Statistics for Data Scientists). The main difference between a histogram and a distplot is the y-axis scale: a displot uses a probability density scale, while a histogram uses the frequency / count of observations.
='humidity', kind="kde") sn.displot(df_pl, x
='temp_c', kind="kde") sn.displot(df_pl, x
But remember our data has three different sensors. It would be great to see the probability density function of each one separately. This, however, falls under bivariate data analysis, as we are using another column to make sense of the humidity or temp_c column.
sn.kdeplot(df_pl, x='humidity', hue='device')
Boxplot
Boxplots give a quick way to visualize the distribution of data and provide boundaries outside which observations are considered outliers. A boxplot also shows the 25th percentile, the median and the 75th percentile.
The visual helps to quickly notice outliers on both the lower and higher boundaries.
This is also called a 5 point summary visual.
sn.boxplot(df_pl, x='humidity')
sn.boxplot(df_pl, x='temp_c')
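The numbers behind the boxplot can also be computed directly. A minimal sketch of the 5 point summary and the common 1.5 × IQR fences (the rule matplotlib/seaborn use by default for the whiskers):

# 5 point summary and outlier fences for humidity
q1 = df_pl['humidity'].quantile(0.25)
q2 = df_pl['humidity'].quantile(0.50)
q3 = df_pl['humidity'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # observations below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # observations above this are flagged as outliers
print(df_pl['humidity'].min(), q1, q2, q3, df_pl['humidity'].max())
print(lower_fence, upper_fence)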
Bivariate data analysis
Scatterplot (Numerical to Numerical)
sn.scatterplot(df_pl, x='temp_c', y='humidity')
Barplot (Numerical to categorical)
sn.barplot(df_pl, x='temp_c', y='device')
Boxplot (Numerical to categorical)
sn.boxplot(df_pl, x='temp_c', y='device')
Multivariate data analysis
sn.scatterplot(df_pl, x='temp_c', y='humidity', hue='device')
# Convert Polars DataFrame to pandas for seaborn plotting
df = df_pl.to_pandas()

# Get unique devices (we want 3 scatterplots, one for each device)
devices = df['device'].unique()

# Create a figure with 1 row and 3 columns for side-by-side plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharex=True, sharey=True)

# Loop through devices and axes to plot each device's scatterplot
for ax, device in zip(axes, devices):
    subset = df[df['device'] == device]
    sn.scatterplot(data=subset, x='temp_c', y='humidity', ax=ax)
    ax.set_title(f'Device: {device}')
    ax.set_xlabel('Temperature (°C)')
    ax.set_ylabel('Humidity')

plt.tight_layout()
plt.show()
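Seaborn can also build the same per-device facet grid in a single call; a minimal sketch using relplot, where col splits the data into one panel per device:

# Same idea with one relplot call: one scatter panel per device
sn.relplot(data=df_pl.to_pandas(), x='temp_c', y='humidity', col='device', kind='scatter', height=5)
plt.show()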
Automated EDA using ydata-profiling (formerly pandas-profiling)
This is a quick way to perform EDA and study the data. However, it only works on a pandas DataFrame, so on very large datasets this approach may or may not be performant.
That said, it eliminates most of the hassle in understanding the input data.
It does lack some visualizations, which still need to be created separately to get a better idea of the data.
from ydata_profiling import ProfileReport
profile = ProfileReport(df_pl.to_pandas(), title="Data Profiling Report")
profile.to_notebook_iframe()
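The report can also be written out and shared outside the notebook; to_file is part of ydata-profiling's API, and the filename here is just an example:

# Persist the report as a standalone HTML file (example filename)
profile.to_file("iot_data_profiling_report.html")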