The location of the data depends on one's use case: some keep it locally, others in the cloud in a storage bucket or a database. There is always a way to get your data into your development environment; only the way we fetch it differs.
In these notes, I use the kagglehub module to get a time series dataset.
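Below is a minimal sketch of pulling a dataset with kagglehub; the dataset slug is illustrative (an assumption on my part), so substitute the identifier of the dataset you want.
import kagglehub

# Download the dataset to a local cache and return its path.
# The slug below is illustrative -- replace it with your dataset's identifier.
path = kagglehub.dataset_download("garystafford/environmental-sensor-data-132k")
print("Files downloaded to:", path)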
One needs to evaluate the size of the dataset against the resources available to process it. One way of limiting the size of the data is to use efficient file formats.
Data formatted as Comma Separated Values (CSV) is everywhere, but it is not the most lightweight or fastest format when it comes to reading from and writing to disk. So it is wise to convert large CSV files to formats that are faster and take less space on disk and in memory. One such format is Parquet.
We can always decrease the data size to make ingestion easier, so we use polars to convert the CSV file to Parquet.
%%time
import polars as pl

input_data_path = "../data/iot/iot_telemetry_data.parquet"
df = pl.scan_csv("../data/iot/iot_telemetry_data.csv")  # Lazily scan the CSV
df.sink_parquet(input_data_path)  # Stream the LazyFrame out as a Parquet file
CPU times: user 720 ms, sys: 83.5 ms, total: 803 ms
Wall time: 387 ms
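To see what the conversion saves, one can compare the two files on disk (a quick check; it assumes the cell above has already written the Parquet file):
import os

csv_size = os.path.getsize("../data/iot/iot_telemetry_data.csv")  # original CSV
parquet_size = os.path.getsize(input_data_path)                   # Parquet copy
print(f"CSV: {csv_size / 1e6:.1f} MB, Parquet: {parquet_size / 1e6:.1f} MB")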
For small datasets, pandas and polars will do fine. As the dataset size grows, we need to look for more efficient ways to read and process our data. In Python, DuckDB and PySpark are among the best-performing ETL libraries for large datasets.
That said, the output from DuckDB and PySpark is not directly compatible with visualization libraries or other third-party modules, for example pandas-profiling.
So a hybrid approach is required, where the transformations are made using DuckDB or PySpark, and the output is then converted to a polars or pandas dataframe. This lets us perform the ETL operations efficiently while remaining compatible with visualization libraries through the polars or pandas formats, as the sketch below shows.
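As a sketch of this hybrid pattern (the aggregation query is illustrative, not part of the original notes), DuckDB can run the heavy SQL and hand the much smaller result to pandas or polars:
import duckdb

# Aggregate in DuckDB, then convert the (now small) result for downstream tools.
rel = duckdb.sql(
    "SELECT device, AVG(temp) AS avg_temp "
    "FROM '../data/iot/iot_telemetry_data.parquet' "
    "GROUP BY device"
)
pandas_df = rel.df()  # pandas DataFrame, ready for plotting/profiling libraries
polars_df = rel.pl()  # polars DataFrame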
Using pandas
%%time
import pandas as pd

pd_df = pd.read_parquet(input_data_path)
pd_df.head(2)
CPU times: user 139 ms, sys: 35.5 ms, total: 175 ms
Wall time: 78.4 ms
   ts            device             co        humidity  light  lpg       motion  smoke     temp
0  1.594512e+09  b8:27:eb:bf:9d:51  0.004956      51.0  False  0.007651   False  0.020411  22.700000
1  1.594512e+09  00:0f:00:70:91:0a  0.002840      76.0  False  0.005114   False  0.013275  19.700001
Using Polars
%%time
import polars as pl

pl_df = pl.scan_parquet(input_data_path)
pl_df.head(2).collect()
CPU times: user 9.14 ms, sys: 1.99 ms, total: 11.1 ms
Wall time: 21.5 ms
shape: (2, 9)
┌──────────┬─────────────────────┬──────────┬──────────┬───────┬──────────┬────────┬──────────┬───────────┐
│ ts       ┆ device              ┆ co       ┆ humidity ┆ light ┆ lpg      ┆ motion ┆ smoke    ┆ temp      │
│ ---      ┆ ---                 ┆ ---      ┆ ---      ┆ ---   ┆ ---      ┆ ---    ┆ ---      ┆ ---       │
│ f64      ┆ str                 ┆ f64      ┆ f64      ┆ bool  ┆ f64      ┆ bool   ┆ f64      ┆ f64       │
╞══════════╪═════════════════════╪══════════╪══════════╪═══════╪══════════╪════════╪══════════╪═══════════╡
│ 1.5945e9 ┆ "b8:27:eb:bf:9d:51" ┆ 0.004956 ┆ 51.0     ┆ false ┆ 0.007651 ┆ false  ┆ 0.020411 ┆ 22.7      │
│ 1.5945e9 ┆ "00:0f:00:70:91:0a" ┆ 0.00284  ┆ 76.0     ┆ false ┆ 0.005114 ┆ false  ┆ 0.013275 ┆ 19.700001 │
└──────────┴─────────────────────┴──────────┴──────────┴───────┴──────────┴────────┴──────────┴───────────┘
Using DuckDB
%%time
import duckdb

result = duckdb.sql(f"SELECT * FROM '{input_data_path}'")
result.show()