Data ingestion

Loading data using different python modules
datascience
fundamentals
Author

Jeevith Hegde

Published

August 9, 2025

Getting the data

The location of the data depends on ones use-case. Some have it locally, others on the cloud in a storage bucket or database. There is always a way to get your data to your development environment. The way we get it will differ.

In this notes, I use the kagglehub module to get a time series dataset.

import kagglehub
# Download latest version
path = kagglehub.dataset_download("garystafford/environmental-sensor-data-132k")
print("Path to dataset files:", path)

Decrease data size

One needs to evaluate the size of the dataset and the resources available to process the data. One way og limiting the size of the data is to use effecient file formats.

Data formatted as Comma Seperated Value (CSV) is everywhere, but it is not the most lighweight or fast format when it comes read/write from disks. So it is wise to convert large CSV files to formats which are faster and take lesser space on disk/memory. One such format is parquet.

We can always decrease the data size to make ingestion easier. So we use polars to convert it to paraquet format.

%%time
import polars as pl

input_data_path = f"../data/iot/iot_telemetry_data.parquet"
df = pl.scan_csv("../data/iot/iot_telemetry_data.csv")
df.sink_parquet(input_data_path)  # Saves the lazyframe as a parquet file
CPU times: user 720 ms, sys: 83.5 ms, total: 803 ms
Wall time: 387 ms

For small size datasets pandas and polars will do fine. As the dataset size increases, we need to look for effecient ways to read and prosess our data. In python, DuckDb and Pyspark are the best performing ETL libraries for large datasets.

That said, output from both DuckDb and Pyspark are not directly compatible with visualization libraries or other third party modules, for example, pandas-profiling.

So a hybrid approach is required, where the transformations are made using DuckDb or Pyspark, but the output is later converted to either polars or pandas dataframes. This allows us to efficiently perform ETL operations, but still be compatible with visualization libraries via polars or pandas formats.

Using pandas

%%time
import pandas as pd

pd_df =  pd.read_parquet(input_data_path)
pd_df.head(2)
CPU times: user 139 ms, sys: 35.5 ms, total: 175 ms
Wall time: 78.4 ms
ts device co humidity light lpg motion smoke temp
0 1.594512e+09 b8:27:eb:bf:9d:51 0.004956 51.0 False 0.007651 False 0.020411 22.700000
1 1.594512e+09 00:0f:00:70:91:0a 0.002840 76.0 False 0.005114 False 0.013275 19.700001

Using Polars

%%time
import polars as pl

pl_df =  pl.scan_parquet(input_data_path)
pl_df.head(2).collect()
CPU times: user 9.14 ms, sys: 1.99 ms, total: 11.1 ms
Wall time: 21.5 ms
shape: (2, 9)
ts device co humidity light lpg motion smoke temp
f64 str f64 f64 bool f64 bool f64 f64
1.5945e9 "b8:27:eb:bf:9d:51" 0.004956 51.0 false 0.007651 false 0.020411 22.7
1.5945e9 "00:0f:00:70:91:0a" 0.00284 76.0 false 0.005114 false 0.013275 19.700001

Using Duckdb

%%time
import duckdb 

result = duckdb.sql(f"SELECT * FROM '{input_data_path}'")
result.show()
┌────────────────────┬───────────────────┬───────────────────────┬───────────────────┬─────────┬───────────────────────┬─────────┬──────────────────────┬────────────────────┐
│         ts         │      device       │          co           │     humidity      │  light  │          lpg          │ motion  │        smoke         │        temp        │
│       double       │      varchar      │        double         │      double       │ boolean │        double         │ boolean │        double        │       double       │
├────────────────────┼───────────────────┼───────────────────────┼───────────────────┼─────────┼───────────────────────┼─────────┼──────────────────────┼────────────────────┤
│ 1594512094.3859746 │ b8:27:eb:bf:9d:51 │  0.004955938648391245 │              51.0 │ false   │   0.00765082227055719 │ false   │  0.02041127012241292 │               22.7 │
│ 1594512094.7355676 │ 00:0f:00:70:91:0a │ 0.0028400886071015706 │              76.0 │ false   │  0.005114383400977071 │ false   │ 0.013274836704851536 │ 19.700000762939453 │
│ 1594512098.0735729 │ b8:27:eb:bf:9d:51 │  0.004976012340421658 │              50.9 │ false   │  0.007673227406398091 │ false   │  0.02047512557617824 │               22.6 │
│  1594512099.589146 │ 1c:bf:ce:15:ec:4d │  0.004403026829699689 │ 76.80000305175781 │ true    │  0.007023337145877314 │ false   │ 0.018628225377018803 │               27.0 │
│  1594512101.761235 │ b8:27:eb:bf:9d:51 │  0.004967363641908952 │              50.9 │ false   │  0.007663577282372411 │ false   │ 0.020447620810233658 │               22.6 │
│ 1594512104.4684107 │ 1c:bf:ce:15:ec:4d │  0.004391003954583357 │  77.9000015258789 │ true    │  0.007009458543138704 │ false   │  0.01858890754005078 │               27.0 │
│ 1594512105.4488637 │ b8:27:eb:bf:9d:51 │  0.004976025118224167 │              50.9 │ false   │  0.007673241660297752 │ false   │ 0.020475166204362245 │               22.6 │
│  1594512106.869076 │ 00:0f:00:70:91:0a │ 0.0029381156266604295 │              76.0 │ false   │  0.005241481841731117 │ false   │ 0.013627521132019194 │ 19.700000762939453 │
│ 1594512108.2753816 │ 1c:bf:ce:15:ec:4d │  0.004345471359573249 │  77.9000015258789 │ true    │  0.006956802377235561 │ false   │  0.01843978190211682 │               27.0 │
│ 1594512109.1366868 │ b8:27:eb:bf:9d:51 │ 0.0049702557644185795 │              50.9 │ false   │ 0.0076668047981169295 │ false   │ 0.020456819607064126 │               22.6 │
│          ·         │         ·         │            ·          │                ·  │  ·      │           ·           │   ·     │           ·          │                 ·  │
│          ·         │         ·         │            ·          │                ·  │  ·      │           ·           │   ·     │           ·          │                 ·  │
│          ·         │         ·         │            ·          │                ·  │  ·      │           ·           │   ·     │           ·          │                 ·  │
│  1594528617.316101 │ 1c:bf:ce:15:ec:4d │  0.004046021057852555 │ 76.80000305175781 │ true    │ 0.0066065967976340415 │ false   │ 0.017449961409724143 │ 25.700000762939453 │
│ 1594528620.1956518 │ 00:0f:00:70:91:0a │  0.002656364232179544 │ 75.80000305175781 │ false   │  0.004872844318332629 │ false   │  0.01260624929176923 │ 19.399999618530273 │
│ 1594528620.5726128 │ b8:27:eb:bf:9d:51 │  0.004838697447060515 │              52.7 │ false   │  0.007519458225661741 │ false   │ 0.020037135448620128 │               22.1 │
│  1594528623.834102 │ 1c:bf:ce:15:ec:4d │   0.00405307240177605 │ 76.80000305175781 │ true    │  0.006614923978741507 │ false   │ 0.017473456575501277 │ 25.700000762939453 │
│ 1594528624.2579694 │ b8:27:eb:bf:9d:51 │  0.004859394667370261 │              52.6 │ false   │  0.007542712063759859 │ false   │ 0.020103331890812447 │               22.0 │
│   1594528626.16313 │ 00:0f:00:70:91:0a │  0.002612589347788125 │ 75.69999694824219 │ false   │  0.004814621044662395 │ false   │ 0.012445419108693902 │ 19.399999618530273 │
│ 1594528627.9447663 │ b8:27:eb:bf:9d:51 │ 0.0048428875642098305 │              52.6 │ false   │  0.007524168143299383 │ false   │  0.02005054199296001 │               22.0 │
│ 1594528631.8750868 │ b8:27:eb:bf:9d:51 │  0.004855276661403902 │              52.6 │ false   │  0.007538087568901512 │ false   │ 0.020090166282918154 │               22.0 │
│ 1594528631.9181976 │ 1c:bf:ce:15:ec:4d │   0.00405307240177605 │ 76.80000305175781 │ true    │  0.006614923978741507 │ false   │ 0.017473456575501277 │ 25.700000762939453 │
│  1594528635.561384 │ b8:27:eb:bf:9d:51 │     0.004833237960391 │              52.6 │ false   │  0.007513319775311182 │ false   │ 0.020019663516957425 │               22.0 │
├────────────────────┴───────────────────┴───────────────────────┴───────────────────┴─────────┴───────────────────────┴─────────┴──────────────────────┴────────────────────┤
│ ? rows (>9999 rows, 20 shown)                                                                                                                                    9 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

CPU times: user 160 ms, sys: 19.1 ms, total: 179 ms
Wall time: 197 ms

Using pyspark

%%time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("readParquet").getOrCreate()
df = spark.read.parquet(input_data_path)
df.show()
+--------------------+-----------------+--------------------+-----------------+-----+--------------------+------+--------------------+------------------+
|                  ts|           device|                  co|         humidity|light|                 lpg|motion|               smoke|              temp|
+--------------------+-----------------+--------------------+-----------------+-----+--------------------+------+--------------------+------------------+
|1.5945120943859746E9|b8:27:eb:bf:9d:51|0.004955938648391245|             51.0|false| 0.00765082227055719| false| 0.02041127012241292|              22.7|
|1.5945120947355676E9|00:0f:00:70:91:0a|0.002840088607101...|             76.0|false|0.005114383400977071| false|0.013274836704851536|19.700000762939453|
|1.5945120980735729E9|b8:27:eb:bf:9d:51|0.004976012340421658|             50.9|false|0.007673227406398091| false| 0.02047512557617824|              22.6|
| 1.594512099589146E9|1c:bf:ce:15:ec:4d|0.004403026829699689|76.80000305175781| true|0.007023337145877314| false|0.018628225377018803|              27.0|
| 1.594512101761235E9|b8:27:eb:bf:9d:51|0.004967363641908952|             50.9|false|0.007663577282372411| false|0.020447620810233658|              22.6|
|1.5945121044684107E9|1c:bf:ce:15:ec:4d|0.004391003954583357| 77.9000015258789| true|0.007009458543138704| false| 0.01858890754005078|              27.0|
|1.5945121054488637E9|b8:27:eb:bf:9d:51|0.004976025118224167|             50.9|false|0.007673241660297752| false|0.020475166204362245|              22.6|
| 1.594512106869076E9|00:0f:00:70:91:0a|0.002938115626660...|             76.0|false|0.005241481841731117| false|0.013627521132019194|19.700000762939453|
|1.5945121082753816E9|1c:bf:ce:15:ec:4d|0.004345471359573249| 77.9000015258789| true|0.006956802377235561| false| 0.01843978190211682|              27.0|
|1.5945121091366868E9|b8:27:eb:bf:9d:51|0.004970255764418...|             50.9|false|0.007666804798116...| false|0.020456819607064126|              22.6|
| 1.594512112798518E9|b8:27:eb:bf:9d:51|0.004960208655965963|             50.9|false|0.007655590313556344| false| 0.02042485815208522|              22.6|
|1.5945121152885423E9|1c:bf:ce:15:ec:4d| 0.00438304383734993|             78.0| true|0.007000264000767255| false|0.018562862485791535|              27.0|
|1.5945121164982603E9|b8:27:eb:bf:9d:51|0.004971644949355083|             50.9|false|0.007668354899155367| false|0.020461237669931027|              22.6|
|1.5945121190980136E9|1c:bf:ce:15:ec:4d|0.004451497630812575|             78.0| true|0.007079183500131396| false|0.018786490564423525|              27.0|
| 1.594512120184931E9|b8:27:eb:bf:9d:51|0.004964564518477901|             50.9|false|0.007660453055613286| false|0.020438716650667384|              22.6|
|1.5945121227857318E9|00:0f:00:70:91:0a|0.002905014756555...|75.80000305175781|false|0.005198697479294309| false|0.013508733329556249|19.700000762939453|
|1.5945121238726196E9|b8:27:eb:bf:9d:51|0.004975983419764024|             50.9|false|0.007673195144776...| false| 0.02047503362023219|              22.6|
|1.5945121275601885E9|b8:27:eb:bf:9d:51|0.004960208655965963|             50.9|false|0.007655590313556344| false| 0.02042485815208522|              22.6|
| 1.594512129374153E9|1c:bf:ce:15:ec:4d|0.004439322766059633| 77.9000015258789| true|0.007065171934738014| false| 0.01874677460984377|              27.0|
|1.5945121312478008E9|b8:27:eb:bf:9d:51|0.004956119201656337|             50.9|false|0.007651023905784...| false|0.020411844733363067|              22.6|
+--------------------+-----------------+--------------------+-----------------+-----+--------------------+------+--------------------+------------------+
only showing top 20 rows
CPU times: user 279 ms, sys: 36.5 ms, total: 315 ms
Wall time: 11.5 s