The location of the data depends on one's use case: some keep it locally, others in the cloud in a storage bucket or a database. There is always a way to get your data into your development environment; only the way we fetch it differs.
In these notes, I use the kagglehub module to get a time series dataset.
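Below is a minimal sketch of pulling a dataset with kagglehub; the dataset slug is illustrative (an assumption on my part), so substitute the identifier of the dataset you want.
import kagglehub

# Download the dataset to a local cache and return its path.
# The slug below is illustrative -- replace it with your dataset's identifier.
path = kagglehub.dataset_download("garystafford/environmental-sensor-data-132k")
print("Files downloaded to:", path)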
One needs to evaluate the size of the dataset against the resources available to process it. One way of limiting the size of the data is to use efficient file formats.
Data formatted as Comma Separated Values (CSV) is everywhere, but it is not the most lightweight or fastest format when it comes to reading from and writing to disk. So it is wise to convert large CSV files to formats that are faster and take less space on disk and in memory. One such format is Parquet.
We can always decrease the data size to make ingestion easier, so we use polars to convert the CSV file to Parquet.
%%time
import polars as pl

input_data_path = "../data/iot/iot_telemetry_data.parquet"
df = pl.scan_csv("../data/iot/iot_telemetry_data.csv")  # Lazily scan the CSV
df.sink_parquet(input_data_path)  # Stream the LazyFrame out as a Parquet file
CPU times: user 720 ms, sys: 83.5 ms, total: 803 ms
Wall time: 387 ms
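To see what the conversion saves, one can compare the two files on disk (a quick check; it assumes the cell above has already written the Parquet file):
import os

csv_size = os.path.getsize("../data/iot/iot_telemetry_data.csv")  # original CSV
parquet_size = os.path.getsize(input_data_path)                   # Parquet copy
print(f"CSV: {csv_size / 1e6:.1f} MB, Parquet: {parquet_size / 1e6:.1f} MB")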
For small datasets, pandas and polars will do fine. As the dataset size grows, we need to look for more efficient ways to read and process our data. In Python, DuckDB and PySpark are among the best-performing ETL libraries for large datasets.
That said, the output from DuckDB and PySpark is not directly compatible with visualization libraries or other third-party modules, for example pandas-profiling.
So a hybrid approach is required, where the transformations are made using DuckDB or PySpark, and the output is then converted to a polars or pandas dataframe. This lets us perform the ETL operations efficiently while remaining compatible with visualization libraries through the polars or pandas formats, as the sketch below shows.
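As a sketch of this hybrid pattern (the aggregation query is illustrative, not part of the original notes), DuckDB can run the heavy SQL and hand the much smaller result to pandas or polars:
import duckdb

# Aggregate in DuckDB, then convert the (now small) result for downstream tools.
rel = duckdb.sql(
    "SELECT device, AVG(temp) AS avg_temp "
    "FROM '../data/iot/iot_telemetry_data.parquet' "
    "GROUP BY device"
)
pandas_df = rel.df()  # pandas DataFrame, ready for plotting/profiling libraries
polars_df = rel.pl()  # polars DataFrame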
Using pandas
%%time
import pandas as pd

pd_df = pd.read_parquet(input_data_path)
pd_df.head(2)
CPU times: user 139 ms, sys: 35.5 ms, total: 175 ms
Wall time: 78.4 ms
   ts            device             co        humidity  light  lpg       motion  smoke     temp
0  1.594512e+09  b8:27:eb:bf:9d:51  0.004956      51.0  False  0.007651   False  0.020411  22.700000
1  1.594512e+09  00:0f:00:70:91:0a  0.002840      76.0  False  0.005114   False  0.013275  19.700001
Using Polars
%%time
import polars as pl

pl_df = pl.scan_parquet(input_data_path)
pl_df.head(2).collect()
CPU times: user 9.14 ms, sys: 1.99 ms, total: 11.1 ms
Wall time: 21.5 ms
shape: (2, 9)
┌──────────┬─────────────────────┬──────────┬──────────┬───────┬──────────┬────────┬──────────┬───────────┐
│ ts       ┆ device              ┆ co       ┆ humidity ┆ light ┆ lpg      ┆ motion ┆ smoke    ┆ temp      │
│ ---      ┆ ---                 ┆ ---      ┆ ---      ┆ ---   ┆ ---      ┆ ---    ┆ ---      ┆ ---       │
│ f64      ┆ str                 ┆ f64      ┆ f64      ┆ bool  ┆ f64      ┆ bool   ┆ f64      ┆ f64       │
╞══════════╪═════════════════════╪══════════╪══════════╪═══════╪══════════╪════════╪══════════╪═══════════╡
│ 1.5945e9 ┆ "b8:27:eb:bf:9d:51" ┆ 0.004956 ┆ 51.0     ┆ false ┆ 0.007651 ┆ false  ┆ 0.020411 ┆ 22.7      │
│ 1.5945e9 ┆ "00:0f:00:70:91:0a" ┆ 0.00284  ┆ 76.0     ┆ false ┆ 0.005114 ┆ false  ┆ 0.013275 ┆ 19.700001 │
└──────────┴─────────────────────┴──────────┴──────────┴───────┴──────────┴────────┴──────────┴───────────┘
Using DuckDB
%%time
import duckdb

result = duckdb.sql(f"SELECT * FROM '{input_data_path}'")
result.show()