Introduction
In this blog I will be training linear regression models on the Zero 2W with datasets of different sizes and monitoring how long training takes.
I also have a Raspberry Pi 4B with 2 GB of RAM, and I will be comparing the Zero 2W against it. This comparison is not about which board is better, because the Pi 4B is without doubt the more powerful of the two. Instead, it is about finding the limits of the Zero 2W.
Dataset
I will be using a CSV dataset that contains average land temperatures over the years for different cities and countries. The dataset can be found here. It contains several CSV files:
- GlobalLandTemperatureByCity (~508 MB)
- GlobalLandTemperatureByCountry (~21.6 MB)
- GlobalLandTemperatureByMajorCity (~13.4 MB)
- GlobalLandTemperatureByState (~29.3 MB)
- GlobalTemperature (~201 KB)
I will be using these datasets of different sizes and recording the behavior of both boards (Pi 4B and Zero 2W).
But first, I need to clean these datasets to remove information I don't need and reduce their size. I will be using the 'pandas' Python package to read the CSV files and drop certain columns that are of no use to me.
I dropped the 'Latitude' and 'Longitude' columns, split the 'dt' (date) column into two columns, 'Year' and 'Month', and renamed the remaining columns.
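For the by-city file, the cleaning step looked roughly like this (a minimal sketch; the exact column handling may have differed slightly from what I ran):

import pandas as pd

# Load the raw file and drop the columns we don't need.
df = pd.read_csv("GlobalLandTemperatureByCity.csv")
df = df.drop(columns=["Latitude", "Longitude"])

# Split the 'dt' date column (e.g. "1849-01-01") into 'Year' and 'Month'.
dates = pd.to_datetime(df["dt"])
df["Year"] = dates.dt.year
df["Month"] = dates.dt.month

# Keep only the columns we need, in the new order, and save.
df = df[["Year", "Month", "AverageTemperature", "City", "Country"]]
df.to_csv("ByCity.csv", index=False)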
After modifying the datasets, the new sizes are:
- ByCity.csv (~246 MB) (Year, Month, AverageTemperature, City, Country)
- ByCountry.csv (~12.7 MB) (Year, Month, AverageTemperature, Country)
- ByMajorCity.csv (~6.59 MB) (Year, Month, AverageTemperature, City, Country)
- ByState.csv (~19.2 MB) (Year, Month, AverageTemperature, State, Country)
- GlobalAverage.csv (~42.1 KB) (Year, Month, AverageTemperature)
Installing Software
I will be using the scikit-learn package to run linear regression with single and multiple input variables on the datasets.
$ sudo apt update
$ sudo apt full-upgrade
$ sudo apt install python3-pip
$ pip install scikit-learn
$ pip install pandas
$ pip install numpy
Once everything installs successfully, we are set to move forward.
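A quick one-liner can confirm that the packages import correctly (the versions printed will of course depend on your system):

$ python3 -c "import sklearn, pandas, numpy; print(sklearn.__version__, pandas.__version__, numpy.__version__)"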
Training
I used the following code:
import time
import pandas as pd
from sklearn.linear_model import LinearRegression

# Train the same model on each dataset and time just the fit() call.
for name in ["GlobalAverage", "ByMajorCity", "ByState", "ByCity"]:
    df = pd.read_csv(f"data/{name}.csv")
    X = df[["Year", "Month"]]
    Y = df[["AverageTemperature"]]
    start = time.time()
    regressor = LinearRegression()
    regressor.fit(X, Y)
    end = time.time()
    print(f"{name}: ", end - start)
    # Free memory before loading the next, larger file; this matters
    # on the Zero 2W's 512 MB of RAM.
    del df, X, Y
I am using the 'time' package to record the training time. I ran this same code on both boards, i.e. the Pi 4B and the Pi Zero 2W. The results are as follows:
For Raspberry Pi 4B (2GB RAM):
For Raspberry Pi Zero 2W:
Now, an interesting thing happened while running the ByCity.csv (~246 MB) dataset: the terminal just said 'Killed'. I searched this message on Google and found that it appears when the system runs out of memory and the kernel kills the offending process. In other words, the dataset was too big for the Zero 2W to load through pandas.
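If you want to confirm that the kernel's OOM killer was responsible (an assumption on my part, but it is the usual cause of a bare 'Killed'), the kernel log should show it:

$ sudo dmesg | grep -i "out of memory"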
To find out whether this happened while reading the CSV file or during training, I tried to open the file on its own in a separate script using 'pandas.read_csv()'. It showed the same 'Killed' message, so the file was simply too large for the Zero 2W to load into memory with pandas.
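The isolated test was essentially just this (a sketch, not my exact script):

import pandas as pd

# Nothing but the load; the process was killed right here on the Zero 2W.
df = pd.read_csv("ByCity.csv")
print(df.shape)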
Then, I tried to open it with Python's built-in csv module, and the file opened in an instant, even though it wouldn't open with pandas. The difference is that the csv module streams the file row by row instead of loading the whole thing into memory at once. So we do have the option of using the built-in csv module instead of pandas, but pandas is far better at managing the data: csv can parse the file, but we need pandas to analyze and modify it.
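For reference, this is roughly what streaming the file with the csv module looks like:

import csv

# Each iteration reads one row as a small list of strings, so the
# whole ~246 MB file never has to sit in memory at once.
with open("ByCity.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # ['Year', 'Month', 'AverageTemperature', 'City', 'Country']
    row_count = sum(1 for row in reader)
    print(header, row_count)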
I then decided to cut the dataset down to find out up to what file size the board could load it into memory with pandas. To reduce the size of the dataset, I used the train_test_split() function from the sklearn module. The code is given below:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(r"ByCity.csv")
# Split the rows and keep only the train part; the test part is discarded.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
# Write to a new file so the original dataset is not overwritten.
df_train.to_csv("ByCity_small.csv", index=False)
This function is normally used to split a dataset into two parts: a train set and a test set. The 'test_size' parameter is the fraction of rows that goes into the test set, so by increasing it we shrink the train set that we keep. For example, with test_size = 0.6 the train split keeps only 40% of the rows, which would bring the ~246 MB file down to roughly 100 MB.
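Repeating this with a few different fractions gives a series of progressively smaller files. A sketch (the fractions and file names here are back-calculated from the target sizes, not the exact values I used, and the shrinking itself has to run on a machine that can load the full file, such as the Pi 4B):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(r"ByCity.csv")
# Keep roughly 40%, 37%, 27% and 23% of the ~246 MB original to get
# files of about 100 MB, 90 MB, 67 MB and 56 MB.
for test_size, name in [(0.60, "ByCity_100MB.csv"), (0.63, "ByCity_90MB.csv"),
                        (0.73, "ByCity_67MB.csv"), (0.77, "ByCity_56MB.csv")]:
    smaller, _ = train_test_split(df, test_size=test_size, random_state=0)
    smaller.to_csv(name, index=False)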
Using this function, I created a few datasets of decreasing size out of the main dataset and tried to load them on the board. I got the following results:
- Opened the ~100 MB dataset with 'pandas.read_csv()', but couldn't train a model on it with sklearn. It showed the same 'Killed' error.
- Opened and trained the ~90 MB dataset with single-input regression (input: Year). Couldn't run multi-input regression (inputs: Year and Month) on it. Same 'Killed' error.
- Opened and trained the ~67 MB dataset with multi-input regression (inputs: Year and Month), but only in a separate, standalone script. When I tried to train it in my main code along with the other datasets, it again gave the 'Killed' error.
- Opened and trained the ~56 MB dataset with multi-input regression (inputs: Year and Month). This time it also ran within the main program.
So at 56 MB I was finally able to train the model in all cases, although it took considerably longer than on the Pi 4B.
Conclusion
Based on my observations, the Raspberry Pi Zero 2W can be used for ML, but only with smaller datasets, up to about 50 MB. For small datasets there is not much difference in training time between the Pi 4B and the Pi Zero 2W, but the gap keeps widening as the dataset grows.
With datasets between 50 MB and 100 MB, the board may run out of memory; the same dataset might train fine in one program while getting 'Killed' in another, depending on how much memory is already in use.
Datasets larger than 100 MB will almost certainly exhaust the memory and get the process killed.