BEEhavior Investigation - Blog 1 - Dataset

26 Mar 2023

Unfortunately, my kits are currently held up at customs, and the cost of releasing them is almost as much as buying a new set. So I have decided to purchase the MKRWAN 13310 to continue with my project. While I await the new equipment, I have already begun collecting sensor data such as humidity and temperature, but I will write them up once I have the MKRWAN 13310. Despite this setback, I still hope to reach my goal of a smart beehive.

In this blog, I will talk about the dataset and an unsuccessful experiment. On Kaggle, there are several datasets related to bee health monitoring. I found "Smart Bee Colony Monitor: Clips of Beehive Sounds" interesting because it includes environmental data along with the audio collected from European honey bee hives in California. So I decided to test whether queen presence can be detected using only the environmental data. There are also other sound and image datasets.

This dataset includes 8 important features: hive temp, hive humidity, hive pressure, weather temp, weather humidity, weather pressure, wind speed and cloud coverage. There were more features, but some never change (such as rain) and some have too many blank fields. The data was collected from 4 hives, so my initial idea was to train and test an AI model on hive 1 and then apply the same network to the others.
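A quick way to spot columns like these, ones that never change or are mostly blank, is to count unique values and missing fractions per column. The sketch below is only an illustration: the helper `screen_columns`, the 50% threshold and the toy values are all assumptions made up for this example, and picking `gust speed` as the blank column is purely hypothetical.

```python
import numpy as np
import pandas as pd

def screen_columns(frame, max_missing=0.5):
    """Return (constant_cols, sparse_cols) for a DataFrame.

    Hypothetical helper: max_missing is an arbitrary illustrative threshold.
    """
    # nunique() ignores NaN, so a column holding one repeated value counts as constant
    constant_cols = [c for c in frame.columns if frame[c].nunique() <= 1]
    # Fraction of missing cells per column
    missing_fraction = frame.isna().mean()
    sparse_cols = [c for c in frame.columns if missing_fraction[c] > max_missing]
    return constant_cols, sparse_cols

# Toy frame with invented values, NOT taken from the Kaggle dataset
toy = pd.DataFrame({
    'hive temp': [34.1, 35.0, 33.8, 34.6, 34.9],
    'rain': [0, 0, 0, 0, 0],
    'gust speed': [np.nan, np.nan, 3.2, np.nan, 4.1],
})

constant_cols, sparse_cols = screen_columns(toy)
print("Constant columns:", constant_cols)
print("Mostly blank columns:", sparse_cols)
```

On the real file, the same helper could be called on the loaded dataframe to decide which columns to leave out of the feature list.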

# Mount Google Drive to access CSV file
from google.colab import drive
drive.mount('/content/drive')

# Import necessary libraries
import pandas as pd

# Load CSV file into pandas dataframe
csv_path = '/content/drive/MyDrive/all_data_updated.csv'
data = pd.read_csv(csv_path)

# Print the column headers
print(data.columns)



from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Select only hive 1 data
hive_1_data = data[data['hive number'] == 1]

# drop rows with empty values
hive_1_data = hive_1_data.dropna(how='any')

# Define features and target
features = ['hive temp', 'hive humidity', 'hive pressure', 'weather temp', 'weather humidity',  'weather pressure'] #, 'wind speed', 'gust speed', 'cloud coverage', 'rain', 'lat', 'long'
target = 'queen presence'

# Split data into train and test sets
train_data, test_data, train_labels, test_labels = train_test_split(hive_1_data[features], hive_1_data[target], test_size=0.2, random_state=42)

# Initialize MinMaxScaler (scales each feature to the [0, 1] range)
scaler = MinMaxScaler()

# Fit and transform the training data
train_data = scaler.fit_transform(train_data)

# Transform the test data using the same scaler
test_data = scaler.transform(test_data)

# Create SVM model
model = SVC(kernel='linear')

# Train the model on the training data
model.fit(train_data, train_labels)

# Predict queen presence for the test data
predictions = model.predict(test_data)

# Calculate accuracy of predictions
accuracy = accuracy_score(test_labels, predictions)
print("Accuracy: {:.2f}%".format(accuracy*100))

# Compute confusion matrix
conf_matrix = confusion_matrix(test_labels, predictions)
print("Confusion matrix:")
print(conf_matrix)

# Select only hive 3 data
hive_3_data = data[data['hive number'] == 3]

# drop rows with empty values
hive_3_data = hive_3_data.dropna(how='any')



# Use all of hive 3 as a held-out test set
hive_3_data_test = hive_3_data[features]
hive_3_data_label = hive_3_data[target]

# Scale hive 3 features with the scaler fitted on the hive 1 training data,
# because the model was trained on scaled inputs
hive_3_data_test = scaler.transform(hive_3_data_test)

# Predict queen presence for hive 3 using the model trained on hive 1
predictions = model.predict(hive_3_data_test)

# Calculate accuracy of predictions
accuracy = accuracy_score(hive_3_data_label, predictions)
print("Accuracy: {:.2f}%".format(accuracy*100))

# Compute confusion matrix
conf_matrix = confusion_matrix(hive_3_data_label, predictions)
print("Confusion matrix:")
print(conf_matrix)

If you upload the CSV file to your Google Drive account, you can run the code from this Colab link.
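The listing above prints accuracy together with a confusion matrix, and it is worth reading both. As a small illustration of why, the sketch below computes per-class recall and balanced accuracy in plain Python. The labels are hypothetical (1 = queen present, 0 = queen missing is an assumed encoding) and are not the actual test split; the "classifier" simply answers "present" every time.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def per_class_recall(true_labels, predicted_labels):
    """Return {class: recall} for every class that occurs in true_labels."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}

# Hypothetical labels: 11 samples with a queen, 1 without (assumed encoding)
true_labels = [1] * 11 + [0]
# A model that always predicts "queen present"
predicted_labels = [1] * 12

acc = accuracy(true_labels, predicted_labels)
recall = per_class_recall(true_labels, predicted_labels)
# Balanced accuracy is the mean of the per-class recalls
balanced_acc = sum(recall.values()) / len(recall)

print("Accuracy: {:.2f}%".format(acc * 100))
print("Recall per class:", recall)
print("Balanced accuracy: {:.2f}%".format(balanced_acc * 100))
```

With labels this lopsided, accuracy stays above 90% even though the recall for the missing-queen class is zero, which is why the off-diagonal entries of the confusion matrix matter more than the headline number.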

Although the accuracy was high (91.67% for hive 1 and 82.61% for hive 2 using the same network), the model missed every case where the queen is missing, which means it is not working. I tried two basic AI methods (Support Vector Machine and Random Forest Classifier), with and without normalisation (StandardScaler). I also changed the inputs, for example using only the hive data, but had no luck, and I have some explanations for that. Firstly, the data was unbalanced, with very few cases of a missing queen. Secondly, there was not enough data: 1275 rows in total, and some of those had empty fields and were dropped in the code. Finally, the location of the sensor matters. I may explain this in detail in the sensor blog, but in short, multiple sensors are required for a better understanding, and if there is only one sensor, it should be located where the baby bees are. There is no guarantee that I would get better results even if I tried other methods; this dataset may simply not work with sensor data alone. For these reasons, I will not do further tests on sensor data alone and will focus on the audio data for this dataset.
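For anyone who wants to push the sensor-only route a little further, here is a minimal sketch of one common way to handle unbalanced classes: a stratified split plus `class_weight='balanced'` in `SVC`. This is not something tested in this post. The data is synthetic (random numbers standing in for six sensor features), the 200/20 class split and the 1/0 label encoding are assumptions, and there is no claim that this would rescue the real dataset.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 "queen present" (1) and 20 "queen missing" (0) rows
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 6)),
    rng.normal(1.0, 1.0, size=(20, 6)),
])
y = np.array([1] * 200 + [0] * 20)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# class_weight='balanced' penalises mistakes on the rare class more heavily
model = SVC(kernel='linear', class_weight='balanced')
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Recall for the rare class: how many missing-queen rows were caught
missing_recall = recall_score(y_test, predictions, pos_label=0)
print("Missing-queen rows in test set:", int((y_test == 0).sum()))
print("Recall for missing queen: {:.2f}".format(missing_recall))
```

The number to watch here is the recall for the missing-queen class rather than the overall accuracy.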