I had contemplated using either the camera or the microphone on the Nicla Vision to do image or audio classification to monitor the health of a beehive. The challenge in either case is to build or obtain a labeled dataset in order to create an ML model for inferencing. Within the timeframe of this Design Challenge, I can't realistically acquire enough data with the Nicla Vision to build my own dataset. For my project I decided to use beehive sounds, so I set about searching for a suitable audio dataset.
ChatGPT
This seemed like a good opportunity to try out ChatGPT, so I submitted the query "find dataset of bee sounds to classify health". It returned results that looked very promising:
" There are several datasets available online that contain bee sounds that could be used to classify the health of bees. Some examples include:
- The "BeeSounds" dataset: This dataset, published by researchers at Aberystwyth University, contains recordings of honeybees and bumblebees, along with annotations indicating the species and the presence of disease. The dataset can be found here: https://data.mendeley.com/datasets/5ctjnf2c6k/1
- The "BeeHealth" dataset: This dataset, created by researchers at the University of Sussex, contains recordings of honeybees, bumblebees, and solitary bees, along with annotations indicating the health status of the bees. The dataset can be found here: https://data.mendeley.com/datasets/v5d5c5z5c9/1
- The "BeeSounds-2018" dataset: This dataset contains recordings of honeybees, bumblebees, and solitary bees, along with annotations indicating the species and the presence of disease. The dataset can be found here: https://data.mendeley.com/datasets/kxncwwcxzc/1
It is important to note that using bee sounds to classify the health of bees is a relatively new and challenging task, so it is likely that there are not many datasets available yet and you may need to create your own dataset as well."
Unfortunately, the links to data.mendeley.com do not work, and it is not clear to me whether that data is publicly accessible. At least ChatGPT acknowledged that it gave me a bad link:
" I apologize for the confusion. I apologize for the error in my previous response, the link https://data.mendeley.com/datasets/kxncwwcxzc/1 seems to be unavailable. It is possible that the dataset is no longer accessible, or that the link has been changed.
I suggest you to look for other datasets that might have bee sounds, and also check if they are labeled as healthy or unhealthy. Some other websites that might help you are:
You can also look for research papers that have used bee sound dataset and contact the authors to see if they can share the dataset with you.
Please let me know if there is anything else I can assist you with. "
I tried searching for the datasets using some of the keywords, but had no success. The dataset from the University of Sussex sounds like exactly what I need, so I tried ChatGPT again with a more specific query, and it responded with an even more promising link:
" The University of Sussex BeeHealth dataset is a collection of data related to bee health and behavior, including measurements of bee activity and behavior, as well as information on the health and survival of bee colonies.
To access the BeeHealth dataset, you can visit the University of Sussex's BeeLab website at the following URL: https://www.sussex.ac.uk/beehealth/data. On this page, you can find links to download the BeeHealth dataset, as well as information on how the data was collected and processed.
The BeeHealth dataset consists of two main components: the Hive Monitoring dataset and the Apiary Survey dataset. The Hive Monitoring dataset contains data on bee activity and behavior, as well as environmental conditions within the hive, collected using sensors placed within beehives. The Apiary Survey dataset contains information on the health and survival of bee colonies, as well as data on the characteristics of the bees and their environment, collected through surveys of beekeepers.
Both datasets are available for download in CSV format. The Hive Monitoring dataset contains data from 24 beehives over a period of several years, while the Apiary Survey dataset contains data from over 500 beekeepers across the UK. "
Unfortunately, that link did not work either. So, I've resorted to sending an email query to the Laboratory of Apiculture and Social Insects (LASI) at the University of Sussex. I hope that I get a response. ChatGPT seems to be able to dredge up a lot more info than a normal Google search. It's too bad that it seems to find a lot of broken links.
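A batch of suggested links like these can be checked with a short script before spending time on each one by hand. The following is a minimal sketch using only the Python standard library, not something that was used in this project. The function names `check_url` and `check_links` and the `opener` parameter are made up for illustration. A successful status code only shows that the server answered; it does not prove that the dataset described actually exists at that address.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived.

    `opener` is a hypothetical hook so the network call can be replaced
    (for example in tests); by default it is urllib.request.urlopen.
    """
    # HEAD asks for headers only, so no dataset is downloaded.
    request = urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": "link-check-sketch/0.1"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except OSError:
        # URLError, DNS failures, refused connections, and timeouts
        # are all OSError subclasses.
        return None


def check_links(urls, **kwargs):
    """Check several URLs and return a {url: status_or_None} dictionary."""
    return {url: check_url(url, **kwargs) for url in urls}


# Example call (performs real network requests when uncommented):
# results = check_links([
#     "https://data.mendeley.com/datasets/5ctjnf2c6k/1",
#     "https://www.sussex.ac.uk/beehealth/data",
# ])
# for url, status in results.items():
#     print(status, url)
```

Some servers reject HEAD requests, so an error status from this sketch is a reason to look more closely rather than a final verdict.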
Zenodo
I did find a dataset on Zenodo, "To bee or not to bee: An annotated dataset for beehive sound recognition", that I think I can use to build and deploy an ML model to the Nicla Vision. It is a simpler dataset: it separates the sound of bees in the hive from external (non-bee) sounds recorded in the hive. Not quite what I wanted, but hopefully I can use it to demonstrate building and deploying a model.
This data was taken from hives in the Open Source Beehive (OSBH) and NU-Hive projects.
Utah State University
I found a paper by Vladimir Kulyukin of Utah State University, "Audio, Image, Video, and Weather Datasets for Continuous Electronic Beehive Monitoring", which describes Raspberry Pi based monitoring systems collecting data from Utah beehives. Even though the paper was published in 2021, I still had problems with the data links, but I contacted the author and was able to acquire some of the audio datasets. These datasets are similar to the "To bee or not to bee" dataset in that they have only a few classes, in this case three: bee, cricket, and noise. The data is already segmented, which will make it easier to use with Edge Impulse, so I think I will start with one of these datasets to test the model build and deployment process.
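Before uploading segmented audio to a training tool, it helps to confirm how many clips and how much audio each class contains, since a badly unbalanced dataset can skew the model. The following is a minimal sketch of that check. It assumes a hypothetical layout with one sub-folder per class (for example `bee/`, `cricket/`, `noise/`), each holding uncompressed PCM WAV segments; the real layout of the datasets may differ, and the function names are made up for illustration.

```python
import wave
from pathlib import Path


def index_segments(root):
    """Map each class label (sub-folder name) to a sorted list of WAV paths.

    Assumes a hypothetical layout of root/<label>/<segment>.wav.
    """
    index = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        wavs = sorted(
            p for p in class_dir.iterdir() if p.suffix.lower() == ".wav"
        )
        if wavs:
            index[class_dir.name] = wavs
    return index


def wav_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def summarize(index):
    """Return {label: (clip_count, total_seconds)} for an index."""
    return {
        label: (len(files), sum(wav_seconds(f) for f in files))
        for label, files in index.items()
    }


def print_summary(root):
    """Print one line per class with clip count and total duration."""
    for label, (count, seconds) in summarize(index_segments(root)).items():
        print(f"{label:10s} {count:6d} clips {seconds:10.1f} s")
```

Calling `print_summary` with the path to an unpacked dataset folder prints one row per class. The standard `wave` module only reads uncompressed PCM files, so other audio formats would need a different reader.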
Here is the file structure for a dataset:
And the segment file labeling:
I'll describe the Edge Impulse process in a separate post. I received a UPS notification that a delivery from Newark is scheduled for February 21, so hopefully I'll have some hardware to try this on soon.