From the previous module, we know where the students are sitting, so we have a rough estimate of each student's hand location.
Challenges:
1. Students sitting at the front of the classroom appear bigger than the students at the back, so the size of the hands isn't uniform for everyone.
2. Some students are left-handed and some are right-handed, so we cannot assume anything here and need to look for both hands of every student.
3. Occlusion is a big problem for students sitting on the back benches. They might be completely occluded from the camera's view, in which case a raised hand would be quite tough to detect.
Assumptions:
1. A hand stays raised for more than a second (this requirement is tied directly to the FPS of the image acquisition).
2. The camera is well positioned to capture the whole class (in bigger classes there might be issues at the boundaries). This is important for applying some heuristics, which I'll explain later.
3. Students don't move during the class. Handling movement would be easy, but due to the limited scope of my project I'll use this assumption.
Approach:
1. Pre-processing Step (using heuristics):
This is again needed to make the whole system run fast and give robust results. Some heuristics:
i. remove faces at the ceiling/borders, as they are most likely false positives
ii. the aspect ratio of faces is approximately fixed, so use it to filter out other false positives.
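The two heuristics above can be sketched as a simple filter over face boxes. This is a minimal illustration, not the project's actual code: the function name filter_faces, the (x, y, w, h) box format and every threshold value are assumptions made for the example.

```python
def filter_faces(faces, frame_w, frame_h, margin=0.05, top_band=0.15,
                 ratio_range=(0.7, 1.3)):
    """Keep face boxes (x, y, w, h) that pass both heuristics.

    All thresholds are illustrative assumptions, not values from the project.
    """
    kept = []
    for (x, y, w, h) in faces:
        # Heuristic i: drop boxes close to the left/right/bottom borders or
        # lying in the top band of the frame, where the ceiling usually is.
        near_border = (x < margin * frame_w or
                       x + w > (1 - margin) * frame_w or
                       y + h > (1 - margin) * frame_h)
        in_ceiling = y < top_band * frame_h
        if near_border or in_ceiling:
            continue
        # Heuristic ii: real faces have a roughly fixed width/height ratio.
        ratio = w / h
        if not (ratio_range[0] <= ratio <= ratio_range[1]):
            continue
        kept.append((x, y, w, h))
    return kept
```

The thresholds would need tuning for a given camera position, which is why assumption 2 above matters for this step.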
2. Background Subtraction Stage:
Detecting an event like a raised hand is easy in our scenario, as we grab a frame every second. Hence, if we just do a local background subtraction (near the previously found faces), we know where the pixel intensities have changed. The only thing left to figure out is whether it was a hand or not!
Diff_image = Current_frame - Previous_frame
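A minimal sketch of this local difference is given below. It assumes grayscale frames stored as lists of rows of integers and a hypothetical box (x, y, w, h) around a face. Plain Python lists are used only to keep the example dependency-free; a real implementation would more likely use an array library.

```python
def local_diff(current, previous, box):
    """Signed difference of two grayscale frames inside box = (x, y, w, h).

    Pixels outside the box stay at zero, so only the neighbourhood of a
    previously found face is examined.
    """
    height, width = len(current), len(current[0])
    x, y, w, h = box
    diff = [[0] * width for _ in range(height)]
    # Clip the box to the frame so partially visible regions still work.
    for r in range(max(0, y), min(height, y + h)):
        for c in range(max(0, x), min(width, x + w)):
            diff[r][c] = current[r][c] - previous[r][c]
    return diff
```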
3. Post-processing Stage:
The difference image isn't smooth, for the following reasons:
i. lighting changes during image capture create a non-zero value at each pixel
ii. small movements of a student's head or body create a silhouette in the difference image
These effects must be handled to get a proper detection of a raised hand. To remove small pixel values and small clusters/blobs, I used simple morphological processing: erosion of the image frames with a 'disk' as the structuring element (not going into details here). This step also helps remove small movements of the students' bodies. I also used a heuristic based on the face location and the hand's relative position, which lets me avoid processing other background pixels, so I don't care about the values in those areas.
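To make the erosion step concrete, here is a small sketch of grayscale erosion with a disk-shaped structuring element. The function names, the default radius and the border handling are assumptions of this example, not details taken from the project.

```python
def disk_offsets(radius):
    """Offsets (dr, dc) of a disk-shaped structuring element."""
    return [(dr, dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if dr * dr + dc * dc <= radius * radius]


def erode(image, radius=1):
    """Grayscale erosion: each pixel becomes the minimum over the disk.

    Neighbours that fall outside the image are skipped (an assumption of
    this sketch about border handling).
    """
    height, width = len(image), len(image[0])
    offsets = disk_offsets(radius)
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            out[r][c] = min(image[r + dr][c + dc]
                            for dr, dc in offsets
                            if 0 <= r + dr < height and 0 <= c + dc < width)
    return out
```

An isolated bright pixel disappears after erosion, while the interior of a larger blob (such as a raised arm) survives, which is the behaviour the post-processing stage relies on.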
4. Hand Detection:
This part is the core of the whole system.
The main logic can again be divided into the following parts:
1. Use the size of the face to locate an approximate region of interest for each hand (I'll provide details later).
2. Check the values of the post-processed difference image pixels in those regions. If the sum is high in the LHand or RHand region, then the student has raised his/her hand.
3. To avoid double counting hands in overlapping regions, I set the difference image pixels to zero after the sum is calculated for a region.
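The three steps above can be sketched as follows. The geometry of the hand regions (a square beside the face, sized from the face width) is purely an assumption of this example, as are the function names, the scale factor and the threshold. "Left" and "right" refer to the sides of the image, not of the student.

```python
def hand_regions(face, scale=1.5):
    """Return hypothetical left/right hand ROIs (x, y, w, h) for a face box.

    Each ROI is a square of side scale * face width, placed beside the face
    and reaching above it. This geometry is an assumption of the sketch.
    """
    x, y, w, h = face
    side = int(scale * w)
    top = y - side // 2
    left = (x - side, top, side, side)
    right = (x + w, top, side, side)
    return left, right


def region_sum_and_clear(diff, box):
    """Sum the pixels of diff inside box, then zero them (step 3)."""
    height, width = len(diff), len(diff[0])
    x, y, w, h = box
    total = 0
    for r in range(max(0, y), min(height, y + h)):
        for c in range(max(0, x), min(width, x + w)):
            total += diff[r][c]
            diff[r][c] = 0  # cleared so an overlapping ROI cannot recount it
    return total


def raised_hands(diff, faces, threshold):
    """Return one (left_raised, right_raised) pair per face (step 2)."""
    results = []
    for face in faces:
        left, right = hand_regions(face)
        results.append((region_sum_and_clear(diff, left) > threshold,
                        region_sum_and_clear(diff, right) > threshold))
    return results
```

Because the ROI scales with the face width, students at the back get smaller regions than students at the front. A fixed threshold is used here for brevity; scaling it with the ROI area would be a natural refinement.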
Here is a snippet of the approach
5. Interesting scenario:
The background-subtraction logic for hand detection breaks if I use the absolute difference of the images, as it has high values both at the instant the hand is raised and at the instant it is dropped. Initially, I didn't think about this scenario and was getting double detections. Later, I changed the logic to compute the signed image difference and ignore the negative pixels. This simple hack solves the problem at minimal cost/analysis.
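The effect can be seen on a toy example. The sketch below assumes the hand is brighter than the background it covers, so raising it produces positive differences and dropping it produces negative ones. The function names and pixel values are made up for illustration.

```python
def abs_diff_sum(current, previous):
    """Sum of absolute per-pixel differences (the flawed variant)."""
    return sum(abs(c - p)
               for row_c, row_p in zip(current, previous)
               for c, p in zip(row_c, row_p))


def positive_diff_sum(current, previous):
    """Sum of signed differences with negative pixels ignored."""
    return sum(max(c - p, 0)
               for row_c, row_p in zip(current, previous)
               for c, p in zip(row_c, row_p))


background = [[20, 20], [20, 20]]
hand_up = [[20, 180], [20, 180]]  # bright hand enters the right column

# Three consecutive frames: hand down, hand up, hand down again.
frames = [background, hand_up, background]
abs_scores = [abs_diff_sum(b, a) for a, b in zip(frames, frames[1:])]
pos_scores = [positive_diff_sum(b, a) for a, b in zip(frames, frames[1:])]
```

With the absolute difference, both the raise and the drop score high, giving two detections; with the positive part only, the drop scores zero.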