Blog #12: Time-driven Swipe Gesture Detection Library

2 Jan 2023

Welcome to my 12th blog post in the Experimenting with Gesture Sensors competition. The previous 11 blog posts described my first steps with the gesture sensor and my path to completing my first project, which I showed in the 9th blog post. I am now in the last phase, and this and the next blog post describe my own gesture detection algorithm, which I created for my second gesture-controlled game. If you missed some of my blogs, feel free to read them now. Here is a list of links to all previous blog posts:

MAX25405 Operation

To explain the goal of this and the next blog post, it is good to repeat how the MAX25405 sensor works. The MAX25405 is an optical sensor. It has an array of 60 pixels which are sensitive to infrared light. Infrared light is invisible to people, but it still behaves like light, and sensors and cameras are sensitive to it too. The evaluation kit for this sensor, which I use as part of the competition, has 4 onboard infrared LEDs which emit IR light. If you place your hand in front of the LEDs and the sensor, some of the light emitted by the LEDs reflects from your hand back to the sensor, and the sensor detects it. If you move your hand away, less light arrives back at the sensor. The MAX25405 has 60 of these sensing pixels arranged in a matrix, so when you move your hand, the measured values are highest at the pixels under your hand. You can see this behaviour in the following video (the same video was in my blog #6, which contains more details about it):

The difference between the MAX25405 and its competitors is that some competitors have an integrated engine which processes the data from the sensor and returns gesture information on the output. The MAX25405 has almost no such engine and instead gives you the raw data which you saw in the video above.

As you have seen, the data from the sensor can give you the location of the hand. Converting these 60 values into the information that the hand moved from left to right is not trivial, and this and the next blog post are exactly about that: converting 60 values collected over some time period into meaningful information saying that these values correspond to a hand moving from left to right (or in another direction, of course).

I implemented two algorithms. Some logic is very similar in both, but at the higher layer they differ. First I developed the algorithm which I will describe in this blog post. It took me several days to develop and tweak this first gesture detection algorithm. But while I was tuning the algorithm, I was also tuning my hand to make gestures exactly as the algorithm expected. Later, when I deployed the first algorithm in the application and started testing it, I realized that it was really bad and that its error rate was very high. I spent several more days trying to fix it by changing some constants, modifying some minor logic and so on, but eventually I realized that it did not make sense to spend more time on it. Instead I used all the skills I had gained and designed a second algorithm differently. The second algorithm was much better, and I used it in my second project, which I will describe sometime later. In this blog I will describe my first algorithm and in the next blog I will describe my second algorithm.

Inspiration

I wrote my algorithm completely from scratch, but many ideas are inspired by the GUI software for the evaluation kit and its parameters. Some ideas also come from the Maxim Firmware Framework, an incomplete project which Maxim offers for download on the MAX25405 page. This project contains some preprocessing logic, but it is missing the code required for proper gesture detection.

Steps

Generally, my algorithm is based on two stages:

Data pre-processing
Gesture classification

In the following sections I will describe all steps of both stages as I implemented them in my first algorithm.

Preprocessing step #1: Center of Mass

The first step in almost every MAX25405 gesture detection algorithm which I have seen is converting the 60 pixel values to a single coordinate. The algorithm for doing this is referred to as center of mass. It computes floating-point coordinates x and y of the point around which the measured values are concentrated. The idea is that you multiply each pixel value by its coordinate, sum these products, and then divide by the sum of the pixel values (not multiplied by the coordinate). You compute this independently for both coordinates x and y.
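Written as a formula, the computation described above looks as follows. The symbols are chosen here only for illustration: p_i is the value of pixel i, and x_i and y_i are its column and row index in the pixel matrix.

```latex
x_c = \frac{\sum_{i=1}^{60} x_i \, p_i}{\sum_{i=1}^{60} p_i},
\qquad
y_c = \frac{\sum_{i=1}^{60} y_i \, p_i}{\sum_{i=1}^{60} p_i}
```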

This algorithm was also implemented by Maxim Integrated in their Firmware Framework package:

I implemented this algorithm with some minor tweaks, such as mitigating the negative values which the MAX25405 produces. In the following video you can see the pixel values with their intensity highlighted in green, and the red dot is the computed center of mass:

Caveats of the center of mass implementation

I think the algorithm is not optimal. You may have noticed in the video that when the real center of mass is near an edge, the red dot does not reflect its position and is biased towards the center of the screen. This is visible, for example, in the following situation:

The reason for this behaviour is that all pixels contribute to the sum. The value 46 in the top right corner also takes part in the computation, and because it is non-zero, it moves the center of mass to the right. Because there are a lot of pixels with non-zero values on the right side, together they move the X location to the right. The same applies on the other side: if you place your hand on the right side, the small but non-zero pixels on the left pull the result towards the center. And finally, the same applies to the Y coordinate. In fact, the issue is even more significant for the Y coordinate because there are fewer pixels in the Y direction (6 versus 10 in the horizontal direction).

For this reason, I implemented an improved center of mass computation in my algorithm. I added a threshold and interpreted all values lower than the threshold as zero.
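The thresholded center of mass can be sketched in Python as follows. This is only an illustration and not the code from the project: the function name, the 10 x 6 frame layout as a list of rows, and the threshold value of 100 are all assumptions. The example frame shows the bias described above: a hand over the right-most column with a weak background everywhere else.

```python
from typing import Optional, Sequence, Tuple

COLS, ROWS = 10, 6  # 60 pixels: 10 horizontal, 6 vertical


def center_of_mass(frame: Sequence[Sequence[float]],
                   threshold: float = 0.0) -> Optional[Tuple[float, float]]:
    """Return (x, y) of the weighted center, or None if nothing passes the threshold."""
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value < threshold:  # values below the threshold count as zero
                continue
            total += value
            sum_x += x * value
            sum_y += y * value
    if total <= 0.0:  # empty scene, avoid division by zero
        return None
    return sum_x / total, sum_y / total


# Hand over the right-most column (x = 9), weak background elsewhere.
frame = [[40.0] * (COLS - 1) + [1000.0] for _ in range(ROWS)]
plain = center_of_mass(frame)                    # pulled towards the center
cleaned = center_of_mass(frame, threshold=100.0)  # background ignored
print(plain, cleaned)
```

Without the threshold the weak background pixels pull the X coordinate away from the edge, while with the threshold the result stays on the column where the hand really is.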

Preprocessing step #2: Digital Filtering

As the next step I decided to implement digital filtering. If you remember my third blog, in which I described Maxim's evaluation software, you may recall that there were two options:

The low-pass filter removes high frequencies, which in the case of MAX25405 data are generally noise. The high-pass filter is more interesting. This filter removes bias: it subtracts the static offset and highlights only fast and large changes. And this is what we need. We want to see moving objects in the scene, not static ones. Static objects in the scene are, for example, the user's t-shirt or, in my case, my hand holding the phone while I was recording my project. Indeed, if you place a camera in the scene, it disappears from the data after a few seconds because of this filter. All such static objects can be filtered out using the high-pass filter, and gesture detection can work even when they are in the scene.

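The GUI software gives no details about the filter implementation or the meaning of its coefficients. I implemented both filters as simple first-order filters: the current sample is weighted by the coefficient and the previous result is weighted by one minus the coefficient. (I call them FIR filters, but because the previous result is fed back, this structure is strictly speaking a first-order IIR filter.) I apply the low-pass filter first and then pass its output to the high-pass filter.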
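A minimal Python sketch of such a filter chain for a single pixel is shown below. The low-pass update follows the weighting described above. The post does not spell out the exact form of the high-pass filter, so the version here (subtracting a slowly tracked baseline from the input) is an assumption, as are the class names, the coefficient values, and the choice to initialise the state with the first sample.

```python
from typing import Optional


class LowPass:
    """y[n] = alpha * x[n] + (1 - alpha) * y[n-1]"""

    def __init__(self, alpha: float) -> None:
        self.alpha = alpha
        self.state: Optional[float] = None  # no previous result yet

    def update(self, sample: float) -> float:
        if self.state is None:
            self.state = sample  # assumed start-up behaviour
        else:
            self.state = self.alpha * sample + (1.0 - self.alpha) * self.state
        return self.state


class HighPass:
    """Assumed form: input minus a slowly tracked baseline."""

    def __init__(self, alpha: float) -> None:
        self.baseline = LowPass(alpha)

    def update(self, sample: float) -> float:
        return sample - self.baseline.update(sample)


class PixelFilter:
    """Low-pass first, then high-pass, for one pixel."""

    def __init__(self, lp_alpha: float = 0.5, hp_alpha: float = 0.05) -> None:
        self.lp = LowPass(lp_alpha)
        self.hp = HighPass(hp_alpha)

    def update(self, raw: float) -> float:
        return self.hp.update(self.lp.update(raw))


pixel = PixelFilter()
static = [pixel.update(-1200.0) for _ in range(200)]  # static offset only
moving = pixel.update(-200.0)                          # hand suddenly appears
print(static[-1], moving)
```

In a full frame there would be one such filter instance per pixel, so 60 of them. A constant offset settles at zero on the output, while a sudden change produces a clearly non-zero value.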

After filtering, the input changes dramatically, but the center of mass computation still works well on the filtered data. It is no longer affected by the static objects hidden by the filter, and it is most probably less affected by noise. The filters also remove bias, so thresholding for detecting the presence of a moving object is now much easier, because the negative values ranging from -1800 to -400 all become values near zero.

Preprocessing step #3: Memory

The algorithm is naturally time-dependent. You cannot detect any gesture from a single picture. You need to collect at least two frames before you can detect something. I did not limit myself to two frames and kept a history of 50 frames instead. In fact, I did not store raw frames but rather preprocessed data. For each of the last 50 frames I stored:

X and Y coordinates of center of mass
Maximum value of the preprocessed pixels. This means the largest pixel intensity in the frame, but note that the pixel values were already processed by the digital filters. Taking the high-pass filter into account, the maximum value means something like the maximum change of a value.
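A fixed-length history like this can be sketched in Python with `collections.deque`, which drops the oldest entry automatically once `maxlen` is reached. The `Sample` type and the `remember` helper are hypothetical names used only for this illustration.

```python
from collections import deque
from typing import Deque, NamedTuple


class Sample(NamedTuple):
    x: float     # center of mass, horizontal coordinate
    y: float     # center of mass, vertical coordinate
    peak: float  # maximum filtered pixel value in the frame


HISTORY_LEN = 50
history: Deque[Sample] = deque(maxlen=HISTORY_LEN)


def remember(x: float, y: float, peak: float) -> None:
    """Store the preprocessed result of one frame, discarding the oldest one."""
    history.append(Sample(x, y, peak))


for frame_index in range(60):  # feed more frames than the buffer can hold
    remember(float(frame_index), 2.5, 300.0)
print(len(history), history[0].x, history[-1].x)
```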

That is all for the preprocessing stage. In the next stage I analysed the preprocessed data and determined the gesture.

Classification step #1: Direction score

To detect a gesture from the preprocessed data, I implemented an algorithm which computes a score for each possible direction (up, down, left, right). First I computed the direction of each step between two consecutive frames by taking the difference between the coordinates of the two consecutive center of mass points. Based on these differences I selected one direction for the step, and I also knew the distance travelled by the hand (in pixels, of course).

Later I added basic filtering here. If the movement is not continuous, I ignore the score for that step. This happens, for example, when I see three consecutive steps going up, then right, then up. In this situation the algorithm is not sure whether it should increment the score of the up or the right direction, so it ignores these samples.

The score of a step is computed as the squared distance multiplied by the measured intensity at both points. A longer distance gets a higher score, and similarly, more intense points get a higher score. Conversely, points with lower values get a lower score. Such points are usually collected at the beginning and end of a gesture, when the user is moving the hand into and out of the field of view. In practical experiments they were the most frequent cause of noise, so I decided to penalize them through this score.

After the computation, the direction with the highest score is selected. A final check handles the case of two directions with very similar scores. In this case the algorithm does not provide any gesture information because it is not sure.
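The scoring described in this section can be sketched in Python as follows. This is an interpretation, not the project code. The continuity rule (a step counts only when the previous step had the same direction), the ambiguity ratio of 0.7, clamping negative intensities to zero, and the assumption that y grows downwards are all choices made for this illustration.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, float, float]  # (x, y, peak intensity)

AMBIGUITY_RATIO = 0.7  # assumed: a runner-up this close to the winner means "not sure"


def step_direction(dx: float, dy: float) -> Optional[str]:
    """Pick the dominant direction of one step between two frames."""
    if dx == 0 and dy == 0:
        return None
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"  # assumes y grows downwards


def classify(history: Iterable[Sample]) -> Optional[str]:
    samples = list(history)  # also accepts a deque
    scores = {"up": 0.0, "down": 0.0, "left": 0.0, "right": 0.0}
    previous = None
    for (x0, y0, p0), (x1, y1, p1) in zip(samples, samples[1:]):
        dx, dy = x1 - x0, y1 - y0
        direction = step_direction(dx, dy)
        # Continuity rule: ignore steps that change direction.
        if direction is not None and direction == previous:
            weight = max(p0, 0.0) * max(p1, 0.0)  # weak points score low
            scores[direction] += (dx * dx + dy * dy) * weight
        previous = direction
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score <= 0.0 or second_score >= AMBIGUITY_RATIO * best_score:
        return None  # nothing detected, or two directions too similar
    return best


swipe = [(float(x), 2.5, 500.0) for x in range(10)]  # steady left-to-right move
zigzag = [(5.0, 5.0, 500.0), (5.0, 4.0, 500.0), (6.0, 4.0, 500.0),
          (6.0, 3.0, 500.0), (7.0, 3.0, 500.0)]       # up, right, up, right
print(classify(swipe), classify(zigzag))
```

The steady swipe is classified as a right swipe, while the zigzag yields no gesture because every step changes direction and is ignored.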

Classification step #2: Output glitch filtering

The output of this algorithm is not used directly by the application because it is affected by a lot of glitches and spurious gestures. They happen, for example, when the user is beginning a gesture and moves their hand into the field of view. In this phase the algorithm usually returns up or down for a short period of time, depending on the direction from which the hand enters the field of view.

This output filter has additional memory and reports a gesture if and only if there is a sufficient number of consecutive detections. I compensated for the difference between the vertical and horizontal number of pixels: for detecting gestures from left to right and vice versa I require more samples (because such a swipe can be up to 10 pixels long) than for gestures between the top and bottom side (which cannot be longer than 6 pixels).

After detecting a gesture, I block detection for a short period of time. This prevents glitches at the end of the gesture, which occur for a similar reason as those at the beginning. Frequently the glitch was the reverse gesture: when the user moves the hand out of the field of view, the movement is often detected as another, misleading gesture.
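The output filter described above can be sketched in Python like this. The class name, the required counts (8 for horizontal and 5 for vertical gestures) and the hold-off length of 20 frames are assumed values for illustration only and are not taken from the project.

```python
from typing import Optional

# Assumed values: horizontal swipes span up to 10 pixels, vertical only 6.
REQUIRED = {"left": 8, "right": 8, "up": 5, "down": 5}
HOLDOFF_FRAMES = 20  # assumed block-out length after a reported gesture


class OutputFilter:
    def __init__(self) -> None:
        self.candidate: Optional[str] = None
        self.count = 0
        self.holdoff = 0

    def update(self, detection: Optional[str]) -> Optional[str]:
        """Feed one per-frame detection, get a confirmed gesture or None."""
        if self.holdoff > 0:  # detection is blocked after a gesture
            self.holdoff -= 1
            self.candidate, self.count = None, 0
            return None
        if detection is None or detection != self.candidate:
            # Streak broken: start counting again from this detection.
            self.candidate = detection
            self.count = 1 if detection is not None else 0
            return None
        self.count += 1
        if self.count >= REQUIRED[detection]:
            self.candidate, self.count = None, 0
            self.holdoff = HOLDOFF_FRAMES
            return detection
        return None


output_filter = OutputFilter()
# A short "up" glitch while the hand enters, then a real right swipe.
stream = ["up"] * 3 + ["right"] * 8
results = [output_filter.update(d) for d in stream]
print(results[-1])
```

The three "up" detections never reach the required count, so only the right swipe is reported, and any detections during the following hold-off frames are ignored.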

Testing

For testing I used the MAX78000 MCU and a Bridgetek BT817 display controller for rendering the pixels and the algorithm inputs on the screen. In the following video you can see the outputs of my algorithm:

Integrating into the game

After developing this algorithm, I tried to integrate it into the Pacman game, which I had ported to the MAX78000 MCU from my previous project in the meantime.

But the result was terrible. I tried to play Pacman with the algorithm described above and I was mostly unsuccessful. The algorithm has several caveats. I tried to improve it by tweaking constants, but even after several hours of experimenting I had no success. In the end I realized one very important thing: while I was developing and testing the algorithm, I improved not only the algorithm but also my hand. By the end my hand was making beautiful gestures that the algorithm accepted properly. In the video above you have seen that in most cases it worked! But when I started playing the game (which is a stressful situation when three ghosts are behind you!) while holding the phone camera, my hand started making much less accurate gestures and my algorithm failed completely.

In the end I decided to throw this algorithm away and reuse my experience from it in my second attempt, which I will describe in the next blog. Many parts of my first and second algorithm are shared or only slightly modified, so it was not a total waste of time.

Conclusion

That is all for this blog. I described my first attempt at making a gesture detection algorithm. I had never written this kind of algorithm before, and it is more complicated than it looks at the beginning. I learned a lot. I gained a lot of experience with data preprocessing, including the very interesting digital filters used in the algorithm. In the end this algorithm was not very good, but many of its concepts are reused in my later algorithm, which I will describe in the next blog.

Thank you for your attention and stay tuned for my next blogs.

Next blog: Blog #13: Improved Time-driven Swipe Gesture Detection Library

dougw over 3 years ago

Would it help if you used a black wand with a small, bright IR tip?