Design for a Cause - Design Challenge
Audio4Vision #5 - Show and Tell: More Iterations and Finetuning

pranjalranjan299
12 Aug 2018

Welcome to our fifth blog post! This one will be a little short, as it's essentially an update on the previous post.

 

In that post we introduced the Show and Tell model and discussed its structure and capabilities. We also mentioned that we would increase the number of iterations and finetune the model a little.

 

First, the outputs from the older model:

[Images: sample captions generated by the 1M-iteration model]

 

This model was trained for 1 million iterations.

We then trained a new model, this time for 2 million iterations, and performed some finetuning. These are the results:

 

[Images: sample captions generated by the 2M-iteration, finetuned model]

 

As you can see, the prediction probabilities have increased, and in general a clearer picture is painted for our user.
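The "probabilities" here are the scores the beam search assigns to each candidate caption. As a rough illustration (with made-up captions and log-probabilities, not our actual model outputs), normalizing the beam scores shows how a better-trained model concentrates probability on a single caption:

```python
import math

def rank_captions(beam_results):
    """beam_results: list of (caption, log_probability) pairs, as a
    beam-search decoder reports them. Returns (caption, probability)
    pairs normalized over the beam, most probable first."""
    probs = [(cap, math.exp(lp)) for cap, lp in beam_results]
    total = sum(p for _, p in probs)
    return sorted(((cap, p / total) for cap, p in probs),
                  key=lambda pair: pair[1], reverse=True)

# Hypothetical beams for the same image from the two checkpoints:
old_model = [("a car is driving down the road", -4.1),
             ("a car parked near a tree", -4.3)]
new_model = [("a car parked near a tree", -3.2),
             ("a car is driving down the road", -4.8)]

best_old = rank_captions(old_model)[0]
best_new = rank_captions(new_model)[0]
```

Here the newer model puts a larger share of the beam's probability mass on its top caption, which is what "more confident" means in practice.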

That's all for this update. The next blog will be up soon with more exciting progress, so stay tuned!

 

Thank you for reading.


Top Comments

  • dixonselvan
    dixonselvan over 7 years ago +1
    Nice update pranjalranjan299 But I don't seem to find any improvement after increasing the number of iterations to 2M (except the first one where it correctly identifies car is parked near a tree). In…
  • aspork42
    aspork42 over 7 years ago

    Great update! I agree - there is a big difference between driving and being parked, but you have clearly explained what is happening and why that would be the case. Looking forward to seeing the project develop.

  • dixonselvan
    dixonselvan over 7 years ago in reply to pranjalranjan299

    That was an excellent explanation, now I understand things more clearly. Good luck on the development of the hybrid model.

  • pranjalranjan299
    pranjalranjan299 over 7 years ago in reply to genebren

I would say the second model has been trained longer, so it has overfit the dataset it was trained on. Hence it's giving out the caption of a similar image that exists in its original training data. There are a few solutions to differentiate between moving and stationary:

     

1) We can create another learning model which takes multiple frames of the same scene, so it detects the object at different coordinates across frames, assuming the camera was stationary when the frames were taken. If the coordinates are the same in all the frames, the object can be treated as stationary; otherwise, we can say it is moving. I think we can implement this after the first model is complete, because right now we are facing several obstacles just deploying it on the cloud and linking it to the MKR1000. So this additional feature can be added once we have a functioning unit capable of giving out fairly good descriptive captions.

     

2) The second solution could be using a sensor to detect movement in the camera's line of sight: the sensor monitors for a certain period while the image is taken, the caption is then generated, and the word "moving" or "stationary" is appended depending on the values of the sensor(s) that detect a change in velocity or movement in the vicinity. This can also be a viable solution; maybe we can use a PIR sensor for relatively near objects, but I doubt it will work for objects beyond a certain distance.
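The coordinate-comparison idea in solution 1 could be sketched roughly like this (hypothetical box format and pixel threshold; a real object detector would supply the coordinates):

```python
def is_moving(frame_boxes, tolerance=5):
    """frame_boxes: list of (x, y, w, h) bounding boxes for the same object
    detected in consecutive frames (camera assumed stationary).
    Returns True if the box centre shifts more than `tolerance` pixels
    between any two consecutive frames."""
    centres = [(x + w / 2, y + h / 2) for x, y, w, h in frame_boxes]
    for (x0, y0), (x1, y1) in zip(centres, centres[1:]):
        if abs(x1 - x0) > tolerance or abs(y1 - y0) > tolerance:
            return True
    return False

# Append the motion word to a generated caption:
caption = "a car on the street"
boxes = [(100, 50, 40, 20), (130, 52, 40, 20), (160, 51, 40, 20)]
caption += " (moving)" if is_moving(boxes) else " (stationary)"
```

The tolerance absorbs detector jitter, so a box that wobbles by a pixel or two between frames isn't reported as motion.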

     

    Thanks for the fantastic suggestion Gene, it will certainly improve our project's functioning.

  • pranjalranjan299
    pranjalranjan299 over 7 years ago in reply to dixonselvan

Well, the improvement is in the probabilities, which you might not observe directly in the caption of the same picture. But if there were a similar picture with two close probabilities for different predictions, the newly trained model would still give the correct caption, as it assigns it a higher probability. In simple words, the second model is more "confident" about the correct predictions.

As far as the second image is concerned, it's a clear example of overfitting, which is very common in image-based deep learning. As you increase the iterations, the model learns the captions of its original dataset more strongly, and hence gives slightly wrong captions in certain cases. We are going to use a model that is a hybrid of these two, so it can predict captions with higher probability while not suffering from overfitting.
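How the hybrid would combine the two checkpoints isn't specified here; one simple possibility is an ensemble that averages the two models' log-probabilities per caption (illustrated with made-up scores, not actual model outputs):

```python
def ensemble_caption(scores_a, scores_b):
    """scores_a, scores_b: dicts mapping caption -> log-probability from
    two separately trained checkpoints. Picks the caption with the highest
    average log-probability among captions both models proposed."""
    shared = set(scores_a) & set(scores_b)
    return max(shared, key=lambda c: (scores_a[c] + scores_b[c]) / 2)

# Made-up scores from the 1M- and 2M-iteration checkpoints:
model_1m = {"a car is driving down the road": -4.1,
            "a car parked near a tree": -4.3}
model_2m = {"a car is driving down the road": -4.8,
            "a car parked near a tree": -3.2}
```

Averaging rewards captions both models agree on, which is one way an ensemble can damp the overfit model's idiosyncratic predictions.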


  • genebren
    genebren over 7 years ago

    Interesting update.  I wonder on image 11.jpg, which caption is more correct (before or after).  Before says 'driving', while after says 'parked'.  There is a big difference between the two statements.  Also, for motion, wouldn't a series of images (i.e. change in position or potential velocity) be a valuable input for a correct caption?  Would your modeling allow for a caption based on differences between images?

     

    Good luck,

    Gene
