Earlier in this series about deep learning (DL), my colleague Peter talked about neural networks and their importance to our work at CrowdAI. Following on from his post, I want to focus on how DL can lead to more accurate results. In the real world, datasets are almost never straightforward and without quirks; it can take people a very long time to comb through these datasets and draw out any meaningful and accurate information. Deep learning can be a powerful tool for finding insights in large, complex datasets in a fraction of that time.
Deep learning is an approach that uses neural networks to learn algorithmic solutions to open-ended problems. To begin, we must start with a dataset. The more complex our data is, the more complex the function we need to learn, and the more complex the decision boundary that separates one kind of example from another.
Currently, the most successful DL algorithms use an approach known as supervised learning. The supervision comes in the form of human-generated data labels that show the model what would be the perfect output for a given image (we often call this “ground truth”). For example, a DL model to classify dog/cat in images would need to be trained on a dataset of image + label pairs: for each image, we also have access to a label (either “cat” or “dog”) that was manually assigned to that image.
At their core, supervised learning approaches try to fit a function to the data. The data points are the images (training set), and the labels are the targets to fit. The perfect function would take an image and output the same label as a human annotator would. It would also be able to generalize to new images, meaning that it would still predict the correct label on an image it has never seen before. This function fitting is the “learning” part of deep learning.
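To make the function-fitting idea concrete, here is a minimal sketch. The single-number “images”, the labels, and the threshold-based function are all invented for illustration; real DL models fit far richer functions to far richer data:

```python
# A minimal sketch of "learning as function fitting": we search for a single
# parameter (a threshold) so that our function reproduces human-given labels.
# The toy data below is made up purely for illustration.

# Toy 1-D "images" (say, a brightness score) with human labels: 0 = cat, 1 = dog
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

def predict(x, threshold):
    """Our candidate function: output label 1 if x exceeds the threshold."""
    return 1 if x > threshold else 0

def errors(threshold):
    """How many training labels the function gets wrong at this threshold."""
    return sum(predict(x, threshold) != y for x, y in data)

# "Learning": pick the threshold that best fits the labeled training data
best = min((t / 100 for t in range(100)), key=errors)
print(errors(best))  # 0 -> this threshold fits the training set perfectly
```

A real model has millions of parameters instead of one, but the principle is the same: adjust the parameters until the function's outputs match the human labels.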
Getting such a perfect function is obviously infeasible: on challenging images, even humans don’t produce labels that are always correct. However, getting close enough to that goal is valuable in most cases.
Now, how do we find such a function? Neural networks are a wide family of functions that are commonly used for such tasks. They take an input (the image) and produce an output (the labels), but this output also depends on a number of internal parameters. Changing the values of these internal parameters will change how close the outputs of the function will be to the desired targets. The reason we use large and complex neural networks is so that we have access to a wider range of functions that can fit more complex datasets. This is the “deep” part of “deep learning”.
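As a small illustration of a function whose output depends on internal parameters, here is a hypothetical toy network with one hidden layer of two units. The weights and the input are made up, and real networks have millions of parameters rather than six:

```python
import math

def sigmoid(z):
    """A common squashing nonlinearity used inside neural networks."""
    return 1 / (1 + math.exp(-z))

def tiny_network(x, params):
    """Map one input number to one output, controlled by 6 internal parameters."""
    w1, b1, w2, b2, v1, v2 = params
    h1 = sigmoid(w1 * x + b1)       # hidden unit 1
    h2 = sigmoid(w2 * x + b2)       # hidden unit 2
    return sigmoid(v1 * h1 + v2 * h2)  # combine the hidden units into the output

# Two arbitrary parameter settings: same input, different outputs
params_a = (1.0, 0.0, -1.0, 0.0, 2.0, -2.0)
params_b = (0.5, 1.0, 1.5, -1.0, -2.0, 2.0)
print(tiny_network(0.3, params_a), tiny_network(0.3, params_b))
```

Changing the parameters changes the function; “learning” means finding the parameter values whose outputs best match the desired targets.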
However, this comes at a cost: more complex neural networks have considerably more parameters, often numbering in the millions or more. Finding the parameter values that give the best fit to the data becomes a very daunting task.
Imagine arriving in a strange bathroom containing a sink with 10 knobs that control the water temperature, and trying to get the water as hot as you can. As you turn the first knob clockwise to its maximum position, you realize that it affects the temperature in a complicated way: the temperature may increase, then decrease, then increase again. One option would be to set each knob, one at a time, to the position that gives the highest temperature. But this is doomed to fail because the knobs interact with each other: once you set the first knob and start playing with the second, you will have to revisit the first knob again.
One other approach would be to go (painstakingly) through each combination of all the knob positions. This is the brute-force approach and it is also not scalable. Even with 10 knobs, considering only 2 positions per knob (e.g. off and on) would lead to 2^10 = 1024 configurations! Now imagine if there are more knobs (100? 1000?): this number grows exponentially and visiting all configurations quickly becomes impossible.
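The brute-force enumeration can be sketched as follows. The temperature function here is invented purely for illustration (it makes two of the knobs interact); the point is that the number of configurations, not the function itself, is what blows up:

```python
from itertools import product

def temperature(knobs):
    """Hypothetical temperature: the knobs interact, so the effect of one
    setting depends on the others."""
    return sum(knobs) - 2 * knobs[0] * knobs[1]

n_knobs = 10

# Every on/off combination of the knobs: 2**10 of them
configs = list(product([0, 1], repeat=n_knobs))
print(len(configs))  # 1024

# Brute force: try them all and keep the hottest
best = max(configs, key=temperature)
```

With 100 knobs there would be 2**100 configurations, which is far beyond what any computer could ever enumerate.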
In DL, the parameters of the neural network play the role of the knobs in this analogy, and the water temperature is a metric we try to optimize, which can be thought of as how well the function fits the data. The preferred approach to this problem in DL is an iterative one. We start with a random configuration (each knob, or parameter, is set to a random value). For each knob (or parameter), we take note of whether turning it on its own clockwise or anticlockwise would increase the temperature (or improve the data fit), which can be done efficiently. Finally, we turn all the knobs a tiny amount in these directions. Hopefully, after this step the water temperature will have increased by a small amount, meaning that we now have a better configuration.
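The iterative procedure can be sketched like this. The smooth temperature function, the step size, and the nudge amount are all assumptions for the example; in real DL the per-parameter directions come from gradients rather than from nudging each parameter separately, but the loop has the same shape:

```python
# Iterative knob-turning: for each knob, check which direction (locally)
# raises the temperature, then nudge every knob a tiny amount that way.

def temperature(knobs):
    """A made-up smooth function with a single best setting (0.7 per knob)."""
    return -sum((k - 0.7) ** 2 for k in knobs)

def step(knobs, step_size=0.01, eps=1e-4):
    """One small update of all the knobs at once."""
    new_knobs = []
    for i, k in enumerate(knobs):
        # Nudge only knob i and see whether the temperature goes up or down
        nudged = knobs[:i] + [k + eps] + knobs[i + 1:]
        direction = temperature(nudged) - temperature(knobs)
        new_knobs.append(k + step_size * (1 if direction > 0 else -1))
    return new_knobs

knobs = [0.0] * 10        # start from an arbitrary configuration
for _ in range(100):      # repeat many small steps
    knobs = step(knobs)
print(round(knobs[0], 2))  # each knob drifts toward the best setting, 0.7
```

Each pass through the loop is one “step” in the sense used below: a small improvement that must be repeated many times.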
Each step takes some time to complete. In my analogy, you can think of it as the time it takes for the water temperature to stabilize after you change the knobs; in DL, it is because computing the outputs of a neural network takes a certain amount of time. To get a good configuration, it is necessary to repeat this process for many steps, which is why training a large neural network can be time-consuming.
Fitting a function to training data is one thing, but a harder task is to come up with a function that can also handle new data it has never seen before. Without the proper care, you may end up with a model that perfectly fits the training data but does very badly on unseen data, a failure mode known as overfitting; such a model isn’t very useful.
This is less likely to happen as the size of the dataset grows, which is why DL is particularly suited to large datasets. A benefit of DL is that the models can continuously improve when presented with new training data, resulting in more and more accurate models over time.
At CrowdAI, our focus is on providing a code-free environment where anyone can create and train their own DL models. But behind the scenes of our platform, our DL engine is running on a foundation of good science! This means we make sure that our workflows are sound and that we extract the correct signal from the noise when you click “Start Training” in the platform.
For example, we take great care to make sure that the media you upload is split automatically into two subsets: one for training the model (the “training set”) and one for testing how well that model generalizes (the “test set”). A common problem in DL is contamination between these two sets. If media from your test set are accidentally shown to the model during training as well, it is a bit like giving it the answers to the final exam! This would result in an overly optimistic estimate of performance that the model may not be able to achieve on the real data it is deployed on. These good practices are built into our platform to take care of this for you.
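For readers curious what such a split looks like under the hood, here is an illustrative sketch. The filenames, the fixed seed, and the 80/20 ratio are assumptions for the example, not the platform's actual settings:

```python
import random

# Shuffle the media once, then carve off disjoint training and test sets.
media = [f"clip_{i:03d}.mp4" for i in range(100)]  # hypothetical uploads

rng = random.Random(42)        # fixed seed so the split is reproducible
shuffled = media[:]
rng.shuffle(shuffled)

split = int(0.8 * len(shuffled))
train_set = shuffled[:split]   # used to fit the model
test_set = shuffled[split:]    # held out to measure generalization

# No contamination: every clip is in exactly one of the two sets
assert not set(train_set) & set(test_set)
print(len(train_set), len(test_set))  # 80 20
```

Shuffling before the split matters: if the media were ordered (say, by capture session), a naive first-80/last-20 split could leave the two sets systematically different.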
Regardless of the type of media you want to use to train a model, I hope you’ll use CrowdAI to see the power of deep learning for yourself!