April 24, 2023

Using CrowdAI Models and GroundingDINO Models: Going from Trained Object Detection to Segmentation using SAM in the Blink of an Eye; All Happening Within the CrowdAI Platform

JB Boin and Taylor Maggos

Meta AI recently released a new zero-shot segmentation model called Segment Anything Model or SAM (Segment Anything, Kirillov et al.), which we have integrated into the CrowdAI platform. This new model is generating excitement among researchers and practitioners in the computer vision community due to its unique capabilities.

One of SAM's most impressive features is interactive segmentation driven by positive and negative clicks. Instead of painstakingly tracing each object's outline point by point, users simply click on the areas they want to include or exclude, which makes annotation much more efficient. We covered this in our last blog post.

Another feature of SAM that is very relevant to our product at CrowdAI is that it also accepts bounding box inputs. A user can draw bounding boxes around the objects of interest, and SAM will automatically segment the most likely object within each box. Therefore, if we already have a trained object detection model, we can feed its predicted bounding boxes into the SAM integration within our platform and obtain a segmentation mask for each object. This effectively gives us a segmentation model "for free" if we have already invested in annotating and training an object detection model.
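To make the box-prompt idea concrete, here is a minimal sketch of turning detector boxes into masks. It is not the CrowdAI platform integration; it assumes a predictor object that follows the `SamPredictor` interface from Meta's open-source `segment_anything` package (`set_image`, then `predict` with a `box` given as `[x_min, y_min, x_max, y_max]` in pixels). The helper name `masks_from_boxes` and the checkpoint path in the comments are hypothetical.

```python
import numpy as np


def masks_from_boxes(predictor, image_rgb, boxes_xyxy):
    """Return one boolean HxW mask per detection box.

    Assumption: `predictor` behaves like segment_anything.SamPredictor,
    i.e. it exposes set_image(image) and predict(box=..., multimask_output=...).
    """
    # The image embedding is computed once and reused for every box prompt.
    predictor.set_image(image_rgb)
    masks = []
    for box in boxes_xyxy:
        # Box prompt in pixel coordinates: [x_min, y_min, x_max, y_max].
        box_arr = np.asarray(box, dtype=np.float32)
        # multimask_output=False requests a single mask for the prompt.
        mask, _scores, _logits = predictor.predict(box=box_arr, multimask_output=False)
        # The returned array has shape (1, H, W); keep the single mask.
        masks.append(np.asarray(mask[0]).astype(bool))
    return masks


# With the real library (needs the package and a downloaded checkpoint):
# from segment_anything import SamPredictor, sam_model_registry
# sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_checkpoint.pth")
# masks = masks_from_boxes(SamPredictor(sam), image_rgb, detector_boxes)
```

Calling `set_image` once per image and looping over the boxes keeps the expensive embedding step out of the per-box loop.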

To test how quickly we could move from an object detection model to a segmentation model using the SAM integration in the CrowdAI platform, we revisited a detection model we had built in the past. CrowdAI had worked on automating visual inspection for a real estate developer. Their goal was to have tenants take photos of damage in their apartments and have our models automatically identify the potential problem and the appliance involved. The examples below are from a quick study CrowdAI completed to identify whether a shower head was leaking. For the purposes of the study, we used images of shower heads running and not running to train a model to identify water ‘leaking’ from the shower head.

Using the SAM integration in the CrowdAI platform, we fed the original bounding box predictions through SAM and obtained the resulting segmentation masks. Figure 1 below shows the bounding boxes predicted by our model, along with the segmentation masks SAM proposed for those boxes, on a selection of images that were not seen during training. On this dataset the masks are very accurate. There are some obvious failure cases, such as inconsistencies in whether the mask includes the part of the water stream that overlaps the shower head, but overall the quality of the segmentation is high.

**Figure 1:** Comparison of outputs from CrowdAI’s original detection model for water coming from shower heads with the segmentation masks generated by the Segment Anything Model (SAM) integration in the CrowdAI platform from that model’s bounding boxes.

By manually annotating and training only an object detection model, we obtained a very decent segmentation model through our platform integration with SAM.

Adding Zero-Shot Detection to the CrowdAI Platform

Is it possible to remove the need for manual bounding box annotations in the first place? That question led us to explore platform integration with another type of foundation model: GroundingDINO (Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Liu et al.). This model was released in March 2023 and established new state-of-the-art results for zero-shot detection. Given a text prompt and an image, the model detects boxes corresponding to the different concepts in the text prompt, which makes it an exciting addition to the CrowdAI platform interface.

For example, with the dataset used above, we could approximate the boxes detected by our supervised model by applying the GroundingDINO model on those images using the prompt: “water”. If we are also interested in segmenting another type of object, like the shower head, we can add it to the prompt as follows: “water . shower head”.
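As a small illustration of the prompt format, the sketch below builds a multi-class prompt from a list of class names and splits it back apart. The " . " separator is taken from the prompt shown above; the helper names `build_prompt` and `split_prompt` are hypothetical and not part of GroundingDINO or the CrowdAI platform.

```python
def build_prompt(class_names):
    """Join class names into a single 'water . shower head' style prompt."""
    # Drop empty entries and stray whitespace before joining.
    cleaned = [name.strip() for name in class_names if name.strip()]
    return " . ".join(cleaned)


def split_prompt(prompt):
    """Recover the individual class names from a period-separated prompt."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


prompt = build_prompt(["water", "shower head"])
print(prompt)                # water . shower head
print(split_prompt(prompt))  # ['water', 'shower head']
```

Because the period acts as the separator, this sketch assumes class names do not themselves contain periods.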

We ran the detector with this prompt on a random selection of our shower images. Figure 2 below shows the results.

**Figure 2:** Results from the GroundingDINO model integration in the CrowdAI platform, using the prompt “**water . shower head**”. The blue boxes correspond to the **water** class and the red boxes to the **shower head** class.

The results from our GroundingDINO platform integration are definitely not perfect: we notice some duplicate, missing, or incorrectly located boxes. Overall, however, the quality is sufficient, and the majority of boxes are close to what a reasonably experienced person would label by hand.

Now we can use these bounding boxes as input to our SAM integration on the CrowdAI platform to obtain segmentation masks. Figure 3 below shows the results for each class. For each class, we show only the mask corresponding to the highest-confidence box of that class (if any).
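The per-class selection described here can be sketched as follows. This is an illustrative helper, not the platform's code: the detection format (dictionaries with `label`, `score`, and `box` keys), the function name `top_box_per_class`, and the sample values are all assumptions made for the example.

```python
def top_box_per_class(detections):
    """Keep only the highest-confidence detection for each class label.

    Assumed input: a list of dicts with "label", "score", and "box" keys.
    Classes with no detections are simply absent from the result.
    """
    best = {}
    for det in detections:
        label = det["label"]
        # Replace the stored detection only when this one scores higher.
        if label not in best or det["score"] > best[label]["score"]:
            best[label] = det
    return best


# Made-up detections for a single image, for illustration only.
detections = [
    {"label": "water", "score": 0.62, "box": [40, 80, 120, 300]},
    {"label": "water", "score": 0.81, "box": [42, 85, 118, 310]},
    {"label": "shower head", "score": 0.74, "box": [30, 10, 130, 90]},
]
best = top_box_per_class(detections)
print(best["water"]["box"])  # [42, 85, 118, 310]
```

Only the boxes that survive this step would then be passed to SAM as prompts, one per class.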

**Figure 3:** Predicted segmentation results from feeding the GroundingDINO bounding box outputs from *Figure 2* into our SAM integration. The blue masks represent the ‘water’ class and the red masks represent the ‘shower head’ class.

We can observe some clear failure cases, due to a failure of either GroundingDINO or SAM, but many masks have a quality on par with manual human-in-the-loop labeling.

The strength of integrating this zero-shot approach into the CrowdAI platform is that it requires a substantially lower volume of data than traditional approaches. In practice, this can often make up for the reduced quality of masks as compared with manual labeling. Using GroundingDINO and SAM together within the CrowdAI platform leads to a process of semi-automatic or even fully automatic labeling.

Instead of asking annotators to spend a large amount of time annotating an image, we could envision a future where zero-shot models run without human intervention on a large volume of images, and where the only decision process requiring a human would be deciding whether a particular mask should be kept or discarded. With very little manual work, CrowdAI can collect a tremendous amount of labeled data that could then be used to train a higher quality segmentation model going forward.

At CrowdAI, we are excited to integrate new technologies into our platform to advance our computer vision capabilities. Moving easily between object detection models and segmentation models is already a significant advancement. By combining our integrations with GroundingDINO and SAM, we can further increase the speed of going from raw pixels to a production-ready model on our end-to-end platform, bringing a future of automated labeling within reach.

Find out more about our platform and capabilities at Crowdai.com.

May 22, 2023

“Small Devices, Big Impacts: Streaming Computer Vision Models at the Edge”

Running a computer vision model on a cell phone or mobile device is a powerful tool that can enable real-time analysis of images and videos, which can be useful in a variety of applications. While there are challenges to streaming computer vision models on small devices, CrowdAI has developed a roadmap of techniques and tools to overcome these challenges. By leveraging cloud-driven API connections for invoking inference from a trained model, CrowdAI sees a pathway to real-time analysis of imagery and video on small devices operating at the edge. Additionally, the geospatial benefits of building models from media captured on cell phones can offer unique advantages for training, monitoring, and analyzing objects of interest.

Zeke Foppa and Taylor Maggos

May 8, 2023

Deploy Anywhere; Use Every Camera: The Power of the CrowdAI Platform

In today's world, where we are surrounded by computers and cameras of all types and sizes, it's essential for machine learning services to be deployment-agnostic and camera-agnostic. Being able to work in any cloud, hardware, or software environment and to use any camera or sensor is an invaluable advantage, one that has become increasingly important in recent years as the use of cameras has exploded in various industries. These features allow for greater flexibility and ease of use, which is exactly what CrowdAI strives to provide, and enable ML to be used in a wider range of applications.

Patrick Collins and Taylor Maggos

May 1, 2023

Exploring how SAM and GroundingDino Increase Opportunities to Accelerate Semi- and Fully Automated Bounding Box Data Labeling

Going from a complex segmentation model to a simpler bounding box object detection model using SAM may seem like a bit of overkill, but there are some instances where an object detection model is favored over a segmentation model. For example, if we have a photo of a street with a bunch of pedestrians, a detection model can provide insight into how many people are there, their location in the frame, and how they interact with each other; segmentation masks wouldn’t give us as useful information since they would just be silhouettes of standing or walking people. Another benefit is that object detection models are designed to be more robust to variations in object size, rotation, and aspect ratio, making them ideal for identifying objects with diverse geometries. Lastly, when computational resources are limited, object detection models tend to be less computationally intensive than segmentation models, which can require more processing power and memory to run efficiently.

Zeke Foppa and Taylor Maggos
