Using CrowdAI Models and GroundingDINO Models: Going from Trained Object Detection to Segmentation using SAM in the Blink of an Eye; All Happening Within the CrowdAI Platform
April 24, 2023

JB Boin and Taylor Maggos

Meta AI recently released a new zero-shot segmentation model called Segment Anything Model or SAM (Segment Anything, Kirillov et al.), which we have integrated into the CrowdAI platform. This new model is generating excitement among researchers and practitioners in the computer vision community due to its unique capabilities.

One of the most impressive features of SAM is its support for interactive segmentation using positive and negative clicks. Instead of painstakingly labeling each object by tracing its outline point by point, users can simply click on the areas they want to include or exclude, which makes the annotation process much more efficient. We covered this feature in our last blog post.

Another feature of SAM that is very relevant to our product at CrowdAI is that it also accepts bounding box inputs. A user can draw bounding boxes around the objects of interest, and SAM will automatically segment the most likely object delimited by each box. Therefore, if we already have a trained object detection model, we can feed its predicted bounding boxes into the SAM integration within our platform and obtain a segmentation mask for each object. This effectively gives us a segmentation model "for free" if we have already invested in annotating and training an object detection model.
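As a rough illustration of this box-to-mask flow, here is a minimal Python sketch. It is not the CrowdAI platform code: `boxes_to_masks`, `fill_box_segmenter`, and the type aliases are hypothetical names we introduce only for this example, and the stand-in segmenter simply fills each box so that the snippet runs without any model weights. With the open-source `segment_anything` package, the segmenter could instead wrap `SamPredictor.predict(box=...)`.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for this sketch (not a platform API).
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[int]]           # placeholder grayscale image, image[row][col]
Mask = List[List[bool]]           # mask[row][col], True where the object is


def boxes_to_masks(
    image: Image,
    boxes: Sequence[Box],
    segment_fn: Callable[[Image, Box], Mask],
) -> List[Mask]:
    """Turn each detector box into one mask by calling a box-prompted segmenter."""
    return [segment_fn(image, box) for box in boxes]


def fill_box_segmenter(image: Image, box: Box) -> Mask:
    """Stand-in segmenter: marks every pixel inside the box.

    A real backend such as SAM would return the outline of the most likely
    object inside the box instead of the whole rectangle.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    return [
        [x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
        for y in range(height)
    ]


# Illustrative data: a blank 6x4 image and one predicted box.
image = [[0] * 6 for _ in range(4)]
masks = boxes_to_masks(image, [(1, 1, 3, 3)], fill_box_segmenter)
```

Keeping the segmenter as a plain callable means the detector that produces the boxes and the model that produces the masks can be swapped independently.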

To test how quickly we could move from an object detection model to a segmentation model using the SAM integration in the CrowdAI platform, we revisited a detection model we had built in the past. CrowdAI worked on automating visual inspection for a real estate developer. Their goal was to have tenants take photos of damage in their apartments and have our models automatically identify the potential problem and the appliance affected. The examples below are from a quick study CrowdAI completed to identify whether a shower head was leaking. For the purpose of the study, we used images of shower heads running and not running to train a model to identify water ‘leaking’ from the shower head.

Using the SAM integration in the CrowdAI platform, we fed the original bounding box predictions through SAM and obtained the resulting segmentation masks. Figure 1 below shows the bounding boxes predicted by our model, as well as the segmentation masks SAM proposed for those bounding boxes, on a selection of images that were not seen during training. On this dataset the masks are very accurate. There are some obvious failure cases, for example inconsistencies in whether the mask includes the part of the water stream that overlaps with the shower head, but overall the quality of the segmentation is high.

Figure 1: Comparison of outputs from CrowdAI’s original detection model for water coming from shower heads and the segmentation masks generated by the Segment Anything Model (SAM) integration in the CrowdAI platform from that model’s bounding boxes.

By manually annotating and training only an object detection model, we obtained a very decent segmentation model using our platform integration with SAM.

Adding Zero-Shot Detection into the CrowdAI Platform

Is it possible to remove the need for manual bounding box annotations in the first place? This question led us to explore platform integration with another type of foundation model: GroundingDINO (Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Liu et al.). This model was released in March 2023 and established new state-of-the-art results for zero-shot detection. Given a text prompt and an image, the model detects boxes corresponding to the different concepts in the text prompt, which makes it a very exciting addition to the CrowdAI platform interface.

For example, with the dataset used above, we could approximate the boxes detected by our supervised model by applying the GroundingDINO model on those images using the prompt: “water”. If we are also interested in segmenting another type of object, like the shower head, we can add it to the prompt as follows: “water . shower head”. 
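To make the prompt format concrete, here is a small Python sketch that builds such a prompt from a list of class names and splits it back apart. The helpers `build_prompt` and `split_prompt` are hypothetical names used only for illustration; the only assumption taken from the text above is that classes are separated by " . ".

```python
from typing import Iterable, List


def build_prompt(class_names: Iterable[str]) -> str:
    """Join class names into a single text prompt, separated by ' . '."""
    cleaned = [name.strip() for name in class_names if name.strip()]
    return " . ".join(cleaned)


def split_prompt(prompt: str) -> List[str]:
    """Recover the individual class names from a ' . '-separated prompt."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


prompt = build_prompt(["water", "shower head"])
print(prompt)                # water . shower head
print(split_prompt(prompt))  # ['water', 'shower head']
```

Adding a new object of interest is then just a matter of appending another name to the list.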

We ran the detector with this prompt on a random selection of our shower images; the results are shown below in Figure 2.

Figure 2: Results from GroundingDINO model integration into the CrowdAI platform, using the prompt: “water . shower head”. The blue boxes correspond to the water class whereas the red boxes to the shower head class.

The results from our GroundingDINO platform integration are definitely not perfect: we notice some duplicate, missing, or incorrectly located boxes. Overall, however, the quality is sufficient, and the majority of boxes are close to what a reasonably experienced person would label by hand.

Now we can use these bounding boxes as input to our SAM integration on the CrowdAI platform and get segmentation masks. Figure 3 below shows the results for each class. For each class, we show only the mask corresponding to the highest-confidence box of that class (if any).

Figure 3: Predicted segmentation results from feeding the GroundingDINO bounding box outputs from Figure 2 into our SAM integration. The blue masks represent the ‘water’ class and the red masks represent the ‘shower head’ class. 
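The per-class selection used for Figure 3 (keep only the highest-confidence box of each class before segmenting) can be sketched in a few lines of Python. The `Detection` record, the `best_box_per_class` helper, and the scores and coordinates below are all made up for illustration; they are not actual model outputs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: class label, confidence, and box."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def best_box_per_class(detections: Iterable[Detection]) -> Dict[str, Detection]:
    """Keep, for each class, only the detection with the highest score."""
    best: Dict[str, Detection] = {}
    for det in detections:
        current = best.get(det.label)
        if current is None or det.score > current.score:
            best[det.label] = det
    return best


# Illustrative values only.
detections = [
    Detection("water", 0.41, (120, 200, 260, 480)),
    Detection("water", 0.78, (118, 190, 255, 470)),
    Detection("shower head", 0.66, (100, 60, 280, 200)),
]
best = best_box_per_class(detections)
```

A class with no detections simply does not appear in the result, which matches the "(if any)" case above.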

Even though we can observe some clear failure cases, which are due to a failure of either GroundingDINO or SAM, many masks have a quality on par with manual human-in-the-loop labeling.

The strength of integrating this zero-shot approach into the CrowdAI platform is that it requires a substantially lower volume of data than traditional approaches. In practice, this can often make up for the reduced quality of masks as compared with manual labeling. Using GroundingDINO and SAM in conjunction within the CrowdAI platform leads to a process of semi-automatic or even fully-automatic labeling.

Instead of asking annotators to spend a large amount of time annotating an image, we could envision a future where zero-shot models run without human intervention on a large volume of images, and where the only decision process requiring a human would be deciding whether a particular mask should be kept or discarded. With very little manual work, CrowdAI can collect a tremendous amount of labeled data that could then be used to train a higher quality segmentation model going forward. 

At CrowdAI, we are excited to integrate new technologies into our platform to advance our computer vision capabilities. Moving easily between object detection models and segmentation models is already an incredible advancement. By combining our integrations with GroundingDINO and SAM, we can further increase the speed of going from raw pixels to a production-ready model on our end-to-end platform, putting a future of automated labeling at our fingertips.

Find out more about our platform and capabilities at
