Exploring How SAM and GroundingDINO Increase Opportunities to Accelerate Semi- and Fully Automated Bounding Box Data Labeling
May 1, 2023

Zeke Foopa and Taylor Maggos

We believe that integrating new AI technologies and fostering collaboration are crucial to developing the CrowdAI platform and to finding ways to build models more efficiently. Our team continues to push boundaries, exploring what else is possible with our recent platform integrations of SAM and GroundingDINO.

Our blog post from earlier this month discusses the Segment Anything Model (SAM), recently developed by Meta AI, which aims to revolutionize the segmentation process in computer vision. SAM's zero-shot generalization allows users to label entire images in just a few clicks, making the process quicker and more efficient (Figure 1). That post highlights how CrowdAI has incorporated SAM into our platform to aid in building domain-specific custom models while reducing the burden of labeling thousands of media items.

Once SAM was integrated into the CrowdAI platform and we saw how easy segmentation labeling became with just a few clicks, questions started ringing in everyone's ears: Could we simply input bounding boxes from object detection models and get a segmentation model, with no clicks required? What if we didn't have to label the bounding boxes at all? Could it all be automated?

These questions led to more research and exploration, which we cover in our latest blog post, where we discussed the integration of not only SAM but also GroundingDINO (Liu et al.) into the CrowdAI platform. SAM provides segmentation labels from positive/negative clicks or bounding box inputs. GroundingDINO is used for zero-shot detection: it detects objects in images based on a text prompt. The bounding boxes output by our GroundingDINO integration are then fed into the SAM integration to obtain segmentation masks, which allows an easy transformation from object detection to segmentation models (Figure 2).
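The hand-off from text-prompted detection to box-prompted segmentation can be sketched in a few lines. This is a minimal illustration of the data flow only, not CrowdAI's integration or the API of either model: `detect_then_segment`, `fake_detect`, `fake_segment`, and the `box_threshold` value are all hypothetical names and numbers chosen for this example, and the two stand-in functions replace the real models so that the sketch runs without any weights.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixel coordinates
Mask = List[List[int]]            # binary mask: rows of 0/1 values
Image = List[List[int]]           # placeholder image type for this sketch


def detect_then_segment(
    image: Image,
    prompt: str,
    detect: Callable[[Image, str], List[Tuple[Box, float]]],
    segment: Callable[[Image, Box], Mask],
    box_threshold: float = 0.35,  # illustrative cutoff, not a recommended value
) -> List[Dict]:
    """Run a text-prompted detector, then request one mask per kept box."""
    labels = []
    for box, score in detect(image, prompt):
        if score < box_threshold:
            continue  # drop low-confidence detections before segmenting
        labels.append({"prompt": prompt, "box": box, "score": score,
                       "mask": segment(image, box)})
    return labels


# --- Stand-ins so the sketch runs without any model weights ---------------
def fake_detect(image: Image, prompt: str) -> List[Tuple[Box, float]]:
    """Hypothetical detector: returns fixed boxes regardless of the prompt."""
    return [((1, 1, 2, 2), 0.90), ((0, 0, 3, 0), 0.10)]


def fake_segment(image: Image, box: Box) -> Mask:
    """Hypothetical segmenter: marks every pixel inside the box."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0
             for x in range(len(image[0]))]
            for y in range(len(image))]


image = [[0] * 4 for _ in range(4)]
labels = detect_then_segment(image, "plane", fake_detect, fake_segment)
print(len(labels), labels[0]["box"])
```

In a real pipeline, the two callables would wrap the actual detector and segmenter; keeping them as parameters makes it easy to swap either model without touching the labeling loop.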

While GroundingDINO’s zero-shot detection principle is novel, there are some circumstances where object detection based on a written prompt is less viable. Could we use SAM’s “segment everything” feature as a fallback when GroundingDINO fails, and still maintain semi- or fully automated labeling for object detection models?

Figure 3 below shows an example from our GroundingDINO integration in which the text prompt “Plane” failed.

If the goal is to create an object detection model and GroundingDINO is not successful, our previous workflow would be to hand-label bounding boxes around the objects of interest. Depending on the dataset, however, this can still be very tedious work, especially on satellite imagery, which is where GroundingDINO seems to perform less well. To work smarter and not harder, we have been experimenting with using a segmentation model produced by our SAM integration to create an object detection model with little labeling effort.

Currently, the CrowdAI platform offers the option to override a training type (e.g., create labels for a segmentation model but train both a segmentation model and an object detection model, producing two trained models for the labeling effort of one). This process can also be used to easily extract a bounding box around each segmentation mask created with our SAM integration. Figure 4 below shows how we have moved from SAM segmentation masks to bounding boxes within the CrowdAI platform.
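Extracting a box from a mask amounts to finding the smallest axis-aligned rectangle that contains every mask pixel. The snippet below is a minimal sketch of that idea, assuming a binary mask stored as rows of 0/1 values; `mask_to_box` is a hypothetical helper written for this post, not a function of the CrowdAI platform or of SAM.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel coordinates


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest axis-aligned box around the nonzero pixels of a mask.

    Returns None when the mask is empty, since no box can be drawn.
    """
    # Row indices that contain at least one mask pixel (already in ascending order).
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every mask pixel across all rows.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
box = mask_to_box(mask)
print(box)  # (1, 1, 3, 3)
```

Because the box depends only on the extreme pixels of the mask, a stray pixel far from the object will stretch it, which is one reason to clean up or filter masks before converting them.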

The segmentation masks in the example above were created with SAM’s “segment everything” feature, and unwanted masks were then filtered out with clicks. This process still requires manual work but took far less time than drawing individual bounding boxes around each object. Along the way, our team brainstormed additional features for our data labeling interface to streamline this operation: discarding masks based on bounding-box area, discarding masks based on mask area, and tweaking ML confidence settings. These additions are on our roadmap to further reduce the need for manual labeling.
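To make the proposed filters concrete, here is a rough sketch of how area and confidence rules could prune a set of candidate masks. It is an illustration under our own assumptions rather than the planned platform feature: `MaskCandidate`, `filter_candidates`, the field names, and every threshold below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MaskCandidate:
    name: str                        # label used only to make the demo readable
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive
    mask_area: int                   # number of pixels set in the mask
    score: float                     # model confidence in [0, 1]


def box_area(box: Tuple[int, int, int, int]) -> int:
    """Area in pixels of an inclusive (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min + 1) * (y_max - y_min + 1)


def filter_candidates(
    candidates: List[MaskCandidate],
    min_mask_area: int = 0,
    max_box_area: Optional[int] = None,
    min_score: float = 0.0,
) -> List[MaskCandidate]:
    """Keep only candidates that pass every enabled rule."""
    kept = []
    for c in candidates:
        if c.mask_area < min_mask_area:
            continue  # too small: likely a speck or noise
        if max_box_area is not None and box_area(c.box) > max_box_area:
            continue  # too large: likely background such as a runway
        if c.score < min_score:
            continue  # model was not confident in this mask
        kept.append(c)
    return kept


candidates = [
    MaskCandidate("plane", (10, 10, 49, 39), 700, 0.95),
    MaskCandidate("speck", (0, 0, 2, 2), 6, 0.90),
    MaskCandidate("runway", (0, 0, 499, 99), 45000, 0.97),
    MaskCandidate("uncertain", (60, 60, 99, 89), 650, 0.40),
]
kept = filter_candidates(candidates, min_mask_area=50,
                         max_box_area=10_000, min_score=0.8)
print([c.name for c in kept])  # ['plane']
```

Suitable thresholds depend heavily on image resolution and object size, so in practice they would need to be tuned per dataset rather than fixed.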

So why care about the ability to switch from segmentation to object detection? Going from a complex segmentation model to a simpler bounding box object detection model using SAM may seem like overkill, but there are some instances where an object detection model is favored over a segmentation model. For example, if we have a photo of a street with many pedestrians, a detection model can provide insight into how many people are there, where they are in the frame, and how they interact with each other; segmentation masks would add little here, since they would just be silhouettes of standing or walking people. Another benefit is that object detection models are designed to be more robust to variations in object size, rotation, and aspect ratio, making them well suited to identifying objects with diverse geometries. Lastly, when computational resources are limited, object detection models tend to be less computationally intensive than segmentation models, which can require more processing power and memory to run efficiently.

Though the need for object detection over segmentation models may seem rare, our goal with this research is to identify opportunities to cut down on data labeling time, so that the human effort and cost saved can be put to better use elsewhere.

The benefit of using SAM over human labeling is that close to no manual labeling is required. While CrowdAI believes human-in-the-loop labeling has a place in machine learning, especially for more customized models, we are not blind to the fact that automated labeling allows for groundbreaking advancements and the quick iteration needed to move AI solutions out of R&D. Overall, automated labeling offers several advantages over manual labeling, including speed, consistency, accuracy, scalability, and adaptability. When and where automated labeling is possible, CrowdAI now has the capability to use it through our SAM and GroundingDINO integrations.

Ultimately, CrowdAI recognizes the importance of investing in ongoing research and development to keep pace with this fast-moving field. Through our dedication to exploring new AI solutions and technologies, we push the boundaries of what is possible in computer vision, unlocking new insights and creating innovative solutions that help shape the future. By incorporating new AI technologies like SAM and GroundingDINO into our platform, CrowdAI continues to evolve and provide new solutions that are changing the computer vision industry.

Find out more about our platform and capabilities at Crowdai.com.
