Visual Intelligence
Overview & Key Concept
Traditional computer vision gives you bounding boxes and labels like "person" or "car." EyePop's Visual Intelligence Agent goes beyond detection—it understands and describes what it sees using natural language prompts.
Instead of just knowing "there's a person here," you can ask:
"What's their age range and fashion style?"
"Are they wearing glasses?"
"What activity are they doing?"
"Describe their outfit in detail"
The Power of Visual Intelligence
Visual Intelligence combines object detection with vision-language understanding, enabling you to:
Ask any question about detected objects using natural language
Get structured responses with confidence scores
Build custom business logic around visual understanding
Create dynamic analysis that adapts to your specific needs
Key Component Distinction
Understanding when to use each component is crucial:
eyepop.localize-objects:latest
What it does: Find and label objects with bounding boxes
Use it when: You need object locations and want custom bounding box labels
eyepop.image-contents:latest
What it does: Analyze and describe visual content with prompts
Use it when: You need detailed analysis or descriptions of detected regions
Core Architecture Pattern
Visual Intelligence follows a simple but powerful pattern:
Detect → Crop → Analyze
How Component Chaining Works
Detection Component finds objects and creates bounding boxes
Forward Operator crops each detected region from the original image
Target Component receives cropped images and analyzes them with your custom prompts
Results combine detection data with natural language analysis
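A minimal sketch of this pattern as a Pop definition, assuming the @eyepop.ai/eyepop Node SDK (the PopComponentType enum and the params/prompts shape are assumptions; the forward operators are covered below):

```typescript
import { PopComponentType, ForwardOperatorType } from '@eyepop.ai/eyepop'

// Detect -> Crop -> Analyze: the detector finds people, the CROP forward
// operator cuts out each detection, and image-contents analyzes every crop.
const pop = {
  components: [{
    type: PopComponentType.INFERENCE,              // assumed component type enum
    ability: 'eyepop.person:latest',               // 1. detection component
    forward: {
      operator: { type: ForwardOperatorType.CROP },  // 2. crop each detected region
      targets: [{
        type: PopComponentType.INFERENCE,
        ability: 'eyepop.image-contents:latest',   // 3. target component with your prompt
        params: {
          // the 'prompts' key is an assumption; check your SDK reference for the exact shape
          prompts: [{
            prompt: "Describe their outfit in detail. If you cannot tell, return null."
          }]
        }
      }]
    }
  }]
}
```

The analysis for each crop is attached to the detection that produced it, which is how the final result combines detection data with the natural-language answers (step 4).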
Essential Components
Detection Components
These components find objects and create regions for analysis:
eyepop.person:latest - Detect people
eyepop.person.2d-body-points - Detect people with 2D body points
eyepop.vehicle:latest - Detect vehicles
eyepop.text:latest - Detect text
eyepop.common-objects - Detect common objects
For the complete list of all available models and abilities, see the Composable Pops Documentation.
Visual Intelligence Component
eyepop.image-contents:latest - The core Visual Intelligence model that analyzes images based on your prompts
Forward Operators
ForwardOperatorType.CROP - Extract detected regions and send them to target components
ForwardOperatorType.FULL - Send the entire image to target components
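For instance, switching between the two operators only changes the forward configuration (same assumptions as the sketch above):

```typescript
import { ForwardOperatorType } from '@eyepop.ai/eyepop'

// Send each detected region, cropped from the original image, to the targets:
const cropForward = { operator: { type: ForwardOperatorType.CROP }, targets: [/* ... */] }

// Send the entire original image to the targets instead:
const fullForward = { operator: { type: ForwardOperatorType.FULL }, targets: [/* ... */] }
```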
Prompt Engineering for Vision
The Foundation Pattern
Always include this safety instruction in your prompts:
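A clause along these lines works (the exact wording is flexible):

```
If you cannot confidently determine the answer, return null.
```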
This ensures the model returns classLabel: null when uncertain, preventing hallucinations.
Effective Prompt Structure
✅ Good Prompts:
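For example:

"Determine the person's age range: 'child', 'teen', 'adult', or 'senior'. If you cannot tell, return null."
"Classify the clothing style as 'casual', 'business', 'athletic', or 'formal'. Return null if uncertain."
"Describe the person's outfit in one short sentence. Return null if the person is not clearly visible."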
❌ Avoid These Patterns:
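For example:

"Is the person happy?" (leading question)
"Describe their age, mood, job, personality, and intentions" (too long, invites guessing)
"Is the outfit stylish?" (ambiguous term with no definition, and no null fallback)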
Understanding the Response Format
Visual Intelligence returns structured data:
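The exact payload depends on your Pop, but a chained person-analysis result looks roughly like this (simplified and illustrative; inspect your own output for the exact field names and nesting):

```json
{
  "objects": [
    {
      "classLabel": "person",
      "confidence": 0.96,
      "x": 118, "y": 42, "width": 260, "height": 540,
      "classes": [
        {
          "classLabel": "business casual",
          "category": "fashion style",
          "confidence": 0.88
        }
      ]
    }
  ]
}
```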
When uncertain, you'll get:
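Something like this instead (again illustrative):

```json
{
  "classLabel": null,
  "confidence": 0.31
}
```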
Complete Examples
Example 1: Person Style Analysis
Analyze fashion and demographics of detected people:
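A sketch of such a Pop, under the same SDK assumptions as earlier (the confidenceThreshold field name is also an assumption):

```typescript
import { PopComponentType, ForwardOperatorType } from '@eyepop.ai/eyepop'

const personStylePop = {
  components: [{
    type: PopComponentType.INFERENCE,
    ability: 'eyepop.person:latest',
    confidenceThreshold: 0.9,   // only keep detections at >= 90% confidence (field name assumed)
    forward: {
      operator: { type: ForwardOperatorType.CROP },
      targets: [{
        type: PopComponentType.INFERENCE,
        ability: 'eyepop.image-contents:latest',
        params: {
          prompts: [{
            prompt: "Report the person's age range, apparent gender, fashion style " +
                    "('casual', 'business', 'athletic', or 'formal'), and a short outfit description. " +
                    "Return null for any value you cannot determine."
          }]
        }
      }]
    }
  }]
}
```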
What this does:
Detects people with a 90% confidence threshold
Crops each detected person
Analyzes age, gender, fashion style, and outfit description
Returns structured data for each person
Example 2: Multi-Question Object Analysis
Ask multiple specific questions about detected objects:
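One way to do this is to give eyepop.image-contents:latest several prompts, assuming the prompts parameter accepts multiple entries (same SDK assumptions as earlier):

```typescript
import { PopComponentType, ForwardOperatorType } from '@eyepop.ai/eyepop'

const multiQuestionPop = {
  components: [{
    type: PopComponentType.INFERENCE,
    ability: 'eyepop.person:latest',
    forward: {
      operator: { type: ForwardOperatorType.CROP },
      targets: [{
        type: PopComponentType.INFERENCE,
        ability: 'eyepop.image-contents:latest',
        params: {
          prompts: [
            { prompt: "Are they wearing glasses? Answer 'yes' or 'no', or return null if not visible." },
            { prompt: "What is their age range: 'child', 'teen', 'adult', or 'senior'? Return null if uncertain." },
            { prompt: "What activity are they doing? Return null if unclear." }
          ]
        }
      }]
    }
  }]
}
```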
Example 3: Activity Recognition
Understand what people are doing:
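A sketch (same assumptions as earlier) that constrains the answer to a fixed set of activities:

```typescript
import { PopComponentType, ForwardOperatorType } from '@eyepop.ai/eyepop'

const activityPop = {
  components: [{
    type: PopComponentType.INFERENCE,
    ability: 'eyepop.person:latest',
    forward: {
      operator: { type: ForwardOperatorType.CROP },
      targets: [{
        type: PopComponentType.INFERENCE,
        ability: 'eyepop.image-contents:latest',
        params: {
          prompts: [{
            prompt: "What activity is this person doing? Choose one of 'walking', 'sitting', " +
                    "'running', 'working', or 'other'. If you cannot tell, return null."
          }]
        }
      }]
    }
  }]
}
```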
Custom Class Labels with Object Localization
When you need to find objects and display custom labels on bounding boxes, use eyepop.localize-objects:latest:
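A sketch under the same SDK assumptions, using the prompt/label pairing described below:

```typescript
import { PopComponentType } from '@eyepop.ai/eyepop'

// Find dogs and cats, but show custom labels on the resulting bounding boxes.
const customLabelPop = {
  components: [{
    type: PopComponentType.INFERENCE,
    ability: 'eyepop.localize-objects:latest',
    params: {
      prompts: [
        { prompt: 'dog', label: 'Best Friend' },
        { prompt: 'cat', label: 'Just a Cat' }
      ]
    }
  }]
}
```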

What this does:
prompt tells the model what to find ("dog", "cat")
label sets what appears on the bounding box ("Best Friend", "Just a Cat")
Returns bounding boxes with your custom labels instead of generic "dog" or "cat"
Custom Business Logic Examples
Retail Analytics:
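For example: "Which product category is the shopper interacting with: 'apparel', 'electronics', 'grocery', or 'other'? Return null if unclear."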
Security & Safety:
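For example: "Is the person wearing a hard hat and a high-visibility vest? Answer 'yes' or 'no', or return null if not visible."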
Healthcare:
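For example: "Is the person wearing a face mask? Answer 'yes' or 'no', or return null if uncertain."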
Common Pitfalls & Troubleshooting
When to Use Which Component
Use eyepop.localize-objects:latest when:
You need bounding boxes with custom labels
You're building object detection with specific labeling needs
You want to find specific objects by description
Use eyepop.image-contents:latest when:
You need detailed analysis or descriptions
You want to ask questions about visual content
You need structured responses to custom prompts
Prompt Design Mistakes
❌ Don't:
Ask leading questions ("Is the person happy?")
Make prompts too complex or long
Forget the null safety instruction
Use ambiguous terms without definition
✅ Do:
Ask open-ended analytical questions
Define categories clearly
Include the null safety pattern
Break complex analysis into multiple prompts
Confidence Considerations
Visual Intelligence returns confidence scores
Low confidence often correlates with null responses
Use confidence thresholds in your application logic
Consider multiple prompts for critical decisions
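As an example of application-side thresholding (a sketch that assumes results shaped like the illustrative JSON above):

```typescript
// Treat low-confidence answers the same as null so downstream logic stays conservative.
const MIN_CONFIDENCE = 0.7  // tune for your use case

interface AnalysisClass {
  classLabel: string | null
  confidence?: number
}

function trustedLabel(result: AnalysisClass): string | null {
  if (result.classLabel === null) return null
  if ((result.confidence ?? 0) < MIN_CONFIDENCE) return null
  return result.classLabel
}
```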
Getting Started
Ready to build your own Visual Intelligence Pop? Start with this template:
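A starting template, under the same Pop-shape assumptions as the earlier sketches:

```typescript
import { PopComponentType, ForwardOperatorType } from '@eyepop.ai/eyepop'

const myFirstVisualIntelligencePop = {
  components: [{
    type: PopComponentType.INFERENCE,
    ability: '[YOUR_DETECTOR]',              // e.g. 'eyepop.person:latest'
    forward: {
      operator: { type: ForwardOperatorType.CROP },
      targets: [{
        type: PopComponentType.INFERENCE,
        ability: 'eyepop.image-contents:latest',
        params: {
          prompts: [{
            prompt: '[YOUR QUESTION HERE] If you cannot confidently determine the answer, return null.'
          }]
        }
      }]
    }
  }]
}
```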
Replace [YOUR_DETECTOR] with your chosen detection model and [YOUR QUESTION HERE] with your custom prompt, and you're ready to start analyzing!