Visual Intelligence

Overview & Key Concept

Traditional computer vision gives you bounding boxes and labels like "person" or "car." EyePop's Visual Intelligence Agent goes beyond detection: it understands what it sees and describes it in response to natural language prompts.

Instead of just knowing "there's a person here," you can ask:

  • "What's their age range and fashion style?"

  • "Are they wearing glasses?"

  • "What activity are they doing?"

  • "Describe their outfit in detail"

The Power of Visual Intelligence

Visual Intelligence combines object detection with vision-language understanding, enabling you to:

  • Ask any question about detected objects using natural language

  • Get structured responses with confidence scores

  • Build custom business logic around visual understanding

  • Create dynamic analysis that adapts to your specific needs

Key Component Distinction

Understanding when to use each component is crucial:

Component | Purpose | Use When
eyepop.localize-objects:latest | Find and label objects with bounding boxes | You need object locations and want custom bounding box labels
eyepop.image-contents:latest | Analyze and describe visual content with prompts | You need detailed analysis or descriptions of detected regions

Core Architecture Pattern

Visual Intelligence follows a simple but powerful pattern:

Detect → Crop → Analyze

How Component Chaining Works

  1. Detection Component finds objects and creates bounding boxes

  2. Forward Operator crops each detected region from the original image

  3. Target Component receives cropped images and analyzes them with your custom prompts

  4. Results combine detection data with natural language analysis
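
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the EyePop SDK: the image is a toy nested list of pixels, and `Box`, `crop`, `run_chain`, `fake_detect`, and `fake_analyze` are hypothetical stand-ins for the detection component, the crop forward operator, and the target component.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical shape)."""
    label: str
    x: int
    y: int
    width: int
    height: int


def crop(image: Image, box: Box) -> Image:
    """Step 2: cut the detected region out of the original image."""
    rows = image[box.y:box.y + box.height]
    return [row[box.x:box.x + box.width] for row in rows]


def run_chain(
    image: Image,
    detect: Callable[[Image], List[Box]],
    analyze: Callable[[Image, str], str],
    prompt: str,
) -> List[Dict[str, object]]:
    """Detect -> Crop -> Analyze, then merge both results per object (step 4)."""
    results = []
    for box in detect(image):             # step 1: find objects
        region = crop(image, box)         # step 2: crop each region
        answer = analyze(region, prompt)  # step 3: prompt-driven analysis
        results.append({"box": box, "analysis": answer})
    return results


# Stand-in components so the sketch runs without any model.
def fake_detect(image: Image) -> List[Box]:
    return [Box("person", x=1, y=0, width=2, height=2)]


def fake_analyze(region: Image, prompt: str) -> str:
    return f"{len(region)}x{len(region[0])} region analyzed for: {prompt}"


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
chained = run_chain(image, fake_detect, fake_analyze, "Describe their outfit")
```

Each entry in `chained` pairs the original bounding box with the analysis of its cropped region, which mirrors how detection data and natural language analysis are combined in the final result.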

Essential Components

Detection Components

These components find objects and create regions for analysis:

  • eyepop.person:latest - Detect people

  • eyepop.person.2d-body-points - Detect people with 2D body points

  • eyepop.vehicle:latest - Detect vehicles

  • eyepop.text:latest - Detect text

  • eyepop.common-objects - Detect common objects

  • For the complete list of available models and abilities, see the Composable Pops documentation

Visual Intelligence Component

  • eyepop.image-contents:latest - The core Visual Intelligence model that analyzes images based on your prompts

Forward Operators

  • ForwardOperatorType.CROP - Extract detected regions and send them to target components

  • ForwardOperatorType.FULL - Send the entire image to target components

Prompt Engineering for Vision

The Foundation Pattern

Always include this safety instruction in your prompts:

This instructs the model to return classLabel: null when it is uncertain, which reduces hallucinated answers.
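
One way to avoid forgetting the instruction is to append it in code. The sketch below is a hedged illustration: the sentence in `NULL_SAFETY_INSTRUCTION` is illustrative wording rather than the exact text from this page, and `with_null_safety` is a hypothetical helper, not part of any SDK.

```python
# Illustrative wording only; the exact sentence from the original page is not
# reproduced here.
NULL_SAFETY_INSTRUCTION = (
    "If you are unable to provide a category with a value, "
    "set its classLabel to null."
)


def with_null_safety(prompt: str) -> str:
    """Append the null-safety instruction unless the prompt already has it."""
    prompt = prompt.strip()
    if NULL_SAFETY_INSTRUCTION.lower() in prompt.lower():
        return prompt
    # Close the prompt with a period if it has no end punctuation yet.
    separator = "" if prompt.endswith((".", "?", "!")) else "."
    return f"{prompt}{separator} {NULL_SAFETY_INSTRUCTION}"


safe_prompt = with_null_safety("Are they wearing glasses?")
```

Calling the helper twice leaves the prompt unchanged, so it is safe to apply at every point where prompts are assembled.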

Effective Prompt Structure

✅ Good Prompts:

❌ Avoid These Patterns:

Understanding the Response Format

Visual Intelligence returns structured data:

When uncertain, you'll get:
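
As a rough sketch of how such a response might be consumed, the code below parses a hand-written JSON payload into typed records. The payload shape is an assumption: only `classLabel` and confidence scores are named on this page, while the `classes` list and the `category` and `confidence` field names are hypothetical, so check them against the actual response before relying on them.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    category: str               # assumed field name
    class_label: Optional[str]  # None mirrors classLabel: null
    confidence: float


def parse_answers(payload: str) -> List[Answer]:
    """Parse an assumed response shape: {"classes": [{...}, ...]}."""
    data = json.loads(payload)
    return [
        Answer(
            category=item.get("category", ""),
            class_label=item.get("classLabel"),  # JSON null becomes None
            confidence=float(item.get("confidence", 0.0)),
        )
        for item in data.get("classes", [])
    ]


# Hand-written sample payload, not real model output.
sample = """
{"classes": [
  {"category": "glasses", "classLabel": "yes", "confidence": 0.93},
  {"category": "age range", "classLabel": null, "confidence": 0.21}
]}
"""

answers = parse_answers(sample)
known = [a for a in answers if a.class_label is not None]
uncertain = [a.category for a in answers if a.class_label is None]
```

Keeping `class_label` as an `Optional[str]` makes the uncertain case explicit, so downstream code has to decide what to do with a null answer instead of treating it as a string.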

Complete Examples

Example 1: Person Style Analysis

Analyze fashion and demographics of detected people:

What this does:

  1. Detects people using a 90% confidence threshold

  2. Crops each detected person

  3. Analyzes age, gender, fashion style, and outfit description

  4. Returns structured data for each person

Example 2: Multi-Question Object Analysis

Ask multiple specific questions about detected objects:

Example 3: Activity Recognition

Understand what people are doing:

Custom Class Labels with Object Localization

When you need to find objects and display custom labels on bounding boxes, use eyepop.localize-objects:latest:

What this does:

  • prompt tells the model what to find ("dog", "cat")

  • label sets what appears on the bounding box ("Best Friend", "Just a Cat")

  • Returns bounding boxes with your custom labels instead of generic "dog" or "cat"
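
The distinction between prompt and label can be illustrated with a small data structure. This is a sketch only: `LabeledPrompt`, `label_lookup`, and `display_label` are hypothetical names, and in the real component the relabeling happens inside the service rather than in your code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LabeledPrompt:
    prompt: str  # what the model should look for
    label: str   # what is shown on the bounding box


def label_lookup(pairs: List[LabeledPrompt]) -> Dict[str, str]:
    """Map each search prompt to its display label."""
    return {pair.prompt: pair.label for pair in pairs}


def display_label(found: str, lookup: Dict[str, str]) -> str:
    """Return the custom label, falling back to the generic one."""
    return lookup.get(found, found)


pairs = [LabeledPrompt("dog", "Best Friend"), LabeledPrompt("cat", "Just a Cat")]
lookup = label_lookup(pairs)
labels = [display_label(name, lookup) for name in ["dog", "cat", "bird"]]
```

Here `labels` holds the custom labels for "dog" and "cat" and leaves "bird" unchanged, because no custom label was supplied for it.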

Custom Business Logic Examples

Retail Analytics:

Security & Safety:

Healthcare:

Common Pitfalls & Troubleshooting

When to Use Which Component

Use eyepop.localize-objects:latest when:

  • You need bounding boxes with custom labels

  • You're building object detection with specific labeling needs

  • You want to find specific objects by description

Use eyepop.image-contents:latest when:

  • You need detailed analysis or descriptions

  • You want to ask questions about visual content

  • You need structured responses to custom prompts

Prompt Design Mistakes

❌ Don't:

  • Ask leading questions ("Is the person happy?")

  • Make prompts too complex or long

  • Forget the null safety instruction

  • Use ambiguous terms without definition

✅ Do:

  • Ask open-ended analytical questions

  • Define categories clearly

  • Include the null safety pattern

  • Break complex analysis into multiple prompts
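
Some of these checks can be automated before a prompt is sent. The linter below is a crude sketch with assumed heuristics: a yes/no opener is only a rough proxy for a leading question, the 600-character limit is arbitrary, and the null check merely looks for the word "null". `lint_prompt` is a hypothetical helper.

```python
import re
from typing import List

MAX_PROMPT_CHARS = 600  # arbitrary limit chosen for this sketch

# Yes/no openers often signal a leading question.
LEADING_OPENERS = re.compile(
    r"^\s*(is|are|does|do|did|was|were|can|has|have)\b", re.IGNORECASE
)


def lint_prompt(prompt: str) -> List[str]:
    """Return a list of warnings for common prompt design mistakes."""
    warnings = []
    if LEADING_OPENERS.match(prompt):
        warnings.append("leading yes/no question; prefer an open-ended one")
    if len(prompt) > MAX_PROMPT_CHARS:
        warnings.append("prompt is long; consider splitting it into several prompts")
    if "null" not in prompt.lower():
        warnings.append("missing null safety instruction")
    return warnings


leading = lint_prompt("Is the person happy?")
better = lint_prompt(
    "What emotion does the person's facial expression show "
    "(happy, neutral, sad, angry)? "
    "If you cannot tell, set classLabel to null."
)
```

The first prompt triggers two warnings, while the second defines its categories, stays open-ended, and includes a null instruction, so it passes with no warnings.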

Confidence Considerations

  • Visual Intelligence returns confidence scores

  • Low confidence often correlates with null responses

  • Use confidence thresholds in your application logic

  • Consider multiple prompts for critical decisions
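
The last two points can be combined in application logic. The following sketch assumes each answer is reduced to a `(classLabel, confidence)` pair; the 0.8 threshold and the two-vote rule are arbitrary choices for illustration, and `accept` and `consensus` are hypothetical helpers.

```python
from collections import Counter
from typing import List, Optional, Tuple

Answer = Tuple[Optional[str], float]  # (classLabel or None, confidence)


def accept(answer: Answer, threshold: float = 0.8) -> Optional[str]:
    """Keep a label only if it is non-null and meets the threshold."""
    label, confidence = answer
    if label is None or confidence < threshold:
        return None
    return label


def consensus(
    answers: List[Answer], threshold: float = 0.8, min_votes: int = 2
) -> Optional[str]:
    """Combine answers from several prompts; require min_votes agreeing labels."""
    accepted = (accept(answer, threshold) for answer in answers)
    votes = Counter(label for label in accepted if label is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


# Hand-written values standing in for three differently worded prompts.
helmet_answers = [("yes", 0.92), ("yes", 0.85), (None, 0.30)]
decision = consensus(helmet_answers)
```

Returning `None` when the prompts disagree or fall below the threshold gives the application a clear "unknown" state to route to manual review, instead of acting on a weak answer.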

Getting Started

Ready to build your own Visual Intelligence Pop? Start with this template:

Replace [YOUR_DETECTOR] with your chosen detection model and [YOUR QUESTION HERE] with your custom prompt, and you're ready to start analyzing!
