ESPRESSIF · ESP-VISION OPEN-SOURCE FRAMEWORK

Real-time vision AI.
In a few lines of Python.
On ESP32.

Capture, processing, inference, and display — all on the device.

ESP-VISION live detection feed
01LOW-CODE

Less code. More power.

Camera, image processing, on-device inference, and hardware peripherals are all accessible through Python APIs.

OUTPUTESP32-P4
Object Detection
Per-frame inference with a quantized ESP-DL modelVision AI
main.py17 lines
import espdl
import sensor
import time

sensor.reset()
sensor.set_pixformat(sensor.RGB565)
sensor.set_framesize(sensor.QVGA)
sensor.skip_frames(time=1000)

det = espdl.ESPDet("/sdcard/hand_det.espdl", score=0.5, nms=0.7)
while True:
    img = sensor.snapshot()
    for x, y, w, h, score, category in det.detect(img):
        img.draw_rectangle(x, y, w, h, color=(255, 0, 0), thickness=2)
        img.draw_string(x, max(0, y - 12), "%.2f:%d" % (score, category))
    img.flush()
    time.sleep_ms(20)
02CAPABILITIES

Complexity. Handled.

Capture, processing, inference, and control all run on the device.

API01

Low-Code Python API

Unified sensor · image · display · espdl APIs — real-time results in a few lines of Python. Flash online, connect the Web IDE, no toolchain to set up.

AI02

On-Device ESP-DL Inference

Object detection, pose estimation, and image classification; load a quantized .espdl model in one line — real-time, offline, on device — or bring your own PyTorch / TensorFlow model.

VLM03

Cloud Vision LLMs

Stream straight to OpenAI-compatible vision APIs and tap multimodal models like GPT-4o — complex scene understanding with no local compute.

IMG04

Image Processing

Drawing, filtering, color tracking, feature detection, QR codes, barcodes, and AprilTags.

CDC05

Hardware Codec

H.264 / MJPEG / RTSP and USB CDC live preview, saturating the on-chip multimedia accelerators.

IO06

Rich Peripheral Support

Cameras, displays, SPI, I2C, UART, SD cards and more work out of the box, with a MicroPython machine-compatible API.

03BROWSER-BASED

Zero setup. Just go.

No toolchain installation is required; flashing, writing, and running all happen in the browser.

BROWSER

One-Click Flashing

Connect your board straight from the browser and flash the official firmware.

Online Flashing · Web Serial○ No device connected

Loading firmware manifest…

IDE / VS CODE

Write & Run

Write scripts in the Web IDE or VS Code and watch capture and inference results in real time.

04CONNECT AI

Describe it. AI builds it.

Connect the ESP-VISION MCP to an AI assistant to build edge vision applications through conversation.

Server URL
https://mcp.vision.espressif.com
Auto Install

Click the button below to add this server to Cursor automatically.

Add to Cursor →
Manual config (mcp.json)
{
  "mcpServers": {
    "esp-vision-mcp": {
      "url": "https://mcp.vision.espressif.com"
    }
  }
}
05MODELS

Any model. Just one line.

ESP-DL and TFLite Micro models are loaded from device storage in a single line of code.

ESP-DL7 models
ModelTaskInputDatasetSize
ESPDet Pico Cat
ESPDet Pico · espdl.ESPDet

Detects cats in camera images.

cat
Object Detection
224×224
RGB565
Cat487 KB
ESPDet Pico Cat & Dog
ESPDet Pico · espdl.ESPDet

Detects cats and dogs in camera images.

catdog
Object Detection
224×224
RGB565
Cat & Dog561 KB
ESPDet Pico Dog
ESPDet Pico · espdl.ESPDet

Detects dogs in camera images.

dog
Object Detection
224×224
RGB565
Dog486 KB
ESPDet Pico Face
ESPDet Pico · espdl.ESPDet

Detects human faces in camera images.

face
Object Detection
224×224
RGB565
Face484 KB
ESPDet Pico Hand
ESPDet Pico · espdl.ESPDet

Detects human hands in camera images.

hand
Object Detection
224×224
RGB565
Hand486 KB
YOLO11n COCO
YOLO11n · espdl.YOLO11

Detects the 80 COCO object classes in camera images.

Object Detection
160×160
RGB565
COCO2.7 MB
YOLO11n-Pose COCO
YOLO11n-Pose · espdl.YOLO11nPose

Detects people and estimates 17 COCO body keypoints (human pose) in camera images.

person
Pose Estimation
160×160
RGB565
COCO3.0 MB
TFLite Micro2 models
ModelTaskInputDatasetSize
Person Detection
MobileNet · tflite.Model

TensorFlow Lite Micro person-detection model: classifies whether a person is present in a 96x96 grayscale camera frame.

no personperson
Image Classification
96×96
GRAYSCALE
Visual Wake Words294 KB
Sine
MLP · tflite.Model

TensorFlow Lite Micro "hello world" model: approximates sin(x) for x in [0, 2*pi] from a single scalar input.

Regression
1
FLOAT32
Synthetic2 KB
06BOARDS

One framework. Every board.

Camera, display, and storage work out of the box across the ESP32-S31, ESP32-P4, and ESP32-S3 series.

ImageBoardChipESP-VISION Support
ESP32-P4X-EYE
ESP32-P4X-EYEESP32-P4
Supported
sensor · image · display · espdl · tflite · imageio · h264 · rtsp · barcode
ESP32-P4X-Function-EV-Board
ESP32-P4X-Function-EV-BoardESP32-P4
Supported
sensor · image · display · espdl · tflite · imageio · h264 · rtsp · barcode
ESP32-S3-EYE
ESP32-S3-EYEESP32-S3
Supported
sensor · image · display · espdl · tflite · imageio
ESP32-S31-Korvo
ESP32-S31-KorvoESP32-S31
SupportedESP-IDF master only
sensor · image · display · espdl · tflite · imageio