Introducing Agentic Vision in Gemini 3 Flash
Summary
Agentic Vision is a new capability in Gemini 3 Flash that turns image understanding from a static process into an active, agentic investigation by combining visual reasoning with code execution. The process follows a Think, Act, Observe loop: the model plans its steps, executes Python code to manipulate or analyze the image (for example, cropping or counting), and observes the transformed output so that its final response is grounded in visual evidence. Enabling the feature yields a consistent 5-10% quality boost across vision benchmarks. Practical applications include iteratively inspecting high-resolution inputs for plan validation, using a "visual scratchpad" to annotate images for accurate counting, and offloading complex visual math and plotting to a deterministic Python environment to avoid hallucinations. Planned work includes making more behaviors implicit, adding tools such as web search, and extending Agentic Vision to other Gemini model sizes. The feature is available today through the Gemini API in Google AI Studio and Vertex AI, and is rolling out in the Gemini app.
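To make the Think, Act, Observe loop more concrete, here is a minimal, self-contained sketch of the kind of deterministic Python a model might run during the Act step: crop a region, then count objects by labeling them (a rough stand-in for the "visual scratchpad"). Everything in it is an assumption for illustration only: the toy `IMAGE` grid, the helper names `crop`, `label_objects`, and `think_act_observe`, and the fixed plan. It does not call the Gemini API and is not the code the model actually generates.

```python
from collections import deque

# Hypothetical 6x8 binary "image": 1 = object pixel, 0 = background.
IMAGE = [
    [1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
]


def crop(image, top, left, height, width):
    """Return the sub-grid of the given size whose corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def label_objects(image):
    """Label 4-connected blobs with 1, 2, 3, ... and return (count, labels)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or labels[r][c] != 0:
                continue
            # New, unlabeled object found: flood-fill it with the next label.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return count, labels


def think_act_observe(image):
    """Run one illustrative pass of the loop and return what was observed."""
    # Think: a fixed, hypothetical plan (a real model would generate this).
    plan = ["crop the top-left region", "count objects in the crop",
            "count objects in the full image"]
    # Act: execute deterministic code instead of estimating by eye.
    region = crop(image, top=0, left=0, height=4, width=5)
    crop_count, _ = label_objects(region)
    total_count, labeled = label_objects(image)
    # Observe: the results that would ground the final answer.
    return {"plan": plan, "crop_count": crop_count,
            "total_count": total_count, "labeled": labeled}


if __name__ == "__main__":
    result = think_act_observe(IMAGE)
    print("objects in crop:", result["crop_count"])
    print("objects in image:", result["total_count"])
    for row in result["labeled"]:
        print(" ".join(str(v) for v in row))
```

On this toy grid the script reports 2 objects in the crop and 5 in the full image, and the printed label grid plays the role of the annotated image. The count comes from executed code rather than from an estimate, which is the point of the Act and Observe steps.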
(Source: Gemini)