What is Computer Use
Gemini Computer use just launched and I think it is going to be amazing. So much of programming is making integrations to different systems, and many systems does not have integrations yet.
Also integrations also requires something both searching, combining and inserting records, where a interface often combines those things into one.
So what is computer use? It is the next evolution of Large Language Models, it means that they are able to look at screenshots and take action. Previously LLM's could also look at screenshots (essentially just pictures) and try to predict what to do, but so much of automations is keeping a red thread through your actions. A model trained on a series of actions will be so much better at keeping the task in mind and trying different things while solving the task.
But we are still early!
The Computer Use from Gemini is pretty bare, you have to combine it with some kind of computer, and Google Cloud does not offer anything like that. You can use it with other startups like Browserbase which has recently raised a lot of money, but even their tech does not help much with debugging.
Debugging is knowing what the model clicked on and what it tried to do. You also want to combine it with logged in states and allowing remote control when a Recaptcha shows up.
Simple Gemini Computer Use Example
Here is an example showing how to use Gemini's Computer Use API:
The API Call
import { GoogleGenAI } from '@google/genai'
const client = new GoogleGenAI({
apiKey: process.env.GEMINI_API_KEY,
})
const response = await client.models.generateContent({
model: 'gemini-2.5-computer-use-preview-10-2025',
contents: [
{
role: 'user',
parts: [
{
text: 'Go to dsb.dk and search for the next train to Copenhagen from Århus',
},
],
},
],
config: {
tools: [
{
computerUse: {
environment: 'ENVIRONMENT_BROWSER',
},
},
],
},
})
The Response
The model will respond with a function call telling you what to do:
{
"candidates": [
{
"content": {
"parts": [
{
"functionCall": {
"name": "open_web_browser"
}
}
],
"role": "model"
}
}
]
}
What You Do Next
- Execute the action (open the browser to that URL)
- Take a screenshot
- Send back a function response with the screenshot:
const turn2Response = await client.models.generateContent({
model: 'gemini-2.5-computer-use-preview-10-2025',
contents: [
{
role: 'user',
parts: [
{
text: 'Go to dsb.dk and search for the next train to Copenhagen from Århus',
},
],
},
{
role: 'model',
parts: [
{
functionCall: {
name: 'open_web_browser',
},
},
],
},
{
role: 'user',
parts: [
{
functionResponse: {
name: 'open_web_browser',
response: { url: 'https://dsb.dk' },
},
},
{
inlineData: {
mimeType: 'image/jpeg',
data: screenshotBytes.toString('base64'),
},
},
],
},
],
config: {
tools: [
{
computerUse: {
environment: 'ENVIRONMENT_BROWSER',
},
},
],
},
})
The Second Response
Gemini now sees the screenshot and might respond with the answer:
{
"candidates": [
{
"content": {
"parts": [
{
"text": "The next train is at XX"
}
],
"role": "model"
}
}
]
}
Or it might ask to click something:
{
"candidates": [
{
"content": {
"parts": [
{
"functionCall": {
"name": "click",
"args": {
"x": 250,
"y": 180
}
}
}
],
"role": "model"
}
}
]
}
Gemini Actions
The Computer Use API supports these function calls:
open_web_browser- Start webbrowsernavigate- Go to a URLclick- Click at coordinates (x, y on a 1000x1000 grid)type_text- Type text into focused elementkey_press- Press keyboard keysscroll- Scroll up or downnew_tab- Open a new browser tabclose_tab- Close current tabswitch_tab- Switch to a different tab
The loop
So the loop you should run it in:
- Sends task
- Model responds with action
- You execute action and send screenshot
- Repeat steps 2-3 until model returns answer