Know When to Trust GUI Grounding Models via Uncertainty Calibration
SafeGround helps GUI grounding models decide whether to execute, abstain, or cascade, preventing silent failures from uncertain clicks.
In GUI agents, a single wrong click can trigger costly or irreversible actions, from making unintended payments to deleting important files. The real danger is silent failure: most grounding models always output a coordinate, even when they are unsure.
SafeGround makes reliability actionable by estimating spatial uncertainty, calibrating a decision threshold with statistical guarantees, and enabling risk-controlled execution even with black-box models.
Are there competing buttons or regions that could all be plausible?
Is the attention scattered across the screen instead of forming a tight cluster?
Is there a clear focal point that stays stable across runs?
These signals are combined into a unified uncertainty score $U_{\text{COM}}$ for risk-controlled decision making.
From Coordinates to Region Scores: We aggregate stochastic sampling points into patch-level density maps to calculate the final uncertainty score.
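The aggregation step can be sketched as follows. This is a minimal illustration, not the paper's definition of $U_{\text{COM}}$: the function names, the 100-pixel patch size, and the two signals shown (normalized entropy as a dispersion measure and the mass of the densest patch as a focal-point measure) are assumptions chosen to mirror the questions above.

```python
import math
from collections import Counter


def patch_density(points, width, height, patch=100):
    """Bin sampled click points into square patches and normalize the counts."""
    if not points:
        raise ValueError("need at least one sampled point")
    counts = Counter()
    for x, y in points:
        # Clamp to the screen so out-of-range samples fall into edge patches.
        col = min(max(int(x), 0), width - 1) // patch
        row = min(max(int(y), 0), height - 1) // patch
        counts[(row, col)] += 1
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def spatial_uncertainty(points, width, height, patch=100):
    """Return two illustrative signals derived from the patch-level density map.

    dispersion: Shannon entropy of the density map, normalized to [0, 1] by the
        log of the number of patches on the screen (hypothetical stand-in for
        "is the attention scattered?").
    top_patch_mass: fraction of samples in the densest patch (hypothetical
        stand-in for "is there a clear focal point?").
    """
    density = patch_density(points, width, height, patch)
    n_patches = math.ceil(width / patch) * math.ceil(height / patch)
    entropy = sum(-p * math.log(p) for p in density.values())
    dispersion = entropy / math.log(n_patches) if n_patches > 1 else 0.0
    return {"dispersion": dispersion, "top_patch_mass": max(density.values())}
```

Under these assumptions, repeated samples that all land in one patch give a dispersion of zero and a top-patch mass of one, while samples spread evenly over every patch give a dispersion of one. How such signals are weighted into a single score is left to the paper.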
“Maybe it works” isn’t good enough for high-stakes actions. SafeGround adopts the Learn Then Test (LTT) paradigm to calibrate the uncertainty threshold and provides finite-sample guarantees on the False Discovery Rate (FDR).
You set the target risk level; SafeGround calibrates the threshold so that the error rate among executed actions stays within that limit, with high probability over the calibration data.
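One common way to instantiate Learn Then Test is sketched below. It is an assumption-laden illustration rather than the paper's exact procedure: it uses an exact binomial tail p-value for each candidate threshold and a Bonferroni correction across candidates, and the names `calibrate_threshold`, `alpha` (target FDR), `delta` (allowed failure probability), and `candidates` are hypothetical.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k), computed in log space to avoid overflow."""
    total = 0.0
    for i in range(k + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_term)
    return min(1.0, total)


def calibrate_threshold(scores, correct, alpha, delta, candidates):
    """Pick the largest uncertainty threshold certified at FDR <= alpha.

    scores:  uncertainty score per calibration example
    correct: True if the model's click was correct for that example
    An action is "executed" at threshold t when its score is <= t.
    Returns None when no candidate can be certified.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    certified = []
    for t in candidates:
        executed = [c for u, c in zip(scores, correct) if u <= t]
        n = len(executed)
        if n == 0:
            continue  # no evidence at this threshold, so it cannot be certified
        errors = n - sum(executed)
        # p-value for the null hypothesis "error rate among executed > alpha".
        p_value = binom_cdf(errors, n, alpha)
        # Bonferroni correction keeps the family-wise error rate below delta.
        if p_value <= delta / len(candidates):
            certified.append(t)
    return max(certified) if certified else None
```

With this construction, every certified threshold has an error rate among executed actions of at most `alpha`, with probability at least `1 - delta` over the draw of the calibration set, assuming calibration and test examples are exchangeable. Returning `None` corresponds to abstaining on everything when the data cannot support the requested risk level.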
SafeGround acts as a gatekeeper: easy tasks stay local, hard or risky tasks are detected and escalated. The cascading rate remains low while system-level performance improves.
Safe by default: execute only when uncertainty is low.
Efficient escalation: route only the truly hard cases to an expert model.
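The gatekeeper behavior described above can be sketched as a small routing function. The `Decision` enum, the `gate` function, and the `expert_available` flag are hypothetical names for illustration; the threshold is assumed to come from a prior calibration step.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"  # run the local model's click
    CASCADE = "cascade"  # hand the task to a stronger expert model
    ABSTAIN = "abstain"  # take no action


def gate(uncertainty, threshold, expert_available=True):
    """Route one grounding prediction based on its uncertainty score."""
    if uncertainty <= threshold:
        return Decision.EXECUTE
    # Uncertain predictions are never executed locally.
    return Decision.CASCADE if expert_available else Decision.ABSTAIN


def cascade_rate(uncertainties, threshold):
    """Fraction of predictions that would be escalated at this threshold."""
    escalated = sum(1 for u in uncertainties if u > threshold)
    return escalated / len(uncertainties)
```

In this sketch, a prediction is executed only when its uncertainty is at or below the threshold, so the default for an uncertain click is to escalate or do nothing rather than act.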
Standard confidence scores (logits) often fail to reflect execution risk in GUI grounding. SafeGround consistently outperforms Probabilistic Confidence (PC) in distinguishing correct from incorrect actions and more effectively filters erroneous actions while preserving correct ones.
Accuracy on ScreenSpot-Pro with SafeGround gating.
Improvement over a Gemini-only baseline.
Executed actions respect the chosen FDR limit.
@misc{wang2026safegroundknowtrustgui,
title={SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration},
author={Qingni Wang and Yue Fan and Xin Eric Wang},
year={2026},
eprint={2602.02419},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.02419},
}