SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Qingni Wang¹*, Yue Fan²*, Xin Eric Wang¹

¹UC Santa Barbara, ²UC Santa Cruz

Spatial uncertainty · Risk-controlled actions · Finite-sample FDR · Black-box friendly

SafeGround helps GUI grounding models decide whether to execute, abstain, or cascade, preventing silent failures from uncertain clicks.

Why SafeGround

In GUI agents, a single wrong click can trigger costly or irreversible actions, from unintended payments to deleted files. The real danger is silent failure: most grounding models always output a coordinate, even when they are unsure.

SafeGround makes reliability actionable by estimating spatial uncertainty, calibrating a decision threshold with statistical guarantees, and enabling risk-controlled execution even with black-box models.

How It Works

Sample multiple grounding passes. SafeGround runs multiple stochastic predictions and aggregates them into a spatial distribution over the screen.

Measure spatial uncertainty. Uncertainty is derived from ambiguity, dispersion, and concentration of the spatial density map.

Calibrate with Learn-Then-Test. A threshold is learned on held-out data to control risk with finite-sample FDR guarantees.

Decide at test time. Low uncertainty executes directly; high uncertainty abstains or cascades to an expert model.
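The test-time decision in the last step can be sketched as a small gating function. This is a minimal illustration, not the authors' implementation: the names `Action`, `decide`, and the `has_expert` flag are hypothetical, and it assumes the calibrated threshold is already available.

```python
from enum import Enum


class Action(Enum):
    EXECUTE = "execute"
    ABSTAIN = "abstain"
    CASCADE = "cascade"


def decide(uncertainty: float, threshold: float, has_expert: bool) -> Action:
    """Gate one grounding prediction on its uncertainty score.

    Assumption: predictions at or below the calibrated threshold are
    executed; everything else is escalated if an expert model is
    available, and refused otherwise.
    """
    if uncertainty <= threshold:
        return Action.EXECUTE
    return Action.CASCADE if has_expert else Action.ABSTAIN
```

For example, `decide(0.12, 0.30, has_expert=True)` executes, while `decide(0.45, 0.30, has_expert=True)` cascades and `decide(0.45, 0.30, has_expert=False)` abstains.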

Spatial Uncertainty Signals

Ambiguity

Are there competing buttons or regions that could all be plausible?

$U_{TA} = \begin{cases} 1 - \dfrac{S_{(1)} - S_{(2)}}{S_{(1)} + \epsilon}, & M \ge 2 \\ \max(0.1, 1 - S_{(1)}), & \text{otherwise} \end{cases}$
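The ambiguity formula can be read as a direct computation over region scores. The sketch below assumes that $S_{(1)}$ and $S_{(2)}$ are the largest and second-largest region scores and that $M$ is the number of candidate regions; the function name `ambiguity` and the default `eps` are illustrative choices, not taken from the paper.

```python
def ambiguity(scores: list[float], eps: float = 1e-8) -> float:
    """Compute U_TA from a list of region scores (order does not matter)."""
    if not scores:
        raise ValueError("at least one region score is required")
    ranked = sorted(scores, reverse=True)
    s1 = ranked[0]
    if len(ranked) >= 2:
        # M >= 2: one minus the relative margin between the top two regions.
        s2 = ranked[1]
        return 1.0 - (s1 - s2) / (s1 + eps)
    # Single region: fall back to 1 - S_(1), floored at 0.1.
    return max(0.1, 1.0 - s1)
```

Two tied regions, as in `ambiguity([0.5, 0.5])`, give a score close to 1, while a single region with score 1.0 gives the floor value 0.1.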

Dispersion

Are the sampled predictions scattered across the screen instead of forming a tight cluster?

$U_{IE} = -\frac{1}{\log M}\sum_{i=1}^{M} \hat{p}_i \log(\hat{p}_i + \epsilon)$
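The dispersion formula is a normalized Shannon entropy, which can be sketched as follows. The sketch assumes $\hat{p}_i$ are region probabilities that sum to 1 over $M$ regions. The formula is undefined for $M = 1$ (since $\log 1 = 0$), so returning 0 in that case is an assumption of this sketch rather than something the source states.

```python
import math


def dispersion(probs: list[float], eps: float = 1e-8) -> float:
    """Compute U_IE: entropy of the region distribution divided by log M."""
    m = len(probs)
    if m < 2:
        # Assumption: with a single region there is nothing to disperse over.
        return 0.0
    entropy = -sum(p * math.log(p + eps) for p in probs)
    return entropy / math.log(m)
```

A uniform distribution such as `[0.25, 0.25, 0.25, 0.25]` yields a value very close to 1, and a one-hot distribution such as `[1.0, 0.0, 0.0, 0.0]` yields a value very close to 0 (the `eps` term shifts both slightly).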

Concentration

Is there a clear focal point that stays stable across runs?

$U_{CD} = 1 - \sum_{i=1}^{M} \hat{p}_i^{2}$
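The concentration formula is one minus the sum of squared region probabilities. A minimal sketch, assuming $\hat{p}_i$ sum to 1 (the function name `concentration_deficit` is illustrative):

```python
def concentration_deficit(probs: list[float]) -> float:
    """Compute U_CD = 1 - sum_i p_i^2 for a region distribution."""
    return 1.0 - sum(p * p for p in probs)
```

When all mass sits in one region the result is 0, and a uniform distribution over four regions gives 0.75. Unlike $U_{IE}$, this score is not normalized by $M$: for a uniform distribution it equals $1 - 1/M$, so it approaches but never reaches 1.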

These signals are combined into a unified uncertainty score $U_{\text{COM}}$ for risk-controlled decision making.

From Coordinates to Region Scores: We aggregate stochastically sampled points into patch-level density maps, from which the final uncertainty score is computed.
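One simple way to picture the coordinate-to-patch aggregation is to count how many sampled clicks fall into each cell of a fixed grid and normalize the counts. The sketch below makes that assumption explicitly; the paper's actual aggregation may differ (for example, it may smooth or weight the samples), and the names `patch_density` and `patch` are hypothetical.

```python
from collections import Counter


def patch_density(
    points: list[tuple[float, float]], width: int, height: int, patch: int
) -> dict[int, float]:
    """Map sampled (x, y) clicks to normalized per-patch frequencies.

    Patches are indexed row by row; points outside the screen are ignored.
    """
    cols = -(-width // patch)  # ceiling division: patches per row
    counts: Counter[int] = Counter()
    for x, y in points:
        if 0 <= x < width and 0 <= y < height:
            counts[(int(y) // patch) * cols + int(x) // patch] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {idx: c / total for idx, c in counts.items()}
```

For a 200×200 screen with 50-pixel patches, the samples `[(10, 10), (12, 14), (100, 100)]` put two thirds of the mass in patch 0 and one third in patch 10. The resulting frequencies can play the role of the $\hat{p}_i$ in the uncertainty signals.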

Experiments

Statistical Safety Guarantees

“Maybe it works” isn’t good enough for high-stakes actions. SafeGround adopts the Learn-Then-Test (LTT) paradigm to calibrate the uncertainty threshold and provides finite-sample guarantees on the False Discovery Rate (FDR).

You set the risk level; SafeGround calibrates the threshold so that the FDR of executed actions stays within that limit with high probability.

Cascading Inference

SafeGround acts as a gatekeeper: easy tasks stay local, hard or risky tasks are detected and escalated. The cascading rate remains low while system-level performance improves.

Safe by default: execute only when uncertainty is low.

Efficient escalation: route only the truly hard cases to an expert model.

Better Error Detection

Standard confidence scores (logits) often fail to reflect execution risk in GUI grounding. SafeGround consistently outperforms Probabilistic Confidence (PC) in distinguishing correct from incorrect actions and more effectively filters erroneous actions while preserving correct ones.

58.66%: Accuracy on ScreenSpot-Pro with SafeGround gating.

+5.38%: Improvement over a Gemini-only baseline.

Risk-controlled: Executed actions respect the chosen FDR limit.

BibTeX

@misc{wang2026safegroundknowtrustgui,
      title={SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration},
      author={Qingni Wang and Yue Fan and Xin Eric Wang},
      year={2026},
      eprint={2602.02419},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.02419},
}