Evidence-Verified Trace-Level Reuse and Selective Regeneration for LLM-Based Cloud Incident Remediation

Alejandro Serrano
Yi Guo

Abstract

Large language model (LLM) agents are increasingly used to interpret operational evidence, diagnose failures, and recommend remediation actions in cloud-native systems. Incident requests often share a stable reasoning structure while differing in localized evidence, service names, or root-cause conditions. Whole-response semantic caching is brittle in this setting, while full regeneration repeats the same evidence organization, localization, and action-selection work. This paper presents TracePatch, a backend-agnostic reuse layer for LLM-based cloud incident remediation. TracePatch stores prior agent outputs as ordered trace blocks, retrieves a similar prior incident request, verifies each of its blocks against the new log evidence, reuses the blocks that pass verification, and selectively regenerates only the failing suffix or the structured action block. The design combines evidence-aware trace verification, conservative skip-reuse under semantic drift, and final structured-output validation of the root-cause and remediation fields. We evaluate TracePatch on a reproducible controlled benchmark built from the public LogHub HDFS dataset. Across 720 replayed evaluation requests over three random seeds, TracePatch reduces the mean latency proxy from 1.684 s to 0.939 s, reduces token usage from 118.5k to 93.3k tokens, and raises the final-check pass rate from 88.8% to 94.9%. The reuse-only path handles 54.2% of requests, 21.7% require selective patching, and 24.2% trigger skip-reuse under stronger evidence or root-cause changes. These results indicate that trace-level reuse can reduce LLM serving cost for operational agents while preserving evidence-grounded correctness under localized perturbations.
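
The verify, reuse, and regenerate steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TraceBlock` schema, the token-matching verifier, the `regenerate` callback, and the rule that maps "no verifiable prefix" to skip-reuse are all assumptions made for illustration. The sketch only shows the general shape of prefix reuse with suffix regeneration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TraceBlock:
    """One ordered step of a stored agent trace (hypothetical schema)."""
    name: str                      # e.g. "evidence", "localization", "action"
    text: str                      # cached block content
    required_evidence: List[str]   # tokens that must appear in the new logs


def verify_block(block: TraceBlock, log_evidence: str) -> bool:
    """Toy verifier (assumption): pass if every required token occurs in the logs."""
    return all(tok in log_evidence for tok in block.required_evidence)


def patch_trace(
    cached: List[TraceBlock],
    log_evidence: str,
    regenerate: Callable[[TraceBlock, str], TraceBlock],
) -> Tuple[List[TraceBlock], str]:
    """Reuse the longest verified prefix and regenerate the failing suffix.

    Returns (blocks, mode), where mode is "reuse", "patch", or "skip".
    """
    # Find the first block that fails verification against the new evidence.
    first_fail = len(cached)
    for i, block in enumerate(cached):
        if not verify_block(block, log_evidence):
            first_fail = i
            break

    if first_fail == len(cached):
        return list(cached), "reuse"        # every block verified

    prefix = list(cached[:first_fail])      # verified blocks are kept as-is
    suffix = [regenerate(b, log_evidence) for b in cached[first_fail:]]
    # Assumption: with no verifiable prefix, treat the request as skip-reuse.
    mode = "skip" if first_fail == 0 else "patch"
    return prefix + suffix, mode
```

In a real deployment, `regenerate` would call the LLM backend for the failing blocks, and the result would still pass through a final structured-output check before being returned. Here it is left as a caller-supplied function so the sketch stays independent of any backend.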
