Introduction to Self Healing AI Agent recipe
The Self Healing AI Agent autonomously identifies failed and suspended taskflows in data pipelines, diagnoses the underlying issues, and applies corrective actions to resolve them. If the agent can't fix the problem, it creates a ServiceNow incident and sends an email containing the error message and relevant logs to the appropriate teams.
The agent performs the following functions:
• As a specialized job monitoring agent, it tracks and diagnoses job failures based on Standard Operating Procedures (SOPs). The agent ensures security by always using session tokens for authentication and handles all timestamps in UTC to avoid timezone inconsistencies.
• The agent identifies the origin and context of incoming messages, whether from a chat invocation or an email. It detects HTML content to distinguish emails and adapts its workflow accordingly.
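The HTML-based distinction between email and chat messages can be illustrated with a minimal Python sketch. The function name detect_message_source and the list of tags it looks for are hypothetical and chosen only for illustration; the agent's actual detection logic is not described here.

```python
import re

# Hypothetical heuristic (an assumption, not the agent's real rule):
# a message containing common HTML tags is treated as an email.
_HTML_TAG = re.compile(
    r"<\s*(html|body|div|p|br|table|span|a)\b[^>]*>", re.IGNORECASE
)


def detect_message_source(message: str) -> str:
    """Return "email" if the message contains HTML markup, otherwise "chat"."""
    return "email" if _HTML_TAG.search(message) else "chat"
```

For example, a message body wrapped in html and body tags would be classified as an email, while a plain sentence typed into a chat window would be classified as chat.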
Using a variety of integrated tools, the agent executes each necessary step without requiring manual confirmation unless essential input is missing. It starts by authenticating with the GetLoginToken tool. It then either retrieves failed or suspended taskflows using TaskflowStatusRetriever or, when job details are provided, gathers detailed error information using ErrorAnalyserFlow.
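The opening steps described above can be sketched in Python as follows. The three callables are stand-ins for GetLoginToken, TaskflowStatusRetriever, and ErrorAnalyserFlow; their signatures and the start_run function are assumptions made for this illustration, not the tools' real interfaces.

```python
from typing import Callable, Optional, Tuple


def start_run(
    get_login_token: Callable[[], str],
    retrieve_taskflows: Callable[[str], list],
    analyse_error: Callable[[str, dict], dict],
    job_details: Optional[dict] = None,
) -> Tuple[str, object]:
    """Authenticate first, then pick the tool that matches the input."""
    # Stand-in for GetLoginToken: every later call uses this session token.
    token = get_login_token()
    if job_details:
        # Job details were supplied, so go straight to error analysis
        # (stand-in for ErrorAnalyserFlow).
        return "ErrorAnalyserFlow", analyse_error(token, job_details)
    # No job details: list failed or suspended taskflows
    # (stand-in for TaskflowStatusRetriever).
    return "TaskflowStatusRetriever", retrieve_taskflows(token)
```

Passing the tools in as callables keeps the sketch independent of how each tool is actually invoked.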
Before proceeding with remediation, the agent runs the CheckSuccessfulRun tool to check whether a successful execution has occurred since the failure, which avoids redundant actions. It then consults the SOPAnalyst agent to find the precise resolution steps and contact information specific to each failed taskflow, and it strictly follows those procedures.
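The idea behind the successful-run check can be sketched in Python as a comparison of UTC timestamps. The helper names to_utc and has_succeeded_since are hypothetical, and the sketch assumes that timestamps without timezone information are already in UTC; the internals of CheckSuccessfulRun are not described here.

```python
from datetime import datetime, timezone
from typing import Iterable


def to_utc(ts: datetime) -> datetime:
    """Normalise a timestamp to UTC (naive values are assumed to be UTC)."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def has_succeeded_since(
    failure_time: datetime, success_times: Iterable[datetime]
) -> bool:
    """Return True if any successful run finished strictly after the failure."""
    failed_at = to_utc(failure_time)
    return any(to_utc(s) > failed_at for s in success_times)
```

If the function returns True, remediation would be redundant and can be skipped.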
The ValidationAgent performs required pre-checks, such as file existence verification or database connection tests described in the SOP. After validation succeeds, the agent uses ExtCommunicationAgent to notify stakeholders through email and RemediationAgent to restart taskflows or perform other corrective steps.
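The ordering of validation before notification and remediation can be illustrated with a small Python sketch. The functions file_exists and run_prechecks are hypothetical, and each check is modelled as a callable that returns True on success; the actual ValidationAgent interface may differ.

```python
from pathlib import Path
from typing import Callable, Dict, List


def file_exists(path: str) -> bool:
    """Example pre-check: the expected input file is present."""
    return Path(path).is_file()


def run_prechecks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated as failed.
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

Under this sketch, notification and remediation would proceed only when run_prechecks returns an empty list.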
When SOP instructions specify escalation without automatic remediation, the agent raises a ServiceNow incident using ServiceNowAgent, providing all necessary details, such as assignment group SYSIDs, error messages, and job names. It retrieves and shares the incident number and link in its final report.
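The details passed on escalation can be sketched as a simple Python payload builder. The function build_incident_payload is hypothetical. The keys short_description, description, and assignment_group follow the column names of the ServiceNow incident table, but how ServiceNowAgent maps its inputs to those fields is an assumption.

```python
from typing import Dict


def build_incident_payload(
    job_name: str, error_message: str, assignment_group_sys_id: str
) -> Dict[str, str]:
    """Collect the details needed to raise an incident for a failed job."""
    if not assignment_group_sys_id:
        # Without an assignment group the incident cannot be routed.
        raise ValueError("assignment group sys_id is required")
    return {
        "short_description": f"Taskflow failure: {job_name}",
        "description": error_message,
        "assignment_group": assignment_group_sys_id,
    }
```

Rejecting an empty assignment group early reflects the rule that the agent asks for input only when something essential is missing.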
This structured process allows the agent to manage taskflow failures proactively, securely, and in compliance with SOPs. It provides a detailed summary of all actions taken and their outcomes to ensure effective monitoring and resolution.