The Level 1 SRE Agent: Autonomous FinOps Remediation with Qwen3-Max and OOS

If your organization is like most mature cloud adopters, your FinOps dashboards are a masterpiece of visibility. You have granular cost allocation, predictive forecasting, and real-time anomaly detection. Yet, at the end of every month, your cloud bill remains stubbornly high.

Why? Because visibility is not remediation.

We have successfully engineered alert fatigue into our FinOps practices. When a Slack or DingTalk channel is bombarded with 50 automated messages a day stating “Idle ECS Instance Detected (i-bp123456…)”, the human response is universally identical: we mute the channel. The friction of investigating the instance, ensuring it isn’t a dormant disaster recovery node, snapshotting the disk, and tearing it down manually is simply too high for a busy Site Reliability Engineer (SRE).

In this tutorial, we are going to bridge the gap between visibility and action. We will architect a “Level 1 SRE Agent”—an autonomous FinOps remediation pipeline built on Alibaba Cloud. Instead of just alerting you to wasted spend, this agent leverages Qwen3-Max to write the infrastructure code to fix it, seeks human approval via DingTalk, and securely executes the teardown using Operation Orchestration Service (OOS).

1. The End of Passive FinOps: Why Alerts Fail

Traditional FinOps tooling operates on a read-only paradigm. Tools scan your environment, identify resources with <5% CPU utilization over 7 days or zero ingress traffic, and fire an alert.

The problem is the remediation gap. To act on an idle alert, an engineer must:

Context-switch from their current task.
Log into the Alibaba Cloud Console.
Verify the resource’s tags and dependencies.
Manually trigger a backup/snapshot (just in case).
Delete the resource.
Document the action for compliance.

This multi-step process guarantees that low-priority idle resources are ignored until a quarterly “cost-cutting sprint.” To fix this, we must shift from notification to orchestration. We need a system that prepares the execution environment so that the human only needs to make a binary decision: Approve or Reject.

2. Architecture Flow: From Telemetry to Teardown

Our architecture combines observability, generative AI, and deterministic infrastructure automation. Here is the lifecycle of our Level 1 SRE Agent:

Telemetry & Detection (SLS): Alibaba Cloud Simple Log Service (SLS) ingests VPC Flow Logs and CloudMonitor metrics. An SLS Alert triggers when an Elastic Compute Service (ECS) instance shows zero external traffic and <1% CPU usage for 7 consecutive days.
Event Routing (EventBridge): The SLS Alert publishes an event to Alibaba Cloud EventBridge, which filters and routes the payload to a Function Compute (FC) instance.
AI Orchestration (Qwen3-Max): The FC instance invokes the Qwen3-Max LLM API. The AI analyzes the payload, identifies the instance, checks its tags, and drafts a custom Operation Orchestration Service (OOS) template tailored to safely deprecate this specific resource.
The Circuit Breaker (DingTalk): The drafted OOS execution is created but immediately paused. A rich, interactive DingTalk card is sent to the FinOps/SRE Lead, detailing the AI’s findings, the projected cost savings, and the generated OOS template.
Deterministic Execution (OOS): Upon human approval (via a cryptographic webhook from DingTalk), the OOS state machine resumes, taking a snapshot of the ECS disks and subsequently terminating the instance.

3. Implementation Details: Routing and Orchestration

The magic of this architecture lies not just in the LLM, but in how we constrain the LLM using Alibaba Cloud Operation Orchestration Service (OOS). OOS is a serverless, stateful execution engine for automated operations. By forcing Qwen3-Max to output an OOS template rather than executing raw API calls, we ensure idempotency, state tracking, and strict IAM boundaries.

Step A: EventBridge Routing

First, we define an EventBridge rule that listens for our specific SLS FinOps alerts and routes them to our AI function.

JSON

{
  "source": ["acs.sls"],
  "type": ["sls.alert.trigger"],
  "data": {
    "alertName": ["FinOps-Idle-ECS-Detector"],
    "severity": ["High"]
  }
}

When this event hits Function Compute, the Python handler extracts the instance ID (i-bp1...) and region, passing it to Qwen3-Max with a strict system prompt: “You are an SRE agent. Generate a valid Alibaba Cloud OOS YAML template to snapshot the system disk and delete the provided ECS instance. Output only valid YAML.”

Step B: The AI-Generated OOS Template

Qwen3-Max is remarkably adept at writing Alibaba Cloud infrastructure code. For an idle instance, it will generate a customized OOS template. Let’s examine a typical output generated by the AI:

YAML

FormatVersion: OOS-2019-06-01
Description: "Automated FinOps Remediation: Snapshot and Terminate Idle ECS"
Parameters:
  InstanceId:
    Type: String
    Description: "The ID of the idle ECS instance to terminate."
    Default: "i-bp1abcd1234efgh5678"
  OOSAssumeRole:
    Type: String
    Description: "The RAM role OOS will assume to execute this task."
    Default: "acs:ram::1234567890123456:role/FinOpsOOSExecutionRole"

Tasks:
  - Name: CheckInstanceStatus
    Action: ACS::ExecuteAPI
    Description: "Verify the instance is actually running or stopped."
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds: '["{{ InstanceId }}"]'
    Outputs:
      Status:
        Type: String
        ValueSelector: ".Instances.Instance[0].Status"

  - Name: CreateDiskSnapshot
    Action: ACS::ECS::CreateSnapshot
    Description: "Take a safety snapshot of the system disk before deletion."
    Properties:
      InstanceId: "{{ InstanceId }}"
      Tags:
        - Key: "AutomatedBackup"
          Value: "FinOpsAgent"
    Outputs:
      SnapshotId:
        Type: String
        ValueSelector: ".SnapshotId"

  - Name: DeleteIdleInstance
    Action: ACS::ECS::DeleteInstance
    Description: "Terminate the instance to stop billing."
    Properties:
      InstanceId: "{{ InstanceId }}"
      Force: true
    DependsOn: CreateDiskSnapshot

Why OOS is the Secret Weapon Here:

Notice what is happening in the YAML. OOS inherently handles dependency mapping (DependsOn: CreateDiskSnapshot). If the snapshot fails due to an API quota limit or a locked disk, the DeleteIdleInstance task never executes.

If we allowed the AI to just write a Python script using the Alibaba Cloud SDK, managing those state transitions, error retries, and backoffs would be a nightmare. OOS abstracts the state machine away from the AI, providing a secure, declarative guardrail.

4. The “MVP” Failure Mode: Guarding Against Hallucinations

As a Senior Architect, you know that placing an LLM directly in charge of infrastructure deletion is a resume-generating event. What happens if Qwen3-Max hallucinates, misinterprets a metric, and decides that your primary Production PostgreSQL node is “idle” because it’s waiting on a long-polling queue?

This is the “MVP” (Minimum Viable Product) failure mode of AI automation. To prevent catastrophic blast radius, we implement a Human-in-the-Loop (HITL) Circuit Breaker.

The DingTalk Cryptographic Webhook Architecture

We do not allow the initial Function Compute instance to execute the OOS template. Instead, it registers the template with OOS, creates an execution, and immediately pauses it.

The architecture relies on the ACS::Approval action natively built into OOS, combined with DingTalk’s interactive messaging capabilities.

Pause State: The OOS execution reaches an ACS::Approval task. OOS halts and waits for an external callback.
DingTalk Notification: Function Compute pushes an interactive ActionCard to the Senior Engineer’s DingTalk group. The card contains:
- Resource: ECS i-bp1abcd... (Web-Frontend-Test)
- Reason: 0 bytes network ingress for 168 hours.
- AI Action: Snapshot Disk d-bp1... and Terminate.
- Monthly Savings: $145.00
- Buttons: [Approve and Terminate] | [Reject and Keep]
Cryptographic Validation: When the engineer taps Approve, DingTalk sends an HTTP POST request to an API Gateway endpoint.

To prevent replay attacks or unauthorized approvals, this webhook must be cryptographically verified. The payload from DingTalk includes a timestamp and a signature generated using your DingTalk App Secret.

Here is how the verification logic looks in our backend Function Compute handler that processes the DingTalk webhook:

Python

import hmac
import hashlib
import base64
import time

def verify_dingtalk_signature(timestamp, sign, secret):
    # Prevent replay attacks by checking timestamp freshness (e.g., within 5 mins)
    current_time = int(round(time.time() * 1000))
    if abs(current_time - int(timestamp)) > 300000:
        return False
        
    # Construct the string to sign
    string_to_sign = f"{timestamp}\n{secret}"
    
    # Generate HMAC-SHA256 signature
    hmac_code = hmac.new(
        secret.encode('utf-8'), 
        string_to_sign.encode('utf-8'), 
        digestmod=hashlib.sha256
    ).digest()
    
    # Base64 encode the result
    calculated_sign = base64.b64encode(hmac_code).decode('utf-8')
    
    return calculated_sign == sign

Once the signature is verified and the user’s DingTalk ID is checked against a hardcoded list of authorized FinOps approvers, the function makes a single SDK call to Alibaba Cloud OOS: NotifyExecution.

Python

# Alibaba Cloud Python SDK snippet to resume OOS
request = NotifyExecutionRequest.NotifyExecutionRequest()
request.set_ExecutionId(oos_execution_id)
request.set_NotifyType("Complete")
request.set_NotifyStatus("Approved")
response = client.do_action_with_exception(request)

At this exact moment, the circuit breaker closes. OOS resumes execution, the snapshot is taken, the instance is destroyed, and the human engineer never had to leave their chat window.

5. Conclusion: From Visibility to Autonomy

The evolution of Cloud Infrastructure Management is moving rapidly from passive dashboards to active automation. By combining the deep contextual reasoning of Qwen3-Max with the deterministic, robust orchestration capabilities of Alibaba Cloud OOS, we fundamentally change the economics of cloud management.

We no longer rely on engineers to hunt down $50/month idle instances manually—a task that historically cost more in engineering hours than it saved in infrastructure bills. Instead, the Level 1 SRE Agent does the heavy lifting: analyzing the metrics, formulating the remediation plan, writing the code, and preparing the execution state.

The Senior Engineer is elevated from a manual operator to an approver of automated intelligence.

By enforcing strict IAM roles within OOS, leveraging declarative YAML to prevent logical runaways, and securing the pipeline with cryptographic DingTalk webhooks, we achieve the holy grail of modern FinOps: Autonomy without compromising safety.