
Testing and Debugging Skills

A practical guide to testing agent skills at every level: unit tests for logic, integration tests with mock LLM responses, snapshot tests for output stability, and debugging techniques for when things go wrong.

Agent skills sit at the intersection of deterministic code and non-deterministic AI behavior. The skill implementation itself is regular software that you can test with traditional methods, but the way an agent invokes that skill depends on the LLM’s interpretation of descriptions and parameters. This dual nature demands a layered testing strategy.

This guide walks through four levels of testing (unit, integration, snapshot, and end-to-end) plus debugging techniques for tracking down issues in production.

Level 1: unit testing skill logic

Unit tests verify that your skill’s core logic works correctly in isolation, without any agent involvement. This is the foundation. If the underlying function is broken, no amount of prompt engineering will save it.

Testing the handler function

Strip away the skill definition and test the raw handler:

import { describe, it, expect, beforeEach, afterEach } from "vitest";
import { createNote, NoteStore } from "./note-skill";

describe("createNote handler", () => {
  let store: NoteStore;

  beforeEach(() => {
    store = new NoteStore(":memory:"); // In-memory database for tests
  });

  afterEach(async () => {
    await store.close();
  });

  it("creates a note and returns its ID", async () => {
    const result = await createNote(store, {
      title: "Test Note",
      body: "Some content",
      tags: ["test"],
    });

    expect(result.success).toBe(true);
    expect(result.note.id).toBeDefined();
    expect(result.note.title).toBe("Test Note");
  });

  it("rejects empty titles", async () => {
    const result = await createNote(store, {
      title: "",
      body: "Content",
    });

    expect(result.success).toBe(false);
    expect(result.error).toContain("empty");
    expect(result.suggestion).toBeDefined();
  });

  it("rejects duplicate titles with existing note ID", async () => {
    await createNote(store, { title: "First", body: "Content" });
    const result = await createNote(store, {
      title: "First",
      body: "Different content",
    });

    expect(result.success).toBe(false);
    expect(result.existingNoteId).toBeDefined();
    expect(result.suggestion).toContain("update_note");
  });

  it("handles titles at the character limit", async () => {
    const longTitle = "A".repeat(200);
    const result = await createNote(store, {
      title: longTitle,
      body: "Content",
    });

    expect(result.success).toBe(true);
  });

  it("rejects titles exceeding the character limit", async () => {
    const tooLong = "A".repeat(201);
    const result = await createNote(store, {
      title: tooLong,
      body: "Content",
    });

    expect(result.success).toBe(false);
    expect(result.error).toContain("201");
    expect(result.error).toContain("200");
  });
});

What to test at the unit level

Category | What to verify | Example
Happy path | Correct input produces correct output | Create note returns ID and title
Input validation | Invalid inputs produce clear errors | Empty title is rejected with suggestion
Boundary conditions | Edge cases at limits | Title at exactly 200 chars succeeds, 201 fails
Error recovery info | Error responses include suggestions | Duplicate title error includes existing note ID
Default values | Optional params use correct defaults | Missing tags defaults to empty array
Idempotency | Repeated calls produce expected behavior | Second call with same title returns existing note
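
The last two rows (default values and idempotency) are easy to overlook. A minimal in-memory stand-in is enough to exercise them; makeCreateNote here is an illustration, not the real NoteStore:

```typescript
// Hypothetical in-memory stand-in for the note store, for illustration only.
type Note = { id: string; title: string; body: string; tags: string[] };

function makeCreateNote() {
  const notes = new Map<string, Note>();

  return (params: { title: string; body: string; tags?: string[] }) => {
    // Idempotency: a second call with the same title points at the existing note
    const existing = Array.from(notes.values()).find(
      (n) => n.title === params.title,
    );
    if (existing) {
      return {
        success: false as const,
        existingNoteId: existing.id,
        suggestion: "Use update_note to modify the existing note.",
      };
    }

    const note: Note = {
      id: `note-${notes.size + 1}`,
      title: params.title,
      body: params.body,
      tags: params.tags ?? [], // Default values: missing tags become an empty array
    };
    notes.set(note.id, note);
    return { success: true as const, note };
  };
}
```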

Testing external dependencies

Skills often call external services like databases, APIs, and file systems. Use dependency injection to swap real implementations for test doubles:

// The skill accepts its dependencies as parameters
async function searchWeb(
  query: string,
  httpClient: HttpClient = defaultHttpClient,
) {
  try {
    const response = await httpClient.get(
      `https://api.search.com/v1/search?q=${encodeURIComponent(query)}`,
    );
    return formatResults(response.data);
  } catch (err) {
    // Surface failures as structured errors so the agent can recover
    return {
      success: false,
      error: err instanceof Error ? err.message : String(err),
      suggestion: "The search service may be unavailable; try again shortly.",
    };
  }
}

// In tests, inject a mock
it("formats search results correctly", async () => {
  const mockClient = {
    get: async () => ({
      data: {
        results: [
          { title: "Result 1", url: "https://example.com", snippet: "..." },
        ],
      },
    }),
  };

  const result = await searchWeb("test query", mockClient);
  expect(result.matches).toHaveLength(1);
  expect(result.matches[0]).toHaveProperty("title", "Result 1");
});

it("handles API errors gracefully", async () => {
  const failingClient = {
    get: async () => {
      throw new Error("Connection refused");
    },
  };

  const result = await searchWeb("test query", failingClient);
  expect(result.success).toBe(false);
  expect(result.suggestion).toContain("try again");
});

Level 2: integration testing with mock LLM responses

Unit tests verify your code. Integration tests verify that your skill definition, including the name, description, and parameter schema, works correctly when an agent-like caller constructs invocations.

Simulating agent invocations

Create a test harness that simulates how an agent would call your skill based on a natural language request:

import { describe, it, expect } from "vitest";
import { resolveSkillCall } from "./test-harness";
import { noteSkillDefinition } from "./note-skill";

describe("note skill invocation patterns", () => {
  // These tests verify that realistic agent-constructed
  // parameters are handled correctly

  it("handles a well-formed create request", async () => {
    // Simulate what an agent would construct
    const agentParams = {
      title: "Meeting Notes - March 26",
      body: "## Attendees\n- Alice\n- Bob\n\n## Action Items\n- Follow up on Q2 plan",
      tags: ["meetings", "q2-planning"],
    };

    const result = await resolveSkillCall(noteSkillDefinition, agentParams);
    expect(result.success).toBe(true);
  });

  it("handles minimal parameters", async () => {
    // Agent provides only required fields
    const result = await resolveSkillCall(noteSkillDefinition, {
      title: "Quick note",
      body: "Remember to check the logs",
    });
    expect(result.success).toBe(true);
    expect(result.note.tags).toEqual([]); // Default applied
  });

  it("handles unexpected extra parameters gracefully", async () => {
    // Agents sometimes hallucinate extra parameters
    const result = await resolveSkillCall(noteSkillDefinition, {
      title: "Note",
      body: "Content",
      priority: "high", // Not in our schema
      notebook: "personal", // Not in our schema
    });
    // Should succeed, ignoring unknown parameters
    expect(result.success).toBe(true);
  });
});
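
The resolveSkillCall harness imported above isn't shown in this guide. One way to sketch it, assuming skills declare a simple parameter map with optional defaults, is to drop unknown keys, apply declared defaults, and check required fields before dispatching to the handler:

```typescript
// Hypothetical harness sketch; the real resolveSkillCall may differ.
type SkillDefinition = {
  name: string;
  parameters: Record<string, { type: string; default?: unknown }>;
  required: string[];
  handler: (params: Record<string, unknown>) => Promise<unknown>;
};

async function resolveSkillCall(
  skill: SkillDefinition,
  rawParams: Record<string, unknown>,
) {
  const params: Record<string, unknown> = {};
  for (const [name, spec] of Object.entries(skill.parameters)) {
    if (name in rawParams) {
      params[name] = rawParams[name]; // Known parameter: pass through
    } else if (spec.default !== undefined) {
      params[name] = spec.default; // Apply the declared default
    }
  }
  // Keys in rawParams that are not in the schema are silently dropped above,
  // which is what the "extra parameters" test expects.
  const missing = skill.required.filter((name) => !(name in params));
  if (missing.length > 0) {
    return {
      success: false,
      error: `Missing required parameters: ${missing.join(", ")}`,
    };
  }
  return skill.handler(params);
}
```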

Testing parameter type coercion

Agents sometimes send parameters with slightly wrong types. A number might arrive as a string, or a boolean as the string “true”. Decide whether your skill should handle this gracefully or reject it, and test accordingly:

it("handles numeric string for max_results", async () => {
  // Agent sends "10" instead of 10
  const result = await resolveSkillCall(searchSkillDefinition, {
    query: "handleSubmit",
    max_results: "10", // String instead of number
  });

  // Decide: should this work or fail with guidance?
  // If you coerce: verify it works
  expect(result.matches.length).toBeLessThanOrEqual(10);

  // If you reject: verify the error is helpful
  // expect(result.error).toContain("must be a number");
});
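
If you choose to coerce, the normalization can live in one small helper that runs before schema validation. This is a hypothetical sketch, not part of any framework:

```typescript
// Normalize common agent type slips before validating. Values that cannot be
// coerced are returned unchanged so validation can still reject them clearly.
function coerceParam(
  value: unknown,
  expectedType: "number" | "boolean" | "string",
): unknown {
  if (expectedType === "number" && typeof value === "string") {
    const n = Number(value);
    return Number.isNaN(n) ? value : n; // "10" -> 10; "abc" stays a string
  }
  if (expectedType === "boolean" && typeof value === "string") {
    if (value === "true") return true;
    if (value === "false") return false;
  }
  return value;
}
```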

Testing the description’s effectiveness

You can’t unit-test whether an LLM will select the right skill, but you can create scenario-based tests that document expected behavior:

/**
 * Scenario tests document which skill should be selected
 * for common user requests. These serve as living documentation
 * and regression tests when you modify descriptions.
 */
const selectionScenarios = [
  {
    request: "Save these meeting notes",
    expectedSkill: "create_note",
    reason: "Creating new content -> create_note",
  },
  {
    request: "What did we discuss in last week's meeting?",
    expectedSkill: "search_notes",
    reason: "Searching existing content -> search_notes",
  },
  {
    request: "Update the project roadmap note with Q3 goals",
    expectedSkill: "update_note",
    reason: "Modifying existing content -> update_note",
  },
  {
    request: "Create a new file called roadmap.md",
    expectedSkill: "write_file",
    reason: "Creating a file (not a note) -> write_file",
  },
];

Even if you run these manually against a real LLM rather than in CI, documenting them forces you to think about the boundaries between skills.
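
A small runner makes these scenarios executable whenever you do have an LLM at hand. selectSkill is an assumption here: any function, LLM-backed or heuristic, that picks a skill name for a request:

```typescript
// Hypothetical scenario runner; the Scenario shape matches the list above.
type Scenario = { request: string; expectedSkill: string; reason: string };

async function runSelectionScenarios(
  scenarios: Scenario[],
  selectSkill: (request: string) => Promise<string>,
) {
  const failures: string[] = [];
  for (const s of scenarios) {
    const actual = await selectSkill(s.request);
    if (actual !== s.expectedSkill) {
      failures.push(
        `"${s.request}": expected ${s.expectedSkill}, got ${actual} (${s.reason})`,
      );
    }
  }
  return { passed: scenarios.length - failures.length, failures };
}
```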

Level 3: snapshot testing for output stability

Agents downstream of your skill depend on the output shape. If you accidentally rename a field from totalCount to total_count, every agent workflow that references that field breaks silently. Snapshot tests catch these regressions.

Response shape snapshots

import { describe, it, expect } from "vitest";

describe("search skill output stability", () => {
  it("matches the expected response shape", async () => {
    const result = await searchFiles("**/*.ts", testDir);

    // Snapshot the shape, not the values
    expect(Object.keys(result).sort()).toMatchInlineSnapshot(`
      [
        "matches",
        "searchPath",
        "totalMatches",
        "truncated",
      ]
    `);

    // Snapshot the match object shape
    if (result.matches.length > 0) {
      expect(Object.keys(result.matches[0]).sort()).toMatchInlineSnapshot(`
        [
          "modified",
          "path",
        ]
      `);
    }
  });

  it("matches the expected error shape", async () => {
    const result = await searchFilesWithErrorHandling("", testDir);

    expect(Object.keys(result).sort()).toMatchInlineSnapshot(`
      [
        "error",
        "success",
        "suggestion",
      ]
    `);
  });
});

Schema validation tests

For stronger guarantees, validate responses against a JSON schema:

import Ajv from "ajv";

const successSchema = {
  type: "object",
  required: ["matches", "totalMatches", "truncated", "searchPath"],
  properties: {
    matches: {
      type: "array",
      items: {
        type: "object",
        required: ["path", "modified"],
        properties: {
          path: { type: "string" },
          modified: { type: "string" },
        },
      },
    },
    totalMatches: { type: "number" },
    truncated: { type: "boolean" },
    searchPath: { type: "string" },
  },
  additionalProperties: false,
};

it("success response matches the published schema", async () => {
  const ajv = new Ajv();
  const validate = ajv.compile(successSchema);
  const result = await searchFiles("**/*.ts", testDir);

  const valid = validate(result);
  if (!valid) {
    // Print detailed validation errors
    console.error(validate.errors);
  }
  expect(valid).toBe(true);
});

Level 4: end-to-end agent testing

The real test is whether an agent can use your skill to accomplish a task. These tests are slower and less deterministic, but they catch issues that no other level can.

Scripted agent scenarios

import { describe, it, expect } from "vitest";
import { createTestAgent } from "./test-helpers";

describe("note-taking workflow", () => {
  it("agent can create, search, and update a note", async () => {
    const agent = createTestAgent({
      skills: [createNoteSkill, searchNotesSkill, updateNoteSkill],
    });

    // Step 1: Create
    const createResult = await agent.execute(
      "Save a note titled 'Project Alpha Kickoff' with the body " +
        "'Meeting scheduled for April 1. Invite engineering team.'",
    );
    expect(createResult.skillCalls).toContainEqual(
      expect.objectContaining({ name: "create_note" }),
    );

    // Step 2: Search
    const searchResult = await agent.execute(
      "Find my notes about Project Alpha",
    );
    expect(searchResult.skillCalls).toContainEqual(
      expect.objectContaining({ name: "search_notes" }),
    );

    // Step 3: Update
    const updateResult = await agent.execute(
      "Update the Project Alpha Kickoff note to change the date to April 5",
    );
    expect(updateResult.skillCalls).toContainEqual(
      expect.objectContaining({ name: "update_note" }),
    );
  }, 30000); // Longer timeout for agent tests
});

Handling non-determinism

Agent tests are inherently non-deterministic. Handle this with:

  1. Set the LLM temperature to 0 when the provider supports it, to minimize variation.
  2. Assert on skill selection, not exact parameters; the agent may phrase the query differently on each run.
  3. Retry flaky assertions, but keep the retry count low (2-3 attempts).
  4. Separate deterministic from non-deterministic tests, and run agent tests on a different schedule than unit tests.
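
The retry advice above can be a ten-line helper rather than a framework feature. A sketch:

```typescript
// Retry a flaky async assertion a bounded number of times, rethrowing the
// last failure so the test still reports a real error message.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err; // Keep the most recent failure for reporting
    }
  }
  throw lastError;
}
```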

Debugging techniques

When a skill misbehaves in production, you need visibility into what happened. Build debugging affordances into your skills from the start.

Structured logging

Log every skill invocation with enough context to reconstruct what happened:

import { logger } from "./logger";

async function executeSkill(name: string, params: unknown) {
  const invocationId = crypto.randomUUID();
  const startTime = Date.now();

  logger.info({
    event: "skill_invocation_start",
    invocationId,
    skill: name,
    params: sanitizeForLogging(params), // Remove sensitive fields
  });

  try {
    const result = await handlers[name](params);
    const duration = Date.now() - startTime;

    logger.info({
      event: "skill_invocation_complete",
      invocationId,
      skill: name,
      duration,
      success: result.success !== false,
      resultSize: JSON.stringify(result).length,
    });

    return result;
  } catch (err) {
    const duration = Date.now() - startTime;
    // In TypeScript the caught value is unknown, so normalize it first
    const error = err instanceof Error ? err : new Error(String(err));

    logger.error({
      event: "skill_invocation_error",
      invocationId,
      skill: name,
      duration,
      error: error.message,
      stack: error.stack,
    });

    throw err;
  }
}

Sanitizing logs

Never log sensitive data. Create a sanitization function that strips known sensitive fields:

function sanitizeForLogging(params: Record<string, unknown>) {
  const sensitiveFields = [
    "password",
    "token",
    "secret",
    "api_key",
    "authorization",
    "cookie",
    "credit_card",
  ];

  const sanitized = { ...params };
  for (const field of sensitiveFields) {
    if (field in sanitized) {
      sanitized[field] = "[REDACTED]";
    }
  }
  return sanitized;
}
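
Note that this version only inspects top-level keys; a token nested inside params.auth.token would slip through. A recursive variant, sketched here as an illustration, handles nested objects and arrays:

```typescript
// Recursively redact sensitive keys at any depth. The set of sensitive field
// names is passed in lowercase; keys are lowercased before matching.
function deepSanitize(value: unknown, sensitive: Set<string>): unknown {
  if (Array.isArray(value)) {
    return value.map((v) => deepSanitize(v, sensitive));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, v] of Object.entries(value)) {
      out[key] = sensitive.has(key.toLowerCase())
        ? "[REDACTED]"
        : deepSanitize(v, sensitive);
    }
    return out;
  }
  return value; // Primitives pass through unchanged
}
```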

Invocation replay

For tricky debugging scenarios, capture the full invocation context so you can replay it locally:

import { appendFile, readFile } from "node:fs/promises";

// In production: capture invocations to a replay log
async function captureInvocation(
  skill: string,
  params: unknown,
  result: unknown,
) {
  if (process.env.CAPTURE_REPLAYS === "true") {
    const replay = {
      timestamp: new Date().toISOString(),
      skill,
      params,
      result,
    };
    await appendFile(
      "/var/log/skill-replays.jsonl",
      JSON.stringify(replay) + "\n",
    );
  }
}

// Locally: replay a captured invocation
async function replayInvocation(replayFile: string, lineNumber: number) {
  const lines = (await readFile(replayFile, "utf-8")).split("\n");
  const replay = JSON.parse(lines[lineNumber]);

  console.log("Replaying:", replay.skill, replay.params);
  const result = await handlers[replay.skill](replay.params);
  console.log("Expected:", replay.result);
  console.log("Got:", result);
}

Common debugging scenarios

The agent never selects the skill. Check the description. Does it match the scenarios you expect? Try listing all available skill descriptions and reading them as an LLM would. Often the problem is that another skill’s description sounds like a better match.

The agent selects the skill but passes wrong parameters. Check parameter names and descriptions. Are there ambiguous names? Are examples provided? Try adding a concrete example to the parameter description.

The skill returns success but the agent seems confused. Check the output format. Is the response structured or freeform text? Does it include enough metadata for the agent to decide what to do next? A common issue is a missing truncated flag, so the agent doesn’t know there are more results.
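
That truncation metadata costs little to include. A minimal sketch of a response builder that always reports it (buildSearchResponse is illustrative, not a real API):

```typescript
// Always report the full match count and whether results were cut off, so the
// agent knows when to narrow the query or ask for more.
function buildSearchResponse(allMatches: string[], limit: number) {
  return {
    matches: allMatches.slice(0, limit),
    totalMatches: allMatches.length, // Full count, not just the returned count
    truncated: allMatches.length > limit,
  };
}
```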

The skill works locally but fails in production. Check environment differences: file permissions, network access, environment variables, database connections. Add structured logging around the failure point and check the logs.

CI pipelines for skill validation

Skills should be validated in CI on every change. Here’s a pipeline structure that covers the deterministic levels plus schema validation; end-to-end agent tests are better run on a separate schedule, as noted above:

# .github/workflows/skill-tests.yml
name: Skill Tests

on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:unit

  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:integration

  snapshot-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:snapshots
      # Fail if snapshots need updating (prevents accidental changes)
      - run: git diff --exit-code -- '*.snap'

  schema-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run validate:schemas
      # Verify all skill definitions are valid JSON Schema
      - run: npm run validate:skill-definitions

What CI should catch

Check | Purpose
Unit tests pass | Core logic is correct
Integration tests pass | Skills handle realistic agent invocations
Snapshots match | Output shape hasn’t changed unintentionally
Schema validation | Skill definitions are valid and complete
Lint pass | Code quality standards are met
Description length check | Descriptions are not too terse (under 50 chars) or too long (over 2000 chars)
Required field check | All skills have name, description, and parameters

Adding a description quality gate

You can lint skill descriptions to enforce minimum quality:

// scripts/validate-descriptions.ts
import { allSkillDefinitions } from "../src/skills";

for (const skill of allSkillDefinitions) {
  const desc = skill.description;

  if (desc.length < 50) {
    console.error(
      `${skill.name}: description too short (${desc.length} chars, min 50)`,
    );
    process.exit(1);
  }

  if (desc.length > 2000) {
    console.error(
      `${skill.name}: description too long (${desc.length} chars, max 2000)`,
    );
    process.exit(1);
  }

  if (!desc.toLowerCase().includes("use this when")) {
    console.warn(`${skill.name}: description missing "Use this when" guidance`);
  }

  if (!desc.toLowerCase().includes("do not use")) {
    console.warn(`${skill.name}: description missing "Do not use" guidance`);
  }

  if (!desc.toLowerCase().includes("return")) {
    console.warn(`${skill.name}: description does not describe return format`);
  }
}

Summary

Testing agent skills requires thinking at multiple levels:

  1. Unit tests verify the logic works correctly in isolation
  2. Integration tests verify realistic agent invocations are handled well
  3. Snapshot tests protect against accidental output shape changes
  4. End-to-end tests verify the skill works within a real agent workflow

Combine this with structured logging, invocation replay, and CI validation to build skills you can deploy with confidence. For the security side of skill deployment, continue to Security Considerations for Agent Skills.