SystemsW0915 min read

Planning and Task Decomposition

From classical planning (STRIPS, PDDL) to LLM-based natural language planning. Hierarchical task decomposition, plan validation and refinement, dynamic replanning when execution drifts. Integration with the architectures from week 5.

Core conceptsHierarchical decompositionPlan validationDynamic replanning

01Learning Objectives

By the end of this lecture, students will be able to:

Trace the evolution of AI planning from classical approaches (STRIPS, PDDL) to LLM-based natural language planning.
Explain hierarchical task decomposition and its advantages for complex agent tasks.
Design plan generation, validation, and refinement pipelines for LLM-based agents.
Implement dynamic replanning strategies that allow agents to adapt when execution deviates from the plan.
Construct task graphs with dependency management for parallel and sequential execution.
Compare modern planning approaches including Tree of Thoughts and Language Agent Tree Search (LATS).
Build a task decomposition agent in Python that breaks complex goals into executable subtasks.

021. Classical AI Planning

A Brief History

Planning has been a central topic in AI since the field's inception -- arguably, it is the original AI problem. The challenge: given a description of the world, a set of available actions, and a goal, find a sequence of actions that transforms the current world state into one that satisfies the goal.

Think about what planning requires. When you plan a road trip, you consider: where you are now, where you want to be, what routes are available, how long each takes, and what constraints you have (budget, time, preferences). You mentally simulate different sequences of actions and choose the one that best achieves your goal. AI planning formalizes this process.

Understanding classical planning is valuable even when working with modern LLM-based systems because it reveals what we gain (flexibility, generality) and what we lose (guarantees, optimality) when we move from formal to natural-language planning.

STRIPS (1971)

The Stanford Research Institute Problem Solver (STRIPS), developed by Fikes and Nilsson (1971), was one of the first automated planning systems. It introduced a formal representation that remains influential:

State: A set of logical predicates describing the world (e.g., on(A, B), clear(A), on_table(B)).
Actions: Defined by preconditions and effects.
- Preconditions: What must be true for the action to be applicable.
- Add effects: What becomes true after the action.
- Delete effects: What becomes false after the action.
Goal: A set of predicates that must be true in the final state.

text

Action: move(Block, From, To)
  Preconditions: on(Block, From), clear(Block), clear(To)
  Add effects:   on(Block, To), clear(From)
  Delete effects: on(Block, From), clear(To)

STRIPS planning works by searching through the space of possible states, applying actions whose preconditions are met, and checking whether the resulting state satisfies the goal.

PDDL (1998)

The Planning Domain Definition Language (PDDL), introduced by McDermott et al. (1998), standardized the representation of planning problems. PDDL separates the domain (what types of objects exist and what actions are possible) from the problem (what specific objects exist, what the initial state is, and what the goal is).

text

(define (domain logistics)
  (:requirements :strips)
  (:types city location package vehicle)

  (:action drive
    :parameters (?v - vehicle ?from - location ?to - location ?c - city)
    :precondition (and (in-city ?from ?c) (in-city ?to ?c) (at ?v ?from))
    :effect (and (at ?v ?to) (not (at ?v ?from)))
  )

  (:action load-package
    :parameters (?p - package ?v - vehicle ?l - location)
    :precondition (and (at ?v ?l) (at ?p ?l))
    :effect (and (in ?p ?v) (not (at ?p ?l)))
  )
)

PDDL has been extended over the years to support durative actions (with time), numeric fluents, preferences, and probabilistic effects. It remains the standard language for the International Planning Competition (IPC).

Strengths and Limitations of Classical Planning

Strengths:

Provably correct: if a plan is found, it is guaranteed to achieve the goal.
Complete: if a plan exists, the planner will find it (given enough time).
Optimal: many planners find plans that minimize some cost metric.

Limitations:

Requires complete, formal domain descriptions (which are hard to write for real-world tasks). Try writing a PDDL domain for "plan a birthday party" and you will appreciate this challenge.
Brittle: any discrepancy between the model and reality causes plan failure. If the restaurant you planned to eat at is closed, the entire plan breaks.
Does not handle uncertainty, partial observability, or continuous state spaces well.
Cannot handle open-ended, ill-defined goals ("write a good essay" is not a valid PDDL goal).

Key Insight: Classical planning gives you guarantees that LLM-based planning cannot match (correctness, completeness, optimality). But it requires you to formalize the world completely, which is impractical for most real-world tasks. LLM-based planning trades those guarantees for flexibility and generality. Understanding this trade-off is essential for choosing the right approach.

032. LLM-Based Planning: Natural Language Plans

The Paradigm Shift

LLM-based planning represents a radical departure from classical AI planning. Instead of formal state representations and action schemas, you describe the task in plain English and ask the LLM to generate a plan. No PDDL, no preconditions, no formal effects -- just a prompt.

The analogy is the difference between giving directions using formal coordinates ("Go to 40.7128N, 74.0060W") versus natural language ("Go to the corner of Broadway and Wall Street, then turn left"). Natural language is less precise but much more practical for most situations.

python

PLANNING_PROMPT = """You are a planning agent. Given a goal, generate a step-by-step
plan to achieve it. Each step should be:
- Specific and actionable
- Ordered logically (dependencies respected)
- Achievable with the available tools

Goal: {goal}

Available tools:
{tools}

Generate a numbered plan:"""


def generate_plan(goal: str, tools: list[str], llm_call) -> list[str]:
    """Generate a natural language plan for a goal.

    Args:
        goal: The goal to achieve.
        tools: List of available tool descriptions.
        llm_call: Function to call the LLM.

    Returns:
        List of plan steps as strings.
    """
    tools_str = "\n".join(f"- {tool}" for tool in tools)
    prompt = PLANNING_PROMPT.format(goal=goal, tools=tools_str)
    response = llm_call(prompt=prompt)

    # Parse numbered steps
    steps = []
    for line in response.strip().split("\n"):
        line = line.strip()
        if line and line[0].isdigit():
            # Remove the number prefix
            step = line.lstrip("0123456789.)")
            steps.append(step.strip())

    return steps

Advantages of LLM-Based Planning

No formal domain description needed: The LLM uses its trained knowledge to understand actions and their effects.
Handles ambiguity: Natural language goals like "make the report better" can be interpreted and planned for.
Flexible and general: The same planner works across domains without reconfiguration.
Incorporates world knowledge: The LLM knows that "booking a flight requires a passport" without being told.

Limitations of LLM-Based Planning

No correctness guarantees: The plan may contain logical errors, impossible steps, or missing dependencies.
Hallucinated actions: The LLM may suggest actions that are not actually available.
Poor at long-horizon planning: As plans get longer, LLMs struggle to maintain consistency and track dependencies.
No optimality: The generated plan is rarely optimal in terms of steps, cost, or time.

The Plan-and-Execute Pattern

A practical approach combines LLM planning with step-by-step execution:

python

class PlanAndExecuteAgent:
    """An agent that first plans, then executes step by step.

    This separates the planning phase (creative, high-level) from the
    execution phase (precise, tool-using), allowing each to be optimized
    independently.
    """

    def __init__(self, planner_llm, executor_llm, tools: dict[str, callable]):
        self.planner_llm = planner_llm
        self.executor_llm = executor_llm
        self.tools = tools

    def plan(self, goal: str) -> list[str]:
        """Generate a plan for the goal."""
        tools_desc = "\n".join(f"- {name}" for name in self.tools.keys())
        prompt = PLANNING_PROMPT.format(goal=goal, tools=tools_desc)
        response = self.planner_llm(prompt=prompt)
        return self._parse_steps(response)

    def execute_step(self, step: str, context: str) -> str:
        """Execute a single step of the plan.

        Args:
            step: The step to execute.
            context: Results from previous steps.

        Returns:
            The result of executing the step.
        """
        prompt = f"""Execute the following step of a plan.

Previous context:
{context}

Current step: {step}

Available tools: {list(self.tools.keys())}

Determine which tool to use (if any) and execute the step.
Respond with the result."""

        response = self.executor_llm(prompt=prompt)

        # Check if the executor wants to use a tool
        for tool_name, tool_fn in self.tools.items():
            if tool_name.lower() in response.lower():
                # Extract arguments and call the tool
                tool_result = tool_fn(step)
                return f"Tool '{tool_name}' result: {tool_result}"

        return response

    def run(self, goal: str) -> dict:
        """Plan and execute a goal.

        Returns:
            Dict with the plan, execution results, and final outcome.
        """
        # Phase 1: Plan
        plan = self.plan(goal)
        print(f"Plan generated with {len(plan)} steps:")
        for i, step in enumerate(plan, 1):
            print(f"  {i}. {step}")

        # Phase 2: Execute
        results = []
        context = ""
        for i, step in enumerate(plan):
            print(f"\nExecuting step {i + 1}: {step}")
            result = self.execute_step(step, context)
            results.append({"step": step, "result": result})
            context += f"\nStep {i + 1} ({step}): {result}"
            print(f"  Result: {result[:200]}")

        return {
            "goal": goal,
            "plan": plan,
            "results": results,
            "final_context": context,
        }

    def _parse_steps(self, response: str) -> list[str]:
        """Parse numbered steps from LLM response."""
        steps = []
        for line in response.strip().split("\n"):
            line = line.strip()
            if line and (line[0].isdigit() or line.startswith("-")):
                step = line.lstrip("0123456789.)- ")
                if step:
                    steps.append(step)
        return steps

043. Hierarchical Task Decomposition

Why Hierarchical Decomposition?

Complex tasks are difficult to plan in a single pass. This is true for humans as well as AI agents. If you ask someone to "build a web application for managing a restaurant," they do not immediately start writing code. They think at a high level first (what features do we need?), then drill down into specifics (how should the reservation system work?), and only then reach the level of individual coding tasks.

This is hierarchical decomposition: breaking a big problem into smaller problems, and those smaller problems into even smaller ones, until each piece is small enough to execute directly. It is how we manage complexity in software engineering (modules, functions), in organizations (departments, teams), and in everyday life (planning a wedding involves planning the venue, catering, invitations, etc., each of which breaks down further).

A goal like "build a web application for managing a restaurant" cannot be meaningfully decomposed into a flat list of atomic steps. It requires multiple levels of abstraction:

High level: Design the system architecture, build the frontend, build the backend, set up the database, deploy.
Medium level: For "build the frontend," decompose into: set up the project, create the layout, build the menu page, build the reservation system, add authentication.
Low level: For "build the menu page," decompose into: create the data model, build the API endpoint, create the React component, add styling, write tests.

Hierarchical Task Networks (HTN)

Hierarchical Task Networks are a classical AI planning formalism that matches this intuition. In HTN planning:

Primitive tasks are directly executable actions.
Compound tasks must be decomposed into subtasks using methods.
Methods define how a compound task can be decomposed, including ordering constraints between subtasks.

text

Compound Task: prepare_dinner
  Method 1: home_cooking
    Subtasks: [choose_recipe, buy_ingredients, cook, serve]
  Method 2: order_delivery
    Subtasks: [choose_restaurant, place_order, wait, receive]

LLM-Based Hierarchical Decomposition

LLMs naturally perform hierarchical decomposition when prompted appropriately:

python

DECOMPOSITION_PROMPT = """Break down the following task into 3-7 subtasks.
Each subtask should be:
- More specific than the parent task
- Independent enough to be worked on separately
- Small enough to be completed in a focused work session

If a subtask is still complex, mark it with [COMPLEX] so it can be
further decomposed.

Task: {task}

Subtasks:"""


def hierarchical_decompose(
    task: str, llm_call, max_depth: int = 3, current_depth: int = 0
) -> dict:
    """Recursively decompose a task into a hierarchy of subtasks.

    Args:
        task: The task to decompose.
        llm_call: Function to call the LLM.
        max_depth: Maximum decomposition depth.
        current_depth: Current depth in the recursion.

    Returns:
        A tree structure representing the task hierarchy.
    """
    if current_depth >= max_depth:
        return {"task": task, "subtasks": [], "leaf": True}

    response = llm_call(prompt=DECOMPOSITION_PROMPT.format(task=task))
    subtasks = _parse_subtasks(response)

    tree = {"task": task, "subtasks": [], "leaf": False}

    for subtask_text in subtasks:
        is_complex = "[COMPLEX]" in subtask_text
        clean_text = subtask_text.replace("[COMPLEX]", "").strip()

        if is_complex and current_depth < max_depth - 1:
            # Recursively decompose complex subtasks
            subtree = hierarchical_decompose(
                clean_text, llm_call, max_depth, current_depth + 1
            )
            tree["subtasks"].append(subtree)
        else:
            tree["subtasks"].append({
                "task": clean_text,
                "subtasks": [],
                "leaf": True,
            })

    return tree


def _parse_subtasks(response: str) -> list[str]:
    """Parse subtasks from LLM response."""
    subtasks = []
    for line in response.strip().split("\n"):
        line = line.strip()
        if line and (line[0].isdigit() or line.startswith("-")):
            subtask = line.lstrip("0123456789.)- ")
            if subtask:
                subtasks.append(subtask)
    return subtasks


def print_task_tree(tree: dict, indent: int = 0) -> None:
    """Pretty-print a task tree."""
    prefix = "  " * indent
    marker = "[leaf]" if tree["leaf"] else "[node]"
    print(f"{prefix}{marker} {tree['task']}")
    for subtask in tree.get("subtasks", []):
        print_task_tree(subtask, indent + 1)

054. Plan Validation and Refinement

Why Plans Need Validation

LLM-generated plans are not guaranteed to be correct. Common issues include:

Missing steps: Critical prerequisites are overlooked.
Wrong ordering: Steps that depend on each other are in the wrong order.
Infeasible steps: Steps that cannot be executed with available tools.
Redundant steps: Unnecessary actions that waste resources.
Vague steps: Steps that are too abstract to execute directly.

Plan Validation Strategies

Self-Validation

Ask the LLM to critique its own plan:

python

VALIDATION_PROMPT = """Review the following plan for achieving the stated goal.
Check for:
1. Missing steps: Are there any prerequisites or necessary actions not included?
2. Ordering errors: Are steps in the right order? Are dependencies respected?
3. Feasibility: Can each step be executed with the available tools?
4. Redundancy: Are there unnecessary or duplicate steps?
5. Specificity: Is each step specific enough to be actionable?

Goal: {goal}

Plan:
{plan}

Available tools: {tools}

Provide your assessment, listing any issues found and suggesting corrections."""


def validate_plan(
    goal: str, plan: list[str], tools: list[str], llm_call
) -> dict:
    """Validate a plan and identify issues.

    Returns:
        Dict with validation results and suggested fixes.
    """
    plan_str = "\n".join(f"{i+1}. {step}" for i, step in enumerate(plan))
    tools_str = ", ".join(tools)

    response = llm_call(
        prompt=VALIDATION_PROMPT.format(goal=goal, plan=plan_str, tools=tools_str)
    )

    return {
        "original_plan": plan,
        "validation_feedback": response,
        "needs_revision": any(
            keyword in response.lower()
            for keyword in ["missing", "error", "incorrect", "should be", "add"]
        ),
    }

Cross-Validation with a Critic Agent

Use a separate agent (or a separate LLM call with a different prompt) to review the plan. This provides a form of peer review:

python

CRITIC_PROMPT = """You are a plan reviewer. Your job is to find flaws in plans
before they are executed. Be thorough and skeptical.

Goal: {goal}
Plan:
{plan}

For each step, assess:
- Is this step necessary?
- Is it in the correct position?
- What could go wrong?
- Is it specific enough?

Rate the overall plan quality (1-10) and list all issues."""

Simulation-Based Validation

For plans that involve code execution or API calls, validate by simulating execution in a sandbox:

python

def simulate_plan_execution(
    plan: list[str], tools: dict[str, callable], dry_run: bool = True
) -> list[dict]:
    """Simulate plan execution to catch errors before real execution.

    Args:
        plan: List of plan steps.
        tools: Available tools.
        dry_run: If True, tools log what they would do without executing.

    Returns:
        List of simulation results for each step.
    """
    results = []
    state = {}

    for i, step in enumerate(plan):
        result = {
            "step": i + 1,
            "description": step,
            "status": "unknown",
            "issues": [],
        }

        # Check if the step references a known tool
        tool_found = False
        for tool_name in tools:
            if tool_name.lower() in step.lower():
                tool_found = True
                if dry_run:
                    result["status"] = "simulated"
                    result["output"] = f"Would execute: {tool_name}"
                else:
                    try:
                        output = tools[tool_name](step)
                        result["status"] = "success"
                        result["output"] = output
                    except Exception as e:
                        result["status"] = "error"
                        result["issues"].append(str(e))
                break

        if not tool_found:
            result["status"] = "no_tool"
            result["issues"].append("No matching tool found for this step.")

        results.append(result)

    return results

Plan Refinement

Based on validation feedback, refine the plan:

python

REFINEMENT_PROMPT = """Revise the following plan based on the feedback provided.
Make minimal changes to fix the identified issues while preserving the parts
that are correct.

Original plan:
{plan}

Feedback:
{feedback}

Revised plan:"""


def refine_plan(
    plan: list[str], feedback: str, llm_call
) -> list[str]:
    """Refine a plan based on validation feedback."""
    plan_str = "\n".join(f"{i+1}. {step}" for i, step in enumerate(plan))
    response = llm_call(
        prompt=REFINEMENT_PROMPT.format(plan=plan_str, feedback=feedback)
    )
    return _parse_subtasks(response)

The Plan-Validate-Refine Loop

Combining these components into an iterative improvement loop:

python

def iterative_planning(
    goal: str, tools: list[str], llm_call, max_iterations: int = 3
) -> list[str]:
    """Generate and iteratively refine a plan.

    Args:
        goal: The goal to plan for.
        tools: Available tools.
        llm_call: Function to call the LLM.
        max_iterations: Maximum refinement iterations.

    Returns:
        The final refined plan.
    """
    # Generate initial plan
    plan = generate_plan(goal, tools, llm_call)
    print(f"Initial plan: {len(plan)} steps")

    for iteration in range(max_iterations):
        # Validate
        validation = validate_plan(goal, plan, tools, llm_call)

        if not validation["needs_revision"]:
            print(f"Plan validated after {iteration + 1} iteration(s).")
            return plan

        # Refine
        print(f"Iteration {iteration + 1}: Refining plan based on feedback...")
        plan = refine_plan(plan, validation["validation_feedback"], llm_call)

    print(f"Reached max iterations ({max_iterations}). Returning best plan.")
    return plan

065. Dynamic Replanning

When Plans Meet Reality

The military strategist Helmuth von Moltke famously said: "No plan survives contact with the enemy." The same is true for AI agents -- no plan survives contact with reality unchanged. The ability to adapt when things go wrong is what separates a robust agent from a fragile one.

Key Insight: Planning ability is necessary but not sufficient for agency. The real test of an agent is not whether it can make a plan, but whether it can recover when the plan fails. Dynamic replanning -- detecting failures, diagnosing causes, and generating alternative approaches -- is what makes agents robust in real-world environments.

Agents encounter:

Tool failures: An API returns an error, a file does not exist.
Unexpected results: A search returns no results, code produces wrong output.
Changed conditions: New information invalidates previous assumptions.
Blocked steps: A step cannot be executed due to missing permissions or resources.

Dynamic replanning allows agents to adapt:

python

class DynamicPlanExecutor:
    """Execute a plan with dynamic replanning on failure.

    When a step fails, the executor asks the LLM to generate an
    alternative approach or revise the remaining plan.
    """

    def __init__(self, llm_call, tools: dict[str, callable]):
        self.llm_call = llm_call
        self.tools = tools

    def execute_with_replanning(
        self, goal: str, initial_plan: list[str], max_replans: int = 3
    ) -> dict:
        """Execute a plan, replanning when steps fail.

        Args:
            goal: The original goal.
            initial_plan: The initial plan.
            max_replans: Maximum number of times to replan.

        Returns:
            Execution results including any replanning events.
        """
        plan = list(initial_plan)
        execution_log = []
        replans = 0
        step_idx = 0

        while step_idx < len(plan):
            step = plan[step_idx]
            print(f"Executing step {step_idx + 1}/{len(plan)}: {step}")

            try:
                result = self._execute_step(step)
                execution_log.append({
                    "step": step,
                    "status": "success",
                    "result": result,
                })
                print(f"  Success: {result[:100]}")
                step_idx += 1

            except Exception as e:
                error_msg = str(e)
                print(f"  Failed: {error_msg}")
                execution_log.append({
                    "step": step,
                    "status": "failed",
                    "error": error_msg,
                })

                if replans >= max_replans:
                    print("  Max replans reached. Aborting.")
                    break

                # Replan
                print("  Replanning...")
                remaining_steps = plan[step_idx:]
                completed_context = "\n".join(
                    f"- {log['step']}: {log.get('result', log.get('error', 'N/A'))[:100]}"
                    for log in execution_log
                )

                new_plan = self._replan(
                    goal, completed_context, remaining_steps, error_msg
                )

                if new_plan:
                    plan = plan[:step_idx] + new_plan
                    replans += 1
                    print(f"  New plan ({len(new_plan)} steps remaining)")
                else:
                    print("  Replanning failed. Skipping step.")
                    step_idx += 1

        return {
            "goal": goal,
            "execution_log": execution_log,
            "replans": replans,
            "completed": step_idx >= len(plan),
        }

    def _execute_step(self, step: str) -> str:
        """Execute a single step using available tools."""
        for tool_name, tool_fn in self.tools.items():
            if tool_name.lower() in step.lower():
                return tool_fn(step)
        # If no tool matches, use the LLM to reason about the step
        return self.llm_call(prompt=f"Execute this step and report the result: {step}")

    def _replan(
        self, goal: str, completed: str, failed_steps: list[str], error: str
    ) -> list[str] | None:
        """Generate a new plan given a failure.

        Args:
            goal: The original goal.
            completed: Summary of completed steps.
            failed_steps: The steps that failed or remain.
            error: The error that triggered replanning.

        Returns:
            A new list of steps, or None if replanning fails.
        """
        prompt = f"""A plan step failed during execution. Generate a revised plan
to achieve the goal, taking into account what has already been completed
and the error encountered.

Goal: {goal}

Completed steps:
{completed}

Failed step: {failed_steps[0] if failed_steps else 'N/A'}
Error: {error}

Remaining original steps:
{chr(10).join(f'- {s}' for s in failed_steps[1:])}

Generate a revised plan for the remaining work (avoid repeating completed steps):"""

        response = self.llm_call(prompt=prompt)
        steps = _parse_subtasks(response)
        return steps if steps else None

Replanning Strategies

Different situations call for different replanning strategies:

Failure Type	Strategy	Example
Tool temporarily unavailable	Retry with backoff	API rate limit hit
Tool returns error	Alternative tool	Switch from one search engine to another
Step produces wrong result	Modify step	Adjust query parameters
Step is fundamentally infeasible	Alternative approach	Rewrite the plan to use a different method
Goal itself is unclear	Clarification	Ask the user for more specifics

076. Task Graphs and Dependency Management

Beyond Linear Plans

Many real-world tasks are not purely sequential. Consider a software project: frontend and backend development can happen in parallel, but integration testing depends on both being complete. A database schema must be designed before the backend can implement queries, but the frontend can be built in parallel with the schema design.

A linear plan (step 1, step 2, step 3...) cannot express these relationships. A task graph -- a directed acyclic graph (DAG) where nodes are tasks and edges are dependencies -- captures them naturally. This is the same concept used by build systems (Make, Gradle), workflow orchestrators (Airflow, Prefect), and CI/CD pipelines.

Try It Yourself: Think about cooking a complete dinner with multiple dishes. Which tasks can be done in parallel (e.g., boiling pasta while grilling vegetables)? Which tasks have dependencies (e.g., cannot plate the dish before cooking it)? Draw a task graph for a three-course dinner. You will find that a good task graph significantly reduces total cooking time compared to a sequential approach.

python

from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
    PENDING = "pending"
    READY = "ready"      # All dependencies met
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class TaskNode:
    """A node in the task graph."""

    id: str
    description: str
    dependencies: list[str] = field(default_factory=list)  # IDs of prerequisite tasks
    status: TaskStatus = TaskStatus.PENDING
    result: str | None = None
    error: str | None = None

    def is_ready(self, completed_tasks: set[str]) -> bool:
        """Check if all dependencies are satisfied."""
        return all(dep in completed_tasks for dep in self.dependencies)


class TaskGraph:
    """A directed acyclic graph (DAG) of tasks with dependency management.

    Supports parallel execution of independent tasks and respects
    ordering constraints.
    """

    def __init__(self):
        self.nodes: dict[str, TaskNode] = {}

    def add_task(
        self, task_id: str, description: str, dependencies: list[str] | None = None
    ) -> None:
        """Add a task to the graph."""
        deps = dependencies or []
        # Validate dependencies exist
        for dep in deps:
            if dep not in self.nodes:
                raise ValueError(f"Dependency '{dep}' not found. Add it first.")
        self.nodes[task_id] = TaskNode(
            id=task_id, description=description, dependencies=deps
        )

    def get_ready_tasks(self) -> list[TaskNode]:
        """Get all tasks that are ready to execute (dependencies satisfied)."""
        completed = {
            node_id for node_id, node in self.nodes.items()
            if node.status == TaskStatus.COMPLETED
        }
        ready = []
        for node in self.nodes.values():
            if node.status == TaskStatus.PENDING and node.is_ready(completed):
                node.status = TaskStatus.READY
                ready.append(node)
        return ready

    def mark_completed(self, task_id: str, result: str) -> None:
        """Mark a task as completed with its result."""
        self.nodes[task_id].status = TaskStatus.COMPLETED
        self.nodes[task_id].result = result

    def mark_failed(self, task_id: str, error: str) -> None:
        """Mark a task as failed."""
        self.nodes[task_id].status = TaskStatus.FAILED
        self.nodes[task_id].error = error
        # Also skip dependent tasks
        self._propagate_failure(task_id)

    def _propagate_failure(self, failed_id: str) -> None:
        """Skip all tasks that depend on a failed task."""
        for node in self.nodes.values():
            if failed_id in node.dependencies and node.status == TaskStatus.PENDING:
                node.status = TaskStatus.SKIPPED
                node.error = f"Skipped because dependency '{failed_id}' failed."
                self._propagate_failure(node.id)

    def is_complete(self) -> bool:
        """Check if all tasks are completed (or skipped/failed)."""
        return all(
            node.status in (TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.SKIPPED)
            for node in self.nodes.values()
        )

    def get_execution_order(self) -> list[list[str]]:
        """Get the topological execution order as batches of parallel tasks.

        Returns:
            List of batches, where each batch contains tasks that can run in parallel.
        """
        completed: set[str] = set()
        remaining = set(self.nodes.keys())
        batches = []

        while remaining:
            batch = []
            for task_id in remaining:
                node = self.nodes[task_id]
                if all(dep in completed for dep in node.dependencies):
                    batch.append(task_id)

            if not batch:
                raise ValueError("Circular dependency detected!")

            batches.append(batch)
            completed.update(batch)
            remaining -= set(batch)

        return batches

    def visualize(self) -> str:
        """Create a text visualization of the task graph."""
        lines = ["Task Graph:"]
        batches = self.get_execution_order()
        for i, batch in enumerate(batches):
            lines.append(f"\n  Batch {i + 1} (parallel):")
            for task_id in batch:
                node = self.nodes[task_id]
                deps = f" (depends on: {', '.join(node.dependencies)})" if node.dependencies else ""
                status = f" [{node.status.value}]"
                lines.append(f"    - {task_id}: {node.description}{deps}{status}")
        return "\n".join(lines)


# Example usage
def build_example_graph() -> TaskGraph:
    """Build an example task graph for a data analysis project."""
    graph = TaskGraph()

    # Independent tasks (can run in parallel)
    graph.add_task("fetch_data", "Download dataset from API")
    graph.add_task("setup_env", "Set up Python virtual environment and install dependencies")

    # Depends on fetch_data
    graph.add_task("clean_data", "Clean and preprocess the raw data", ["fetch_data"])
    graph.add_task("explore_data", "Perform exploratory data analysis", ["fetch_data"])

    # Depends on clean_data and setup_env
    graph.add_task("train_model", "Train the ML model", ["clean_data", "setup_env"])

    # Depends on train_model and explore_data
    graph.add_task("evaluate", "Evaluate model on test set", ["train_model"])
    graph.add_task("generate_report", "Generate analysis report", ["evaluate", "explore_data"])

    return graph

Generating Task Graphs from Natural Language

An LLM can generate task graphs by identifying both tasks and their dependencies:

python

TASK_GRAPH_PROMPT = """Analyze the following goal and generate a task graph.
For each task, identify:
1. A unique short ID
2. A description
3. Which other tasks it depends on (by ID)

Output format (one task per line):
ID | Description | Dependencies (comma-separated IDs, or "none")

Goal: {goal}

Tasks:"""


def generate_task_graph(goal: str, llm_call) -> TaskGraph:
    """Generate a task graph from a natural language goal."""
    response = llm_call(prompt=TASK_GRAPH_PROMPT.format(goal=goal))

    graph = TaskGraph()
    tasks_to_add = []

    # Parse the response
    for line in response.strip().split("\n"):
        parts = [p.strip() for p in line.split("|")]
        if len(parts) >= 3:
            task_id = parts[0].lower().replace(" ", "_")
            description = parts[1]
            deps_str = parts[2]
            deps = (
                [] if deps_str.lower() == "none"
                else [d.strip().lower().replace(" ", "_") for d in deps_str.split(",")]
            )
            tasks_to_add.append((task_id, description, deps))

    # Add tasks in dependency order
    added = set()
    max_passes = len(tasks_to_add) * 2
    passes = 0
    while tasks_to_add and passes < max_passes:
        remaining = []
        for task_id, description, deps in tasks_to_add:
            if all(d in added for d in deps):
                graph.add_task(task_id, description, deps)
                added.add(task_id)
            else:
                remaining.append((task_id, description, deps))
        tasks_to_add = remaining
        passes += 1

    return graph

087. Decomposition Strategies

Top-Down Decomposition

Start with the high-level goal and recursively break it down into smaller subtasks. This is the most natural approach for LLMs and mirrors how humans plan complex projects.

Advantages: Maintains coherence with the overall goal; ensures all parts serve the main objective. Disadvantages: May miss bottom-up insights; early decomposition choices constrain later options.

Iterative Refinement

Start with a rough plan and progressively refine each step:

python

def iterative_refinement(
    goal: str, llm_call, refinement_rounds: int = 3
) -> list[str]:
    """Iteratively refine a plan through multiple passes.

    Round 1: Generate a high-level outline (3-5 major phases)
    Round 2: Expand each phase into specific steps
    Round 3: Add details, error handling, and validation steps
    """
    # Round 1: High-level outline
    outline = llm_call(
        prompt=f"Create a high-level outline (3-5 phases) for: {goal}"
    )

    # Round 2: Expand each phase
    expanded = llm_call(
        prompt=(
            f"Expand each phase of this outline into 2-4 specific, actionable steps.\n\n"
            f"Outline:\n{outline}\n\nExpanded plan:"
        )
    )

    # Round 3: Add details
    detailed = llm_call(
        prompt=(
            f"Review this plan and add any missing steps, error handling, "
            f"and validation checkpoints.\n\n"
            f"Plan:\n{expanded}\n\nDetailed plan:"
        )
    )

    return _parse_subtasks(detailed)

Least-to-Most Prompting

Introduced by Zhou et al. (2023), this approach decomposes a complex problem into a series of simpler subproblems, solving them from the simplest to the most complex. Each solution builds on the previous ones.

text

Complex problem: "Calculate the total cost of a round trip from Madrid to Tokyo,
including flights, hotels for 5 nights, and daily meals."

Decomposition:
1. What is the average cost of a round-trip flight from Madrid to Tokyo?
2. What is the average cost of a hotel in Tokyo per night?
3. What is the average daily meal budget in Tokyo?
4. Given answers to 1-3, what is the total cost?

098. Planning with Uncertainty and Incomplete Information

The Challenge

Real-world planning often involves:

Uncertain outcomes: Actions may succeed or fail probabilistically.
Incomplete information: The agent does not know everything about the current state.
Dynamic environments: The world changes while the agent is executing its plan.

Contingency Planning

Generate plans with explicit branches for different outcomes:

python

CONTINGENCY_PROMPT = """Generate a plan for the following goal, including
contingency branches for steps that might fail.

For each step, specify:
- The primary action
- What could go wrong
- The fallback action if it fails

Goal: {goal}

Plan with contingencies:"""


@dataclass
class ContingencyStep:
    """A plan step with contingency handling."""

    primary_action: str
    potential_failure: str
    fallback_action: str

    def to_text(self) -> str:
        return (
            f"DO: {self.primary_action}\n"
            f"  IF FAILS ({self.potential_failure}): {self.fallback_action}"
        )

Belief-Space Planning

Rather than planning over world states, plan over belief states (probability distributions over possible states). This is relevant for agents operating with partial observability:

text

Belief: "The file is probably in /data/ (80%) or /backup/ (20%)"

Plan:
1. Check /data/ for the file.
2. IF found: proceed with processing.
3. IF not found: check /backup/.
4. IF still not found: ask the user for the file location.

109. Comparison: Tree of Thoughts and LATS

The methods we have discussed so far (plan-and-execute, hierarchical decomposition, replanning) all generate a single plan and then try to execute it. But what if the first plan is not the best? What if there are multiple promising approaches and we want to explore several before committing?

Tree of Thoughts and Language Agent Tree Search take a different approach: they systematically explore multiple planning paths, evaluate each, and choose the best one. This trades computational cost for plan quality.

Tree of Thoughts (ToT)

Tree of Thoughts (Yao et al., 2023) extends chain-of-thought prompting by exploring multiple reasoning paths simultaneously. Instead of committing to one chain of reasoning, the system generates several candidates at each step and evaluates which are most promising -- much like a chess player considering several possible moves before choosing one. Instead of a single linear chain of thoughts, the LLM generates multiple candidate "thoughts" at each step and evaluates them to decide which paths to pursue.

Interactive · Tree of Thoughts Search with Pruning

Prompting lab

Switch strategies, watch the output

The same problem under four different prompting strategies. The difference is easier to see than to describe — click and compare.

Task

Strategy

Prompt sent to the model

System

Think step by step before answering.

User

A shirt costs 24 €. With a 25% discount applied, what do I pay?

Model outputChain-of-Thought

Let's think step by step.
1. Discount = 24 × 0.25 = 6
2. Final price = 24 − 6 = 18
Answer: 18 €

Key components:

Thought generation: The LLM proposes multiple candidate next steps.
State evaluation: The LLM (or a heuristic) evaluates how promising each partial solution is.
Search algorithm: BFS or DFS with pruning based on evaluations.

python

def tree_of_thoughts(
    problem: str,
    llm_call,
    num_thoughts: int = 3,
    max_depth: int = 4,
    beam_width: int = 2,
) -> str:
    """Simplified Tree of Thoughts implementation.

    Uses beam search over multiple reasoning paths.

    Args:
        problem: The problem to solve.
        llm_call: Function to call the LLM.
        num_thoughts: Number of candidate thoughts to generate at each step.
        max_depth: Maximum depth of the thought tree.
        beam_width: Number of paths to keep at each step.

    Returns:
        The best solution found.
    """
    # Initialize with the problem
    beams = [{"path": [], "text": problem, "score": 0.0}]

    for depth in range(max_depth):
        all_candidates = []

        for beam in beams:
            # Generate candidate next thoughts
            thoughts = llm_call(
                prompt=(
                    f"Problem: {problem}\n\n"
                    f"Progress so far:\n{beam['text']}\n\n"
                    f"Generate {num_thoughts} different possible next steps. "
                    f"Number each option."
                )
            )

            # Parse and evaluate each thought
            for thought in _parse_thoughts(thoughts):
                # Evaluate this partial solution
                evaluation = llm_call(
                    prompt=(
                        f"Rate how promising this partial solution is (1-10):\n\n"
                        f"Problem: {problem}\n"
                        f"Progress: {beam['text']}\n"
                        f"Next step: {thought}\n\n"
                        f"Score (just the number):"
                    )
                )

                try:
                    score = float(evaluation.strip().split()[0])
                except (ValueError, IndexError):
                    score = 5.0

                all_candidates.append({
                    "path": beam["path"] + [thought],
                    "text": f"{beam['text']}\n{thought}",
                    "score": score,
                })

        # Keep top beam_width candidates
        all_candidates.sort(key=lambda x: x["score"], reverse=True)
        beams = all_candidates[:beam_width]

    # Return the best path
    return beams[0]["text"] if beams else "No solution found."


def _parse_thoughts(response: str) -> list[str]:
    """Parse numbered thoughts from LLM response."""
    thoughts = []
    for line in response.strip().split("\n"):
        line = line.strip()
        if line and line[0].isdigit():
            thought = line.lstrip("0123456789.)- ").strip()
            if thought:
                thoughts.append(thought)
    return thoughts

Language Agent Tree Search (LATS)

LATS (Zhou et al., 2024) combines LLM-based planning with Monte Carlo Tree Search (MCTS). It treats the agent's decision process as a tree search problem:

Selection: Navigate the tree using UCB1 (Upper Confidence Bound) to balance exploration and exploitation.
Expansion: Use the LLM to generate new action candidates.
Simulation: Use the LLM to simulate the outcome of each action.
Backpropagation: Update the scores of visited nodes based on the simulation results.
Reflection: When a trajectory fails, generate a reflection that informs future attempts.

LATS is particularly powerful because it:

Systematically explores the action space (unlike greedy approaches).
Learns from failures through reflection.
Balances trying new approaches with exploiting known good paths.

Comparison

Feature	Chain of Thought	Tree of Thoughts	LATS
Exploration	Single path	Multiple paths	Systematic tree search
Evaluation	None (implicit)	LLM evaluation at each step	MCTS + LLM evaluation
Backtracking	No	Limited (beam search)	Full (MCTS backpropagation)
Learning from failure	No	No	Yes (reflection)
Cost	Low (1 LLM call)	Medium (many evaluations)	High (many simulations)
Best for	Simple reasoning	Creative/divergent tasks	Complex multi-step tasks

1110. Practical Example: A Task Decomposition Agent

Let us build a complete task decomposition agent that breaks complex goals into manageable subtasks, creates a task graph, and executes it.

python

"""
A task decomposition agent that:
1. Takes a complex goal
2. Decomposes it into subtasks with dependencies
3. Builds a task graph
4. Executes tasks in dependency order
5. Replans on failure

Requirements:
    pip install openai  (or any LLM client)
"""

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SubTask:
    """A subtask produced by decomposition."""

    id: str
    description: str
    dependencies: list[str] = field(default_factory=list)
    estimated_complexity: str = "medium"  # low, medium, high
    status: str = "pending"
    result: str | None = None


DECOMPOSE_SYSTEM = """You are a task decomposition expert. Given a complex goal,
break it down into concrete, actionable subtasks.

Rules:
1. Each subtask should be completable in a single focused step.
2. Identify dependencies between subtasks explicitly.
3. Rate each subtask's complexity (low/medium/high).
4. Aim for 4-8 subtasks (not too few, not too many).
5. Make subtasks specific enough to execute without further clarification.

Output format (one task per line):
TASK_ID | Description | Dependencies (comma-separated IDs or "none") | Complexity
"""

EXECUTE_SYSTEM = """You are a task execution agent. Given a subtask and context
from previously completed subtasks, execute the task and report the result.

Be specific and thorough in your execution. If you produce any output
(text, code, analysis), include it in your response.

If you cannot complete the task, explain why specifically."""


class TaskDecompositionAgent:
    """An agent that decomposes and executes complex tasks."""

    def __init__(self, llm_call):
        self.llm_call = llm_call
        self.subtasks: dict[str, SubTask] = {}
        self.execution_log: list[dict] = []

    def decompose(self, goal: str) -> list[SubTask]:
        """Decompose a goal into subtasks.

        Args:
            goal: The complex goal to decompose.

        Returns:
            List of SubTask objects.
        """
        response = self.llm_call(
            system=DECOMPOSE_SYSTEM,
            prompt=f"Decompose this goal into subtasks:\n\n{goal}",
        )

        subtasks = self._parse_decomposition(response)
        for task in subtasks:
            self.subtasks[task.id] = task

        return subtasks

    def execute(self, goal: str, max_replans: int = 2) -> dict:
        """Decompose and execute a goal.

        Args:
            goal: The goal to achieve.
            max_replans: Maximum replanning attempts on failure.

        Returns:
            Execution report.
        """
        # Step 1: Decompose
        print(f"Goal: {goal}")
        print(f"{'='*60}")
        subtasks = self.decompose(goal)
        print(f"\nDecomposed into {len(subtasks)} subtasks:")
        for task in subtasks:
            deps = f" (after: {', '.join(task.dependencies)})" if task.dependencies else ""
            print(f"  [{task.id}] {task.description}{deps} [{task.estimated_complexity}]")

        # Step 2: Execute in dependency order
        print(f"\n{'='*60}")
        print("Executing...\n")
        completed = set()
        replans = 0

        while True:
            # Find ready tasks
            ready = [
                task for task in self.subtasks.values()
                if task.status == "pending"
                and all(dep in completed for dep in task.dependencies)
            ]

            if not ready:
                if all(t.status in ("completed", "failed", "skipped") for t in self.subtasks.values()):
                    break
                else:
                    print("Deadlock: no tasks are ready but some are still pending.")
                    break

            for task in ready:
                print(f"Executing [{task.id}]: {task.description}")

                # Build context from completed tasks
                context = self._build_context(completed)

                try:
                    result = self._execute_subtask(task, context)
                    task.status = "completed"
                    task.result = result
                    completed.add(task.id)
                    self.execution_log.append({
                        "task_id": task.id,
                        "status": "completed",
                        "result": result[:200],
                        "timestamp": datetime.now().isoformat(),
                    })
                    print(f"  Completed: {result[:150]}...")

                except Exception as e:
                    error = str(e)
                    print(f"  Failed: {error}")
                    task.status = "failed"
                    self.execution_log.append({
                        "task_id": task.id,
                        "status": "failed",
                        "error": error,
                        "timestamp": datetime.now().isoformat(),
                    })

                    # Replan if possible
                    if replans < max_replans:
                        print("  Attempting to replan...")
                        new_task = self._replan_task(task, error, context)
                        if new_task:
                            self.subtasks[new_task.id] = new_task
                            replans += 1
                            print(f"  Created replacement task [{new_task.id}]: {new_task.description}")

        # Step 3: Generate report
        report = self._generate_report(goal)
        return report

    def _execute_subtask(self, task: SubTask, context: str) -> str:
        """Execute a single subtask."""
        prompt = (
            f"Execute the following subtask:\n\n"
            f"Task: {task.description}\n\n"
            f"Context from previous steps:\n{context}\n\n"
            f"Execute the task and provide the result:"
        )
        return self.llm_call(system=EXECUTE_SYSTEM, prompt=prompt)

    def _build_context(self, completed_ids: set[str]) -> str:
        """Build context string from completed tasks."""
        if not completed_ids:
            return "(No previous steps completed yet.)"

        lines = []
        for task_id in completed_ids:
            task = self.subtasks[task_id]
            result_preview = task.result[:300] if task.result else "N/A"
            lines.append(f"[{task_id}] {task.description}\n  Result: {result_preview}")
        return "\n".join(lines)

    def _replan_task(self, failed_task: SubTask, error: str, context: str) -> SubTask | None:
        """Generate a replacement task for a failed one."""
        prompt = (
            f"The following task failed:\n"
            f"Task: {failed_task.description}\n"
            f"Error: {error}\n\n"
            f"Context:\n{context}\n\n"
            f"Generate an alternative approach to accomplish the same objective. "
            f"Respond with a single task description."
        )
        response = self.llm_call(prompt=prompt)
        if response.strip():
            new_id = f"{failed_task.id}_retry"
            return SubTask(
                id=new_id,
                description=response.strip(),
                dependencies=failed_task.dependencies,
                estimated_complexity=failed_task.estimated_complexity,
            )
        return None

    def _parse_decomposition(self, response: str) -> list[SubTask]:
        """Parse the LLM's decomposition into SubTask objects."""
        subtasks = []
        for line in response.strip().split("\n"):
            parts = [p.strip() for p in line.split("|")]
            if len(parts) >= 3:
                task_id = parts[0].lower().replace(" ", "_")
                description = parts[1]
                deps_str = parts[2]
                deps = (
                    [] if deps_str.lower() == "none"
                    else [d.strip().lower().replace(" ", "_") for d in deps_str.split(",")]
                )
                complexity = parts[3].strip().lower() if len(parts) > 3 else "medium"
                subtasks.append(SubTask(
                    id=task_id,
                    description=description,
                    dependencies=deps,
                    estimated_complexity=complexity,
                ))
        return subtasks

    def _generate_report(self, goal: str) -> dict:
        """Generate a final execution report."""
        completed = [t for t in self.subtasks.values() if t.status == "completed"]
        failed = [t for t in self.subtasks.values() if t.status == "failed"]
        skipped = [t for t in self.subtasks.values() if t.status == "skipped"]

        return {
            "goal": goal,
            "total_subtasks": len(self.subtasks),
            "completed": len(completed),
            "failed": len(failed),
            "skipped": len(skipped),
            "success_rate": len(completed) / max(len(self.subtasks), 1),
            "execution_log": self.execution_log,
            "subtask_results": {
                task_id: {
                    "description": task.description,
                    "status": task.status,
                    "result_preview": (task.result[:200] if task.result else None),
                }
                for task_id, task in self.subtasks.items()
            },
        }


# ── Usage Example ─────────────────────────────────────────────────

def main():
    """Demonstrate the task decomposition agent."""

    def mock_llm_call(system: str = "", prompt: str = "") -> str:
        """Simulate LLM calls for demonstration."""
        if "Decompose" in prompt:
            return """t1 | Research existing sentiment analysis approaches | none | low
t2 | Select and download a suitable dataset | none | low
t3 | Set up Python environment with required libraries | none | low
t4 | Implement data preprocessing pipeline | t2, t3 | medium
t5 | Train a baseline sentiment classifier | t4 | medium
t6 | Evaluate model and analyze errors | t5 | medium
t7 | Write up results and conclusions | t6, t1 | low"""
        elif "Execute" in prompt or "Execute" in system:
            if "research" in prompt.lower():
                return (
                    "Surveyed recent sentiment analysis approaches. Key findings: "
                    "BERT-based models achieve ~95% accuracy on SST-2. DistilBERT "
                    "offers a good speed/accuracy trade-off. Recent work explores "
                    "aspect-based sentiment analysis for finer-grained results."
                )
            elif "dataset" in prompt.lower():
                return (
                    "Selected the Stanford Sentiment Treebank (SST-2) dataset. "
                    "Downloaded 67,349 training samples and 872 test samples. "
                    "Binary sentiment labels (positive/negative)."
                )
            elif "environment" in prompt.lower():
                return (
                    "Created virtual environment. Installed: torch 2.1, "
                    "transformers 4.36, datasets 2.16, scikit-learn 1.3."
                )
            elif "preprocessing" in prompt.lower():
                return (
                    "Implemented preprocessing: tokenization with BertTokenizer, "
                    "max_length=128, padding and truncation. Created DataLoader "
                    "with batch_size=32."
                )
            elif "train" in prompt.lower():
                return (
                    "Trained DistilBERT for 3 epochs. Training loss decreased from "
                    "0.42 to 0.12. Validation accuracy: 91.2%."
                )
            elif "evaluate" in prompt.lower():
                return (
                    "Test accuracy: 90.8%. Precision: 0.91, Recall: 0.90, F1: 0.905. "
                    "Error analysis: model struggles with sarcasm and negation."
                )
            elif "write" in prompt.lower():
                return (
                    "Report completed. Key findings: DistilBERT achieves 90.8% accuracy "
                    "on SST-2 with 3 epochs of fine-tuning. Main failure modes are "
                    "sarcasm detection and complex negation patterns."
                )
        return "Task completed successfully."

    # Create and run the agent
    agent = TaskDecompositionAgent(mock_llm_call)
    report = agent.execute(
        "Build a sentiment analysis model: research approaches, train a model, "
        "evaluate it, and write up the results."
    )

    print(f"\n{'='*60}")
    print("EXECUTION REPORT")
    print(f"{'='*60}")
    print(f"Goal: {report['goal']}")
    print(f"Subtasks: {report['total_subtasks']}")
    print(f"Completed: {report['completed']}")
    print(f"Failed: {report['failed']}")
    print(f"Success rate: {report['success_rate']:.0%}")


if __name__ == "__main__":
    main()

12Discussion Questions

Planning horizon: How far ahead should an agent plan? Is it better to plan the entire task upfront or plan incrementally (one or two steps at a time)? What factors influence this decision? Hint: consider the trade-off between coherence (long-horizon planning keeps the big picture in view) and adaptability (short-horizon planning can react to unexpected results). Think about how certainty degrades the further you look ahead.
Plan faithfulness: When an LLM generates a plan and then executes it, should it follow the plan exactly or adapt freely? What is the right balance between plan adherence and opportunistic adaptation? Hint: this is analogous to the explore-exploit trade-off. Too much adherence leads to rigidity; too much adaptation leads to losing the thread of the original plan.
Planning for humans vs. planning for agents: How should plans generated for agent execution differ from plans generated for human execution? Consider granularity, error handling, and parallelism. Hint: agents need more explicit error handling ("if step 3 fails, try alternative approach X") because they cannot use common sense to recover from unexpected situations.
The planning bottleneck: Planning is often the most error-prone part of an agent pipeline. How might we improve plan quality beyond what current LLMs can produce? Hint: consider plan verification (have another LLM check the plan), plan libraries (reuse plans that worked before), and hybrid approaches (use classical planners for the parts that can be formalized).
Classical meets modern: Could classical planning algorithms (STRIPS, HTN) be effectively combined with LLM-based approaches? What would that look like? Hint: the LLM could generate a PDDL domain description from natural language, or could fill in the details of an HTN plan skeleton.
Planning under resource constraints: How should an agent plan differently when it has a limited budget (e.g., maximum 10 API calls, maximum $1 in costs)? Hint: consider how humans plan when they have limited time: they focus on the most important steps, skip nice-to-haves, and build in less margin for error. How could an agent do the same?

13Summary and Key Takeaways

Planning is fundamental to agency. Without the ability to decompose goals into steps, reason about dependencies, and adapt to failures, agents are limited to reactive, single-step interactions.
Classical planning provides formal guarantees (correctness, completeness, optimality) but requires formal domain specifications that are impractical for many real-world tasks.
LLM-based planning is flexible and general but offers no formal guarantees. Plans may contain errors, miss steps, or hallucinate impossible actions.
Hierarchical decomposition manages complexity by breaking large goals into progressively smaller subtasks, each at an appropriate level of abstraction.
Plan validation and refinement are essential. Self-validation, cross-validation with critic agents, and simulation-based testing all improve plan quality before execution.
Dynamic replanning enables robustness. Agents that can detect failures and adapt their plans are more capable than agents that follow rigid plans.
Task graphs with dependency management enable parallel execution and provide clear structure for complex multi-step tasks.
Advanced approaches like ToT and LATS systematically explore the planning space, trading computational cost for better solutions. LATS's use of reflection is particularly promising.

14References

Fikes, R. E., & Nilsson, N. J. (1971). STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artificial Intelligence, 2(3-4), 189-208.
McDermott, D., Ghallab, M., Howe, A., Knoblock, C., Ram, A., Veloso, M., Weld, D., & Wilkins, D. (1998). PDDL -- The Planning Domain Definition Language. Technical Report CVC TR-98-003, Yale Center for Computational Vision and Control.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems (NeurIPS), 36.
Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., & Wang, Y.-X. (2024). Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. International Conference on Machine Learning (ICML).
Zhou, D., Scharli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., & Chi, E. (2023). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. International Conference on Learning Representations (ICLR).
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems (NeurIPS), 35.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291.
Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. International Conference on Machine Learning (ICML).
Valmeekam, K., Marquez, M., Sreedharan, S., & Kambhampati, S. (2024). On the Planning Abilities of Large Language Models -- A Critical Investigation. Advances in Neural Information Processing Systems (NeurIPS), 36.

Part of "Agentic AI: Foundations, Architectures, and Applications" (CC BY-SA 4.0).