
@donggyun112

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable): #3971

2. Or, if no issue exists, describe the change:

Problem:

When execution is interrupted (e.g., server restart, browser refresh, connection loss) after a function_call but before the function_response is saved, the session becomes permanently unrecoverable. Anthropic and OpenAI APIs require tool_calls to be immediately followed by tool_results, so subsequent requests fail with 400 BadRequest, creating a crash loop.

Solution:

Detect orphaned function_calls (calls without matching responses) during content processing and inject synthetic error responses to gracefully recover the session.
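
For concreteness, a minimal before/after sketch of the shapes involved, using plain dicts as stand-ins for the ADK's actual event and content types:

```python
# History as persisted at the moment of interruption: the model's
# function_call was saved, but the matching function_response never was.
broken_history = [
    {"role": "user", "parts": [{"text": "What's the weather in Paris?"}]},
    {
        "role": "model",
        "parts": [{
            "function_call": {
                "id": "call_123",  # illustrative ID
                "name": "get_weather",
                "args": {"city": "Paris"},
            }
        }],
    },
    # <-- interruption: no function_response for call_123 was ever saved
]

# Synthetic response injected at request time so that strict providers
# (Anthropic, OpenAI) accept the history again and the session can continue.
synthetic_response = {
    "role": "user",
    "parts": [{
        "function_response": {
            "id": "call_123",
            "name": "get_weather",
            "response": {"error": "Tool execution was interrupted."},
        }
    }],
}
```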

Why this approach:

Two approaches were considered:

| Approach | Description | Pros | Cons |
| --- | --- | --- | --- |
| 1. Separation of Concerns | Separate `_find_orphaned_function_calls()` + `_create_synthetic_response_event()` functions, called after `_rearrange_events_for_async_function_responses_in_history()` | Clear separation of responsibilities; easy to test independently; self-documenting code | Extra O(N) event iteration; duplicates ID-mapping logic already in the rearrange function |
| 2. Single-Pass Integration | Extend `_rearrange_events_for_async_function_responses_in_history()` with a `heal_orphaned_calls` parameter; detect orphaned calls during the existing loop | Reuses the existing `function_call_id_to_response_events_index` mapping; no duplicate iteration; better performance | Slightly increases function complexity; mixes responsibilities |

Decision: Chose Approach 2, sketched after this list, for the following reasons:

  1. The existing _rearrange_events_for_async_function_responses_in_history() already builds a function_call_id_to_response_events_index mapping; reusing it avoids redundant work
  2. Orphaned-call detection is logically part of the "rearrangement" process (pairing calls with responses)
  3. Avoids an extra O(N) iteration over events
  4. The heal_orphaned_calls=False default maintains backward compatibility
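
A minimal sketch of what Approach 2 looks like in code, assuming the rearrange function indexes responses by call ID as described above. `_index_responses_by_call_id` and `_get_function_calls` are hypothetical stand-ins for the existing internals; `_create_synthetic_response_for_orphaned_calls` is the helper this PR adds:

```python
import logging

logger = logging.getLogger(__name__)


def _rearrange_events_for_async_function_responses_in_history(
    events, heal_orphaned_calls=False
):
  # Mapping built once by the existing rearrangement pass
  # (hypothetical helper name).
  call_id_to_response_idx = _index_responses_by_call_id(events)

  result_events = []
  for event in events:
    result_events.append(event)
    for call in _get_function_calls(event):  # hypothetical helper name
      if call.id in call_id_to_response_idx:
        continue  # Normal case: a real response exists and is rearranged.
      if heal_orphaned_calls:
        logger.warning(
            "Auto-healing orphaned function_call (id=%s, name=%s)",
            call.id,
            call.name,
        )
        result_events.append(
            _create_synthetic_response_for_orphaned_calls(event, [call])
        )
  return result_events
```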

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.
$ uv run pytest tests/unittests/flows/llm_flows/test_contents_function.py -v
# 12 passed (7 existing + 5 new)

New test cases (one is sketched after this list):

  • test_auto_healing_single_orphaned_function_call - single orphaned call
  • test_auto_healing_multiple_orphaned_function_calls - multiple orphaned calls in one event
  • test_auto_healing_partial_orphaned_function_calls - mix of completed and orphaned calls
  • test_auto_healing_no_healing_when_responses_exist - no false positives
  • test_auto_healing_logs_warning - warning log verification
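
For a sense of shape, a hedged sketch of the single-orphan case; the event constructors and entry point here are paraphrased stand-ins, while the actual tests go through the ADK's testing_utils:

```python
import pytest


@pytest.mark.asyncio
async def test_auto_healing_single_orphaned_function_call():
  # One model event carrying a function_call with no matching response.
  events = [
      make_user_event("check the weather"),  # hypothetical helper
      make_function_call_event(call_id="call_1", name="get_weather"),
  ]

  healed = rearrange_with_healing(events)  # stands in for the real call

  # A synthetic function_response should directly follow the orphaned call.
  response = get_function_responses(healed[-1])[0]  # hypothetical helper
  assert response.id == "call_1"
  assert response.response == {"error": "Tool execution was interrupted."}
```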

Manual End-to-End (E2E) Tests:

Reproduced the issue using a test script that sends broken message history (tool_call without tool_result) to Anthropic/OpenAI/Gemini APIs. Before the fix, all non-Gemini models returned 400 BadRequest. After the fix, the synthetic error response allows the session to continue.
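
The repro boils down to replaying a history that violates the tool_use/tool_result pairing rule. A sketch of the Anthropic variant (model name, IDs, and tool schema are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

try:
    client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=256,
        tools=[{
            "name": "get_weather",
            "description": "Get the weather for a city.",
            "input_schema": {"type": "object", "properties": {}},
        }],
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            {
                "role": "assistant",
                "content": [{
                    "type": "tool_use",
                    "id": "toolu_01",
                    "name": "get_weather",
                    "input": {"city": "Paris"},
                }],
            },
            # Missing: a user turn carrying a tool_result for toolu_01.
            {"role": "user", "content": "continue"},
        ],
    )
except anthropic.BadRequestError as err:
    # Before the fix, this is the crash loop: every retry replays the
    # same broken history and gets the same 400 back.
    print(err)
```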

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Changes:

| File | Description |
| --- | --- |
| `src/google/adk/flows/llm_flows/contents.py` | Add `_create_synthetic_response_for_orphaned_calls()` helper; extend `_rearrange_events_for_async_function_responses_in_history()` with a `heal_orphaned_calls` parameter |
| `tests/unittests/flows/llm_flows/test_contents_function.py` | Add 5 test cases for auto-healing behavior |

Key implementation details:

  1. Synthetic response format: {'error': 'Tool execution was interrupted.'} (follows existing error response pattern in codebase)
  2. Warning log: Auto-healing orphaned function_call (id=..., name=...) for debugging/monitoring
  3. Location: As specified by maintainer, detection occurs around line 445 in _get_contents()
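
Putting items 1 and 2 together, a hedged sketch of the helper; the constant and helper names come from the commit message below, and author='user' reflects the review feedback, but the exact field plumbing in the PR may differ:

```python
from google.adk.events import Event
from google.genai import types

_ORPHANED_CALL_ERROR_RESPONSE = {"error": "Tool execution was interrupted."}


def _create_synthetic_response_for_orphaned_calls(event, orphaned_calls):
  """Builds one function_response event answering each orphaned call."""
  parts = [
      types.Part(
          function_response=types.FunctionResponse(
              id=call.id,
              name=call.name,
              response=_ORPHANED_CALL_ERROR_RESPONSE,
          )
      )
      for call in orphaned_calls
  ]
  return Event(
      invocation_id=event.invocation_id,
      author="user",  # function responses are authored by 'user' in the ADK
      content=types.Content(role="user", parts=parts),
  )
```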

Known Limitations & Future Work:

  1. Synthetic responses not persisted to session: Generated at LLM request time only, not saved to session storage, so UI/logs/telemetry may still show orphaned calls as "pending". Future consideration: should synthetic events be persisted? This requires a policy decision affecting session history integrity, replay scenarios, and multi-client sync.

  2. Repeated warning logs: logger.warning() is emitted each time _get_contents() processes an orphaned call, so if the session resumes multiple times before progressing, the same warning repeats. Future options: persist synthetic responses, deduplicate by call ID, or demote to logger.info() after the first occurrence (one possible shape is sketched below).
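
If deduplication is pursued later, one possible shape is a process-local seen-set keyed by call ID (purely illustrative; not part of this PR):

```python
import logging

logger = logging.getLogger(__name__)
_healed_call_ids: set[str] = set()  # process-local; resets on restart


def _log_auto_heal(call_id: str, name: str) -> None:
  """Warns once per orphaned call ID, then demotes repeats to info."""
  level = logging.INFO if call_id in _healed_call_ids else logging.WARNING
  _healed_call_ids.add(call_id)
  logger.log(
      level,
      "Auto-healing orphaned function_call (id=%s, name=%s)",
      call_id,
      name,
  )
```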

These are intentionally left for future PRs to keep this fix focused and minimal.

@gemini-code-assist
Contributor

Summary of Changes

Hello @donggyun112, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical stability issue where an interrupted LLM session, specifically after a function call without a subsequent response, would lead to a permanent crash loop. By implementing an intelligent auto-healing process, the system now identifies these incomplete interactions and inserts a standardized error response, enabling the session to recover and proceed without manual intervention or API failures. This significantly improves the robustness and user experience of the LLM interaction flow.

Highlights

  • Crash Loop Prevention: This PR introduces an auto-healing mechanism to prevent crash loops that occur when an LLM session is interrupted after a function_call but before its corresponding function_response is saved. This issue previously led to unrecoverable sessions and 400 BadRequest errors from APIs like Anthropic and OpenAI.
  • Synthetic Error Responses: The solution detects these 'orphaned' function calls during content processing and injects synthetic error responses (e.g., {'error': 'Tool execution was interrupted.'}) into the event history. This allows the session to gracefully recover and continue.
  • Efficient Integration: The auto-healing logic is integrated into the existing _rearrange_events_for_async_function_responses_in_history() function, reusing its internal ID mapping and avoiding redundant event iterations for better performance.
  • Comprehensive Testing: Five new unit tests have been added to cover various scenarios, including single, multiple, and partial orphaned function calls, ensuring the healing mechanism works as expected and does not interfere with correctly handled calls. Warning log verification is also included.


@adk-bot adk-bot added the core [Component] This issue is related to the core interface and implementation label Dec 31, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This is an excellent pull request that addresses a critical crash loop issue with orphaned function calls. The solution is well-designed, efficiently integrated into the existing event processing logic, and accompanied by a comprehensive set of unit tests that cover various scenarios. I have one high-severity suggestion to ensure the synthetic event's authorship is consistent with the rest of the system, which will prevent potential issues in multi-agent scenarios. Overall, great work on this important fix.


```python
return Event(
    invocation_id=event.invocation_id,
    author=event.author,
```
Contributor

Severity: high

The synthetic event for the orphaned function call response should have 'user' as its author. Currently, it inherits the author from the function call event, which is the agent's name. In the ADK, function response events are consistently authored by the 'user'. Using the agent's name as the author could lead to incorrect processing in multi-agent scenarios, where it might be misinterpreted as a message from another agent. Setting the author to 'user' ensures it's treated as a direct response to the tool call, maintaining consistency with the framework's event structure.

Suggested change:

```diff
-    author=event.author,
+    author='user',
```

Contributor

@gemini-code-assist bot left a comment

Code Review

This is an excellent pull request that addresses a critical crash loop issue with a well-designed and efficient solution. The decision to integrate the healing logic into the existing event rearrangement function to avoid redundant iterations is a smart optimization. The code is clear, well-commented, and the addition of comprehensive unit tests ensures the fix is robust and covers various edge cases. The detailed pull request description is also exemplary and was very helpful for the review.


```python
return Event(
    invocation_id=event.invocation_id,
    author=event.author,
```
Contributor

Severity: high

The synthetic event is being created with the author of the original function_call event (an agent), but the content's role is set to 'user'. This is inconsistent. To align with how other function_response events are structured in the codebase (where they typically have author='user'), the author of this synthetic event should also be 'user' to match the content's role.

Suggested change:

```diff
-    author=event.author,
+    author='user',
```

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust fix to prevent crash loops caused by orphaned function_call events. The solution, which involves injecting synthetic error responses, is well-implemented. The decision to integrate this logic into an existing function for performance reasons is well-justified in the pull request description. The code is clean, and the accompanying unit tests are comprehensive, covering a wide range of scenarios. I have one suggestion regarding the new test code to improve its structure and reduce duplication. Overall, this is a high-quality contribution that effectively addresses the issue.

Comment on lines +610 to +614
```python
agent = Agent(model="gemini-2.5-flash", name="test_agent")
llm_request = LlmRequest(model="gemini-2.5-flash")
invocation_context = await testing_utils.create_invocation_context(
    agent=agent
)
```
Contributor

Severity: medium

This setup block is repeated across all the new test_auto_healing_* tests. To improve maintainability and reduce code duplication, consider extracting this common setup into a pytest fixture.

For example, you could define a fixture that yields the agent, llm_request, and invocation_context:

```python
@pytest.fixture
async def healing_test_setup():
    """Provides a common setup for auto-healing tests."""
    agent = Agent(model="gemini-2.5-flash", name="test_agent")
    llm_request = LlmRequest(model="gemini-2.5-flash")
    invocation_context = await testing_utils.create_invocation_context(
        agent=agent
    )
    yield agent, llm_request, invocation_context
```

Each test could then accept this fixture, making the test suite cleaner and easier to manage:

```python
@pytest.mark.asyncio
async def test_auto_healing_single_orphaned_function_call(healing_test_setup):
    agent, llm_request, invocation_context = healing_test_setup
    # ... rest of the test logic
```

@ryanaiagent ryanaiagent self-assigned this Jan 1, 2026
@ryanaiagent
Collaborator

Hi @donggyun112, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Could you please fix the lint errors so we can proceed with the review? You can use autoformat.sh.

@ryanaiagent ryanaiagent added the request clarification [Status] The maintainer need clarification or more information from the author label Jan 2, 2026
@donggyun112 donggyun112 force-pushed the fix/orphaned-tool-calls-crash-loop branch from 228c741 to a52aba8 Compare January 2, 2026 06:14
@donggyun112
Author

@ryanaiagent Thank you for reviewing my PR. The lint errors have been fixed.

Note: Known limitations (synthetic responses not persisted, repeated warning logs) are documented in the PR description and intentionally left for future PRs to keep this fix focused.

When execution is interrupted (e.g., server restart or connection loss)
after a function_call but before the function_response is saved, the
session becomes unrecoverable because Anthropic/OpenAI require tool_calls
to be immediately followed by tool_results.

This change detects orphaned function_calls and injects synthetic error
responses to gracefully recover the session.

Changes:
- Add _ORPHANED_CALL_ERROR_RESPONSE constant for error responses
- Add _create_synthetic_response_for_orphaned_calls helper function
- Detect orphaned calls in _rearrange_events_for_async_function_responses_in_history
- Use 'user' as author for synthetic function response events
- Add 5 comprehensive test cases for auto-healing behavior

Fixes google#3971
@donggyun112 donggyun112 force-pushed the fix/orphaned-tool-calls-crash-loop branch from a52aba8 to d3ee2b7 Compare January 2, 2026 11:42
@donggyun112
Author

/gemini review

Contributor

@gemini-code-assist bot left a comment

Code Review

This is a high-quality pull request that effectively addresses a critical issue causing crash loops with orphaned function calls. The problem, solution, and chosen approach are all clearly articulated in the description. The implementation is clean and logical, and the addition of comprehensive unit tests covering various scenarios is excellent. I have one suggestion to simplify the code for better maintainability, but overall, this is a solid contribution.

Comment on lines +149 to +158
```python
if len(function_response_events_indices) == 1:
  result_events.append(
      events[next(iter(function_response_events_indices))]
  )
else:  # Merge all async function_response as one response event
  result_events.append(
      _merge_function_response_events(
          [events[i] for i in sorted(function_response_events_indices)]
      )
  )
```
Contributor

Severity: medium

The _merge_function_response_events function is capable of handling a list with a single event, which makes the special case for len(function_response_events_indices) == 1 redundant. You can simplify this block by removing the if/else and just using the logic from the else branch for all cases where function_response_events_indices is not empty. This improves code maintainability by reducing complexity.

Suggested change:

```diff
-if len(function_response_events_indices) == 1:
-  result_events.append(
-      events[next(iter(function_response_events_indices))]
-  )
-else:  # Merge all async function_response as one response event
-  result_events.append(
-      _merge_function_response_events(
-          [events[i] for i in sorted(function_response_events_indices)]
-      )
-  )
+result_events.append(
+    _merge_function_response_events(
+        [events[i] for i in sorted(function_response_events_indices)]
+    )
+)
```

Long-running tools (e.g., human-in-the-loop) intentionally don't produce
immediate function_response events. They should not be treated as orphaned.

Changes:
- Check event.long_running_tool_ids before marking a call as orphaned
- Add tests for long-running tool exclusion
@donggyun112
Author

Overview

Fixed the crash loop issue (#3971) where missing function_response events were causing API errors.

Key Highlights

  • The Fix: We now detect orphaned function calls and inject a synthetic "interrupted" response. It’s a simple way to satisfy the API requirements without breaking the conversation flow.
  • Handling Long-running Tools: Added a check for long_running_tool_ids so we don't interfere with tools that are meant to stay open (see the sketch after this list).
  • Refactoring: Cleaned up some pyink formatting in the test files while I was at it (b3fb4c19).
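
Roughly, the long-running exclusion is a guard in front of the healing path; event.long_running_tool_ids is the attribute named in the commit message above, and the surrounding code is paraphrased:

```python
def _is_orphaned(call, event, call_id_to_response_idx) -> bool:
  """True only if no response exists and the tool isn't long-running."""
  if call.id in call_id_to_response_idx:
    return False  # A real function_response exists.
  if call.id in (event.long_running_tool_ids or set()):
    return False  # Human-in-the-loop etc.: the response arrives later.
  return True
```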

Testing & Limitations

The fix is covered by 14 tests now, including edge cases for partial healing.

One thing to note: these synthetic responses aren't saved to the session database yet—they're just injected during the LLM request. I figured it's better to keep this PR lean and address persistence in the next step.


Development

Successfully merging this pull request may close these issues.

Persistent crash loop caused by missing tool_result in conversation history after interrupted execution
