
@donggyun112

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable): #3971

2. Or, if no issue exists, describe the change:

Problem:

When execution is interrupted (e.g., server restart, browser refresh, connection loss) after a function_call but before the function_response is saved, the session becomes permanently unrecoverable. Anthropic and OpenAI APIs require tool_calls to be immediately followed by tool_results, so subsequent requests fail with 400 BadRequest, creating a crash loop.

Solution:

Detect orphaned function_calls (calls without matching responses) during content processing and inject synthetic error responses to gracefully recover the session.
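
For concreteness, a minimal before/after sketch of the shapes involved, using plain dicts as stand-ins for the ADK's actual event and content types:

```python
# History as persisted at the moment of interruption: the model's
# function_call was saved, but the matching function_response never was.
broken_history = [
    {"role": "user", "parts": [{"text": "What's the weather in Paris?"}]},
    {
        "role": "model",
        "parts": [{
            "function_call": {
                "id": "call_123",  # illustrative ID
                "name": "get_weather",
                "args": {"city": "Paris"},
            }
        }],
    },
    # <-- interruption: no function_response for call_123 was ever saved
]

# Synthetic response injected at request time so that strict providers
# (Anthropic, OpenAI) accept the history again and the session can continue.
synthetic_response = {
    "role": "user",
    "parts": [{
        "function_response": {
            "id": "call_123",
            "name": "get_weather",
            "response": {"error": "Tool execution was interrupted."},
        }
    }],
}
```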

Why this approach:

Two approaches were considered:

| Approach | Description | Pros | Cons |
| --- | --- | --- | --- |
| 1. Separation of Concerns | Separate `_find_orphaned_function_calls()` + `_create_synthetic_response_event()` functions, called after `_rearrange_events_for_async_function_responses_in_history()` | Clear separation of responsibilities; easy to test independently; self-documenting code | Extra O(N) event iteration; duplicates ID-mapping logic already in the rearrange function |
| 2. Single-Pass Integration | Extend `_rearrange_events_for_async_function_responses_in_history()` with a `heal_orphaned_calls` parameter; detect orphaned calls during the existing loop | Reuses the existing `function_call_id_to_response_events_index` mapping; no duplicate iteration; better performance | Slightly increases function complexity; mixes responsibilities |

Decision: Chose Approach 2, sketched after this list, for the following reasons:

  1. The existing _rearrange_events_for_async_function_responses_in_history() already builds a function_call_id_to_response_events_index mapping; reusing it avoids redundant work
  2. Orphaned-call detection is logically part of the "rearrangement" process (pairing calls with responses)
  3. Avoids an extra O(N) iteration over events
  4. The heal_orphaned_calls=False default maintains backward compatibility
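
A minimal sketch of what Approach 2 looks like in code, assuming the rearrange function indexes responses by call ID as described above. `_index_responses_by_call_id` and `_get_function_calls` are hypothetical stand-ins for the existing internals; `_create_synthetic_response_for_orphaned_calls` is the helper this PR adds:

```python
import logging

logger = logging.getLogger(__name__)


def _rearrange_events_for_async_function_responses_in_history(
    events, heal_orphaned_calls=False
):
  # Mapping built once by the existing rearrangement pass
  # (hypothetical helper name).
  call_id_to_response_idx = _index_responses_by_call_id(events)

  result_events = []
  for event in events:
    result_events.append(event)
    for call in _get_function_calls(event):  # hypothetical helper name
      if call.id in call_id_to_response_idx:
        continue  # Normal case: a real response exists and is rearranged.
      if heal_orphaned_calls:
        logger.warning(
            "Auto-healing orphaned function_call (id=%s, name=%s)",
            call.id,
            call.name,
        )
        result_events.append(
            _create_synthetic_response_for_orphaned_calls(event, [call])
        )
  return result_events
```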

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.
$ uv run pytest tests/unittests/flows/llm_flows/test_contents_function.py -v
# 12 passed (7 existing + 5 new)

New test cases (one is sketched after this list):

  • test_auto_healing_single_orphaned_function_call - single orphaned call
  • test_auto_healing_multiple_orphaned_function_calls - multiple orphaned calls in one event
  • test_auto_healing_partial_orphaned_function_calls - mix of completed and orphaned calls
  • test_auto_healing_no_healing_when_responses_exist - no false positives
  • test_auto_healing_logs_warning - warning log verification
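
For a sense of shape, a hedged sketch of the single-orphan case; the event constructors and entry point here are paraphrased stand-ins, while the actual tests go through the ADK's testing_utils:

```python
import pytest


@pytest.mark.asyncio
async def test_auto_healing_single_orphaned_function_call():
  # One model event carrying a function_call with no matching response.
  events = [
      make_user_event("check the weather"),  # hypothetical helper
      make_function_call_event(call_id="call_1", name="get_weather"),
  ]

  healed = rearrange_with_healing(events)  # stands in for the real call

  # A synthetic function_response should directly follow the orphaned call.
  response = get_function_responses(healed[-1])[0]  # hypothetical helper
  assert response.id == "call_1"
  assert response.response == {"error": "Tool execution was interrupted."}
```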

Manual End-to-End (E2E) Tests:

Reproduced the issue using a test script that sends broken message history (tool_call without tool_result) to Anthropic/OpenAI/Gemini APIs. Before the fix, all non-Gemini models returned 400 BadRequest. After the fix, the synthetic error response allows the session to continue.
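
The repro boils down to replaying a history that violates the tool_use/tool_result pairing rule. A sketch of the Anthropic variant (model name, IDs, and tool schema are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

try:
    client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=256,
        tools=[{
            "name": "get_weather",
            "description": "Get the weather for a city.",
            "input_schema": {"type": "object", "properties": {}},
        }],
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            {
                "role": "assistant",
                "content": [{
                    "type": "tool_use",
                    "id": "toolu_01",
                    "name": "get_weather",
                    "input": {"city": "Paris"},
                }],
            },
            # Missing: a user turn carrying a tool_result for toolu_01.
            {"role": "user", "content": "continue"},
        ],
    )
except anthropic.BadRequestError as err:
    # Before the fix, this is the crash loop: every retry replays the
    # same broken history and gets the same 400 back.
    print(err)
```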

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Changes:

| File | Description |
| --- | --- |
| `src/google/adk/flows/llm_flows/contents.py` | Add `_create_synthetic_response_for_orphaned_calls()` helper; extend `_rearrange_events_for_async_function_responses_in_history()` with a `heal_orphaned_calls` parameter |
| `tests/unittests/flows/llm_flows/test_contents_function.py` | Add 5 test cases for auto-healing behavior |

Key implementation details:

  1. Synthetic response format: {'error': 'Tool execution was interrupted.'} (follows existing error response pattern in codebase)
  2. Warning log: Auto-healing orphaned function_call (id=..., name=...) for debugging/monitoring
  3. Location: As specified by maintainer, detection occurs around line 445 in _get_contents()
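
Putting items 1 and 2 together, a hedged sketch of the helper; the constant and helper names come from the commit message below, and author='user' reflects the review feedback, but the exact field plumbing in the PR may differ:

```python
from google.adk.events import Event
from google.genai import types

_ORPHANED_CALL_ERROR_RESPONSE = {"error": "Tool execution was interrupted."}


def _create_synthetic_response_for_orphaned_calls(event, orphaned_calls):
  """Builds one function_response event answering each orphaned call."""
  parts = [
      types.Part(
          function_response=types.FunctionResponse(
              id=call.id,
              name=call.name,
              response=_ORPHANED_CALL_ERROR_RESPONSE,
          )
      )
      for call in orphaned_calls
  ]
  return Event(
      invocation_id=event.invocation_id,
      author="user",  # function responses are authored by 'user' in the ADK
      content=types.Content(role="user", parts=parts),
  )
```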

Known Limitations & Future Work:

  1. Synthetic responses not persisted to session: Generated at LLM request time only, not saved to session storage, so UI/logs/telemetry may still show orphaned calls as "pending". Future consideration: should synthetic events be persisted? This requires a policy decision affecting session history integrity, replay scenarios, and multi-client sync.

  2. Repeated warning logs: logger.warning() is emitted each time _get_contents() processes an orphaned call, so if the session resumes multiple times before progressing, the same warning repeats. Future options: persist synthetic responses, deduplicate by call ID, or demote to logger.info() after the first occurrence (one possible shape is sketched below).
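
If deduplication is pursued later, one possible shape is a process-local seen-set keyed by call ID (purely illustrative; not part of this PR):

```python
import logging

logger = logging.getLogger(__name__)
_healed_call_ids: set[str] = set()  # process-local; resets on restart


def _log_auto_heal(call_id: str, name: str) -> None:
  """Warns once per orphaned call ID, then demotes repeats to info."""
  level = logging.INFO if call_id in _healed_call_ids else logging.WARNING
  _healed_call_ids.add(call_id)
  logger.log(
      level,
      "Auto-healing orphaned function_call (id=%s, name=%s)",
      call_id,
      name,
  )
```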

These are intentionally left for future PRs to keep this fix focused and minimal.

@gemini-code-assist
Contributor

Summary of Changes

Hello @donggyun112, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical stability issue where an interrupted LLM session, specifically after a function call without a subsequent response, would lead to a permanent crash loop. By implementing an intelligent auto-healing process, the system now identifies these incomplete interactions and inserts a standardized error response, enabling the session to recover and proceed without manual intervention or API failures. This significantly improves the robustness and user experience of the LLM interaction flow.

Highlights

  • Crash Loop Prevention: This PR introduces an auto-healing mechanism to prevent crash loops that occur when an LLM session is interrupted after a function_call but before its corresponding function_response is saved. This issue previously led to unrecoverable sessions and 400 BadRequest errors from APIs like Anthropic and OpenAI.
  • Synthetic Error Responses: The solution detects these 'orphaned' function calls during content processing and injects synthetic error responses (e.g., {'error': 'Tool execution was interrupted.'}) into the event history. This allows the session to gracefully recover and continue.
  • Efficient Integration: The auto-healing logic is integrated into the existing _rearrange_events_for_async_function_responses_in_history() function, reusing its internal ID mapping and avoiding redundant event iterations for better performance.
  • Comprehensive Testing: Five new unit tests have been added to cover various scenarios, including single, multiple, and partial orphaned function calls, ensuring the healing mechanism works as expected and does not interfere with correctly handled calls. Warning log verification is also included.


@adk-bot adk-bot added the core [Component] This issue is related to the core interface and implementation label Dec 31, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This is an excellent pull request that addresses a critical crash loop issue with orphaned function calls. The solution is well-designed, efficiently integrated into the existing event processing logic, and accompanied by a comprehensive set of unit tests that cover various scenarios. I have one high-severity suggestion to ensure the synthetic event's authorship is consistent with the rest of the system, which will prevent potential issues in multi-agent scenarios. Overall, great work on this important fix.


```python
return Event(
    invocation_id=event.invocation_id,
    author=event.author,
```
Contributor

Severity: high

The synthetic event for the orphaned function call response should have 'user' as its author. Currently, it inherits the author from the function call event, which is the agent's name. In the ADK, function response events are consistently authored by the 'user'. Using the agent's name as the author could lead to incorrect processing in multi-agent scenarios, where it might be misinterpreted as a message from another agent. Setting the author to 'user' ensures it's treated as a direct response to the tool call, maintaining consistency with the framework's event structure.

Suggested change:

```diff
-    author=event.author,
+    author='user',
```

Contributor

@gemini-code-assist bot left a comment

Code Review

This is an excellent pull request that addresses a critical crash loop issue with a well-designed and efficient solution. The decision to integrate the healing logic into the existing event rearrangement function to avoid redundant iterations is a smart optimization. The code is clear, well-commented, and the addition of comprehensive unit tests ensures the fix is robust and covers various edge cases. The detailed pull request description is also exemplary and was very helpful for the review.


```python
return Event(
    invocation_id=event.invocation_id,
    author=event.author,
```
Contributor

Severity: high

The synthetic event is being created with the author of the original function_call event (an agent), but the content's role is set to 'user'. This is inconsistent. To align with how other function_response events are structured in the codebase (where they typically have author='user'), the author of this synthetic event should also be 'user' to match the content's role.

Suggested change:

```diff
-    author=event.author,
+    author='user',
```

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust fix to prevent crash loops caused by orphaned function_call events. The solution, which involves injecting synthetic error responses, is well-implemented. The decision to integrate this logic into an existing function for performance reasons is well-justified in the pull request description. The code is clean, and the accompanying unit tests are comprehensive, covering a wide range of scenarios. I have one suggestion regarding the new test code to improve its structure and reduce duplication. Overall, this is a high-quality contribution that effectively addresses the issue.

Comment on lines +610 to +614
```python
agent = Agent(model="gemini-2.5-flash", name="test_agent")
llm_request = LlmRequest(model="gemini-2.5-flash")
invocation_context = await testing_utils.create_invocation_context(
    agent=agent
)
```
Contributor

Severity: medium

This setup block is repeated across all the new test_auto_healing_* tests. To improve maintainability and reduce code duplication, consider extracting this common setup into a pytest fixture.

For example, you could define a fixture that yields the agent, llm_request, and invocation_context:

```python
@pytest.fixture
async def healing_test_setup():
    """Provides a common setup for auto-healing tests."""
    agent = Agent(model="gemini-2.5-flash", name="test_agent")
    llm_request = LlmRequest(model="gemini-2.5-flash")
    invocation_context = await testing_utils.create_invocation_context(
        agent=agent
    )
    yield agent, llm_request, invocation_context
```

Each test could then accept this fixture, making the test suite cleaner and easier to manage:

```python
@pytest.mark.asyncio
async def test_auto_healing_single_orphaned_function_call(healing_test_setup):
    agent, llm_request, invocation_context = healing_test_setup
    # ... rest of the test logic
```

@ryanaiagent ryanaiagent self-assigned this Jan 1, 2026
@ryanaiagent
Collaborator

Hi @donggyun112, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Could you please fix the lint errors so we can proceed with the review? You can use autoformat.sh.

@ryanaiagent ryanaiagent added the request clarification [Status] The maintainer need clarification or more information from the author label Jan 2, 2026
@donggyun112 donggyun112 force-pushed the fix/orphaned-tool-calls-crash-loop branch from 228c741 to a52aba8 Compare January 2, 2026 06:14
@donggyun112
Author

@ryanaiagent Thank you for reviewing my PR. The lint errors have been fixed.

Note: Known limitations (synthetic responses not persisted, repeated warning logs) are documented in the PR description and intentionally left for future PRs to keep this fix focused.

When execution is interrupted (e.g., server restart or connection loss)
after a function_call but before the function_response is saved, the
session becomes unrecoverable because Anthropic/OpenAI require tool_calls
to be immediately followed by tool_results.

This change detects orphaned function_calls and injects synthetic error
responses to gracefully recover the session.

Changes:
- Add _ORPHANED_CALL_ERROR_RESPONSE constant for error responses
- Add _create_synthetic_response_for_orphaned_calls helper function
- Detect orphaned calls in _rearrange_events_for_async_function_responses_in_history
- Use 'user' as author for synthetic function response events
- Add 5 comprehensive test cases for auto-healing behavior

Fixes google#3971
@donggyun112 donggyun112 force-pushed the fix/orphaned-tool-calls-crash-loop branch from a52aba8 to d3ee2b7 Compare January 2, 2026 11:42
@donggyun112
Author

/gemini review

Contributor

@gemini-code-assist bot left a comment

Code Review

This is a high-quality pull request that effectively addresses a critical issue causing crash loops with orphaned function calls. The problem, solution, and chosen approach are all clearly articulated in the description. The implementation is clean and logical, and the addition of comprehensive unit tests covering various scenarios is excellent. I have one suggestion to simplify the code for better maintainability, but overall, this is a solid contribution.

Comment on lines +149 to +158
```python
if len(function_response_events_indices) == 1:
  result_events.append(
      events[next(iter(function_response_events_indices))]
  )
else:  # Merge all async function_response as one response event
  result_events.append(
      _merge_function_response_events(
          [events[i] for i in sorted(function_response_events_indices)]
      )
  )
```
Contributor

Severity: medium

The _merge_function_response_events function is capable of handling a list with a single event, which makes the special case for len(function_response_events_indices) == 1 redundant. You can simplify this block by removing the if/else and just using the logic from the else branch for all cases where function_response_events_indices is not empty. This improves code maintainability by reducing complexity.

Suggested change:

```diff
-if len(function_response_events_indices) == 1:
-  result_events.append(
-      events[next(iter(function_response_events_indices))]
-  )
-else:  # Merge all async function_response as one response event
-  result_events.append(
-      _merge_function_response_events(
-          [events[i] for i in sorted(function_response_events_indices)]
-      )
-  )
+result_events.append(
+    _merge_function_response_events(
+        [events[i] for i in sorted(function_response_events_indices)]
+    )
+)
```

Long-running tools (e.g., human-in-the-loop) intentionally don't produce
immediate function_response events. They should not be treated as orphaned.

Changes:
- Check event.long_running_tool_ids before marking a call as orphaned
- Add tests for long-running tool exclusion
@donggyun112
Author

Overview

Fixed the crash loop issue (#3971) where missing function_response events were causing API errors.

Key Highlights

  • The Fix: We now detect orphaned function calls and inject a synthetic "interrupted" response. It’s a simple way to satisfy the API requirements without breaking the conversation flow.
  • Handling Long-running Tools: Added a check for long_running_tool_ids so we don't interfere with tools that are meant to stay open (see the sketch after this list).
  • Refactoring: Cleaned up some pyink formatting in the test files while I was at it (b3fb4c19).
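
Roughly, the long-running exclusion is a guard in front of the healing path; event.long_running_tool_ids is the attribute named in the commit message above, and the surrounding code is paraphrased:

```python
def _is_orphaned(call, event, call_id_to_response_idx) -> bool:
  """True only if no response exists and the tool isn't long-running."""
  if call.id in call_id_to_response_idx:
    return False  # A real function_response exists.
  if call.id in (event.long_running_tool_ids or set()):
    return False  # Human-in-the-loop etc.: the response arrives later.
  return True
```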

Testing & Limitations

The fix is covered by 14 tests now, including edge cases for partial healing.

One thing to note: these synthetic responses aren't saved to the session database yet—they're just injected during the LLM request. I figured it's better to keep this PR lean and address persistence in the next step.


Development

Successfully merging this pull request may close these issues.

Persistent crash loop caused by missing tool_result in conversation history after interrupted execution
