TL;DR

Reduce callbacks to your LLM by returning immediately from tool invocations. This can greatly improve performance, give you fine-grained control over your tool invocations, and even save you money. Check out the docs for more information, or read on.

Intro

In the Baldur’s Forge chat application I talked about in my last blog post, I found a number of cases where I just wanted the LLM to understand a user’s chat message and route it to the appropriate tool method. I didn’t care what the LLM’s response was after the tool was invoked. I had a few different scenarios (a hypothetical routing tool is sketched after this list):

  • I wanted the LLM to just execute a task and return, without analyzing the tool’s response.
  • The tool itself might use my chat application’s architecture to take over rendering a response to the client, so I neither needed nor wanted the LLM’s response to the tool invocation.
  • I was just using the LLM to route the user to a different prompt chat conversation.
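
For instance, a pure routing tool can look like the hypothetical sketch below. None of these names come from my app; ChatFrameNavigator and navigateTo() are made up for illustration:

@ApplicationScoped
public class RoutingTools {

    @Inject
    ChatFrameNavigator navigator; // hypothetical application service

    @Tool("Switch the user to the equipment builder conversation")
    void openEquipmentBuilder() {
        // The tool does all the routing work; whatever the LLM would say
        // about this tool's response is of no interest to us
        navigator.navigateTo("equipment-builder");
    }
}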

When you have one of these scenarios, the interaction with the LLM can be quite slow. Why? What’s happening? This is the flow of a tool invocation (a rough code sketch follows the list):

  1. The LLM client sends the user message, chat history, and a JSON document describing what tools are available to the LLM.
  2. The LLM interprets the user message, chat history, and the tool description JSON and decides what to do. If the LLM thinks it needs to invoke a tool, it responds to the client with a list of tools that should be invoked and the parameters to pass to each one.
  3. The LLM client sees that the LLM wants to invoke a set of tools. It invokes them, then sends another message back to the LLM with the chat history and a JSON document describing the tool responses.
  4. The LLM looks at the tool responses and chat history, then decides whether to respond with more tool requests or answer the chat.
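
In code, the client side of that exchange looks roughly like this. This is a hypothetical sketch: llm.chat() and executeTool() are made-up helpers and the real LangChain4j internals differ, but the shape of Steps #1 through #4 is the same:

ChatResponse response = llm.chat(history, toolSpecifications);      // Steps #1 and #2
while (response.aiMessage().hasToolExecutionRequests()) {
    history.add(response.aiMessage());                              // the tool call request becomes part of the history
    for (ToolExecutionRequest request : response.aiMessage().toolExecutionRequests()) {
        String toolResult = executeTool(request);                   // run the matching Java tool method
        history.add(ToolExecutionResultMessage.from(request, toolResult));
    }
    response = llm.chat(history, toolSpecifications);               // Steps #3 and #4
}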

If you don’t care what the LLM’s response is after invoking a tool, Steps #3 and #4 can be quite expensive, both in latency and, if you’re using OpenAI or another commercial LLM, in money. The LLM has to do a lot of work in Step #4: it has to understand the entire chat history as well as the tool responses that the client sent it. This can be quite time consuming, and even with my little app it would add seconds to an interaction, depending on how busy OpenAI was that day.

@Tool(returnBehavior = ReturnBehavior.IMMEDIATE)

The @Tool annotation attribute returnBehavior can help you out here. Let’s elaborate on what the LangChain4j IMMEDIATE docs say, with a Quarkus LangChain4j spin.

import dev.langchain4j.agent.tool.ReturnBehavior;
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.service.Result;
import io.quarkiverse.langchain4j.RegisterAiService;
import io.quarkiverse.langchain4j.ToolBox;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class CalculatorWithImmediateReturn {

    // IMMEDIATE: the tool's return value goes straight back to the caller,
    // skipping the follow-up round trip to the LLM
    @Tool(returnBehavior = ReturnBehavior.IMMEDIATE)
    double add(int a, int b) {
        return a + b;
    }
}

@RegisterAiService
interface Assistant {

    @ToolBox(CalculatorWithImmediateReturn.class)
    Result<String> chat(String userMessage);
}

If, and only if, the call to Assistant.chat() triggers only IMMEDIATE tool calls, the chat() method will return immediately after one or more tool methods are invoked. If there are any tool calls that are not IMMEDIATE, then all the triggered tool responses will be sent back to the LLM for processing (Steps #3 and #4).

Here’s an example of calling our chat service:

Result<String> result = assistant.chat("What is the value of 11 + 31?");

if (result.content() != null) { // No IMMEDIATE tool calls were made; the LLM may want more info from the user
    System.out.println(result.content());
    System.exit(0);
}

// The raw Java return value of the tool method
double val = (Double) result.toolExecutions().get(0).resultObject();

System.out.println("The return value from the Java tool method was: " + val);

// The same response as the JSON string that would have been sent to the LLM
String jsonVal = result.toolExecutions().get(0).result();

System.out.println("The JSON representation of the Java tool method response was: " + jsonVal);

If no tool calls were invoked, or the LLM decided to invoke a tool that wasn’t IMMEDIATE, the full tool chat exchange occurs (Steps #3 and #4) and result.content() returns a non-null value. If result.content() is null, then we know there were IMMEDIATE tool invocations to inspect.

The result.toolExecutions() method returns a list of tool execution objects from which you can get the tool name, the Java object returned by the tool method, and the tool response as a JSON string.
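
For example, using LangChain4j’s ToolExecution API, where each execution exposes the LLM’s original request:

for (ToolExecution execution : result.toolExecutions()) {
    // request() carries the LLM's original tool call: the tool name and its JSON arguments
    System.out.println("Tool invoked: " + execution.request().name()
            + " with arguments " + execution.request().arguments());
}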

If you want to see a more concrete example of this feature in use, check out the MainMenuChatFrame class from my demo chat application. This class invokes an AiService whose tools are all IMMEDIATE. Notice that in some instances IMMEDIATE gives me really fine-grained control over things, and tools can do things like ask for chat memory to be cleared.
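
As a hypothetical sketch of that kind of control (illustrative names, not the actual MainMenuChatFrame code), an IMMEDIATE tool can wipe the conversation’s memory and hand control straight back to the application:

@ApplicationScoped
public class MenuTools {

    ChatMemory chatMemory; // however your app exposes its chat memory; this field is illustrative

    @Tool(returnBehavior = ReturnBehavior.IMMEDIATE)
    String startOver() {
        chatMemory.clear(); // wipe the conversation history
        return "Conversation restarted";
    }
}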

Here’s another example in EquipmentBuilder. This prompt has a mixture of IMMEDIATE and traditional tool methods that can be called: building a piece of equipment requires full interaction with the LLM, while informational queries trigger tools that return immediately.
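
A hypothetical shape for that mixture (illustrative names, not the actual EquipmentBuilder code) looks like this:

@ApplicationScoped
public class EquipmentTools {

    @Tool // traditional: the response goes back to the LLM (Steps #3 and #4)
    String buildEquipment(String name, String description) {
        // ... full LLM-driven build flow
        return "Built " + name;
    }

    @Tool(returnBehavior = ReturnBehavior.IMMEDIATE) // informational: returned straight to the caller
    List<String> listEquipmentSlots() {
        return List.of("head", "chest", "hands", "feet");
    }
}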

NOTE: Quarkus LangChain4j will not support this LangChain4j feature until its next release.