Managing Chat Memory in Quarkus Langchain4j


When I first started using Quarkus Langchain4j I ran into some issues because I didn’t fully understand how Quarkus managed chat memory with Langchain4j. So, here’s what I’m going to discuss:

  • How CDI bean scopes of AI Services affect chat memory
  • How chat memory is managed when @MemoryId is used as a parameter in Quarkus
  • How default chat memory id works in Quarkus
  • How chat memory can be leaked in some scenarios
  • How to write your own Default Memory Id Provider
  • How chat memory can affect your application

Chat Memory in Quarkus Langchain4j

Chat Memory is a history of the conversation you’ve had with an LLM. When using Quarkus Langchain4j (or plain Langchain4j, for that matter) this chat history is automatically sent to the LLM when you interact with it. Think of chat memory as a chat session: a list of messages sent to and from the LLM. Each chat history is referenced by a unique ID and stored in a chat memory store. How the store keeps the memory depends on how you’ve built your app; it could be held in memory or in a database, for instance.
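
Under the covers, the store is essentially a map from memory id to a list of messages. Here’s a minimal sketch of a custom in-memory store, assuming core LangChain4j’s ChatMemoryStore interface; the map-based class below is purely illustrative (Quarkus gives you an in-memory store out of the box, and it should pick up a CDI bean of this type as a replacement, but check the docs for your version):

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.store.memory.chat.ChatMemoryStore;
import jakarta.enterprise.context.ApplicationScoped;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: keyed by memory id, each entry is the full message list for that chat session.
@ApplicationScoped
public class MapChatMemoryStore implements ChatMemoryStore {

    private final Map<Object, List<ChatMessage>> store = new ConcurrentHashMap<>();

    @Override
    public List<ChatMessage> getMessages(Object memoryId) {
        List<ChatMessage> messages = store.get(memoryId);
        return messages != null ? messages : new ArrayList<>();
    }

    @Override
    public void updateMessages(Object memoryId, List<ChatMessage> messages) {
        store.put(memoryId, messages);
    }

    @Override
    public void deleteMessages(Object memoryId) {
        store.remove(memoryId);
    }
}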

@MemoryId

With Quarkus Langchain4j you either use the @MemoryId annotation on a parameter to identify the chat history to use, or you let Quarkus provide this identifier by default. Let’s look at @MemoryId first:

@RegisterAiService
public interface MyChat {
     @SystemMessage("You are a nice assistant")
     String chat(@UserMessage String msg, @MemoryId String id);
}

@RegisterAiService
public interface AnotherChat {
     @SystemMessage("You are a mean assistant")
     String chat(@UserMessage String msg, @MemoryId String id);
}

With @MemoryId, the application developer provides the chat memory identifier to use. The chat history is a concatenation of the messages from every AiService that used the same memory ID. For example:

@Inject
MyChat myChat;

@Inject 
AnotherChat anotherChat;

public void call() {
    String id = "1234";
    String my = myChat.chat("Hello!", id);
    String anotherResponse = anotherChat.chat("GoodBye", id);
}

There are a couple of things to think about when sharing a @MemoryId between different AiServices (prompts).

Shared Chat History

With the call to anotherChat.chat(), the chat history from the previous myChat.chat() call is also included because the same memory id is passed to both calls.

Only 1 SystemMessage per history

Another thing about running this code is that the original SystemMessage from MyChat is removed from chat history and a new SystemMessage from AnotherChat is added. Only one SystemMessage is allowed per history.
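
To make this concrete, after the two calls above the shared history for memory id “1234” contains roughly the following (paraphrased; the exact position of the single SystemMessage in the list depends on the chat memory implementation):

SystemMessage: "You are a mean assistant"   <-- MyChat's "nice assistant" message was replaced
UserMessage:   "Hello!"
AiMessage:     ...MyChat's reply...
UserMessage:   "GoodBye"
AiMessage:     ...AnotherChat's reply...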

Self Management of ID

The application developer is responsible for creating and managing the @MemoryId. You have to ensure the id is unique (easily done with something like a UUID), otherwise different chat sessions could corrupt each other. If chatting is a series of REST calls, then you’ll have to make sure the client passes this memory id along between HTTP invocations.
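
One common pattern is to mint a UUID on the first request and have the client echo it back on every subsequent call. Here’s a sketch of that idea; the resource class and the X-Memory-Id header name are made up for illustration:

import jakarta.inject.Inject;
import jakarta.ws.rs.HeaderParam;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.core.Response;
import java.util.UUID;

@Path("/chat")
public class ChatResource {

    @Inject
    MyChat myChat;

    @POST
    public Response chat(@HeaderParam("X-Memory-Id") String memoryId, String userMessage) {
        // First request: no memory id yet, so create one and return it so the
        // client can send it back on the next call.
        if (memoryId == null || memoryId.isBlank()) {
            memoryId = UUID.randomUUID().toString();
        }
        String answer = myChat.chat(userMessage, memoryId);
        return Response.ok(answer)
                .header("X-Memory-Id", memoryId)
                .build();
    }
}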

Sometimes LLMs are sensitive to what is in chat history. In the case above, the chat history has a mix of chat messages from two different prompts. It also loses the context of MyChat on the second call, as the MyChat system message is removed. Usually not a big deal, but every once in a while you might see your LLM get confused.

Default Memory Id

If a @MemoryId is not specified, then Quarkus Langchain4j decides what the memory id is.

package com.acme;

@RegisterAiService
public interface MyChat {
    String chat(@UserMessage String msg);
}

In vanilla, standalone Langchain4j, the default memory id is “default”. If you’re using Langchain4j on its own, then you should not use default memory ids in multi-user/multi-session applications as chat history will be completely corrupted.

Quarkus Langchain4j does something different. A unique id is generated per CDI request scope (the request scope being the HTTP invocation, Kafka invocation, etc.). The interface and method name of the AI service are then tacked onto the end of this id, separated by a “#” character. In other words, the format of the default memory id is:

<random-per-request-id>#<fully qualified interface name>.<method-name>

So, for the above Java code, the default memory id for MyChat.chat would be:

@2342351#com.acme.MyChat.chat

There are a couple of things to think about with this default Quarkus implementation.

Default Memory Id is tied to the request scope

Since the default id is generated as a unique id tied to the request scope, when your HTTP invocation finishes and you invoke an AI service again, a different default memory id will be used and you’ll have a completely new chat history.

Different chat history per AI Service method

Since the default id incorporates the AI service interface and method name, there is a different chat history per AI service method, and, unlike the example in the @MemoryId section, chat history is not shared between prompts.

Using the Websocket extension gives you per session chat histories

If you use the websocket integration to implement your chat, then the default id is unique per websocket session instead of per request. This means that the default memory id is retained and meaningful for the entire chat session, and you’ll keep chat history between remote chat requests. The AI service interface name and method are still appended to the default memory id, though!
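
For example, assuming the quarkus-websockets-next extension, something like the sketch below keeps one chat history per connected client (the endpoint class and path are mine; see the Quarkus Langchain4j websocket docs for the exact setup). It reuses the MyChat service from above, which has no @MemoryId parameter and so relies on the default memory id:

import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;
import jakarta.inject.Inject;

@WebSocket(path = "/chatbot")
public class ChatBotWebSocket {

    // AI service with no @MemoryId parameter, relying on the default memory id
    @Inject
    MyChat bot;

    @OnTextMessage
    public String onMessage(String message) {
        // Each websocket connection gets its own default memory id, so each
        // connected client keeps its own chat history for the life of the socket.
        return bot.chat(message);
    }
}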

Default memory ids vs. using @MemoryId

So what should you use? Default memory ids or @MemoryId? If you have a remote chat app where user interactions happen across remote requests (i.e. HTTP/REST), then you should only use default memory ids for prompts that don’t want or need a chat history. In other words, only use default ids if the prompt doesn’t need chat memory. If you need chat history between remote requests, then you’ll need to use @MemoryId and manage the ids yourself.

The Websocket extension flips this. When using the WebSocket extension, since the default memory id is generated per websocket connection, you get a real session, and default memory ids are wonderful as you don’t have to manage memory ids in your application.

Memory Lifecycle tied to CDI bean scope

AI services in Quarkus Langchain4j are CDI beans. If you do not specify a scope for the bean, it defaults to @RequestScoped. When a bean goes out of scope and is destroyed, an interesting thing happens: any memory id referenced by the bean is wiped from the chat memory store and is gone forever. ANY memory id: the default memory id or any id provided by @MemoryId parameters.

@RegisterAiService
@ApplicationScoped
public interface AppChat {
      String chat(@UserMessage String msg, @MemoryId String id);
}

@RegisterAiService
@SessionScoped
public interface SessionChat {
      String chat(@UserMessage String msg, @MemoryId String id);
}

@RegisterAiService
@RequestScoped
public interface RequestChat {
     String chat(@UserMessage String msg, @MemoryId String id);
}

So, for the above code, any memory referenced by the id parameter of RequestChat.chat() will be wiped at the end of the request scope (i.e. the HTTP request). For SessionChat, memory is wiped when the CDI session is destroyed, and for AppChat, when the application shuts down.

Memory tied to the smallest scope used

So, what if, within the same REST invocation, you use the same memory id with two of the AI services above?

@Inject AppChat app;
@Inject RequestChat req;

@GET
public String restCall() {
     String memoryId = "1234";
     app.chat("hello", memoryId);
     return req.chat("goodbye", memoryId);
}

So, in the restCall() method, even though AppChat is application scoped, since RequestChat uses the same memory id, “1234”, the chat history will be wiped from the chat memory store at the end of the REST request.

Default memory id can cause a leak

If you are relying on default memory ids and your AI service has a scope other than @RequestScoped, then you will leak chat memory, and it will grow to the limits of the memory store. For example:

@ApplicationScoped
@RegisterAiService
public interface AppChat {
     String chat(@UserMessage String msg);
}

Quarkus’s default memory id is generated fresh for each request scope, so every time AppChat.chat() is called from a different request scope a new memory id (and a new chat history) is created. Because AppChat is application scoped, those histories are never cleaned up, and chat memory entries in the chat memory store will grow until the application shuts down.

Never use @ApplicationScoped with default ids

So, the moral of the story is: never use @ApplicationScoped with your AI services if you’re relying on default ids. If you are using the websocket extension, then you can use @SessionScoped, but otherwise make sure your AI services are @RequestScoped.

What bean scopes should you use?

For REST-based chat applications:

  • Use the combination of @ApplicationScoped and @MemoryId parameters to provide a chat history in between requests (see the sketch after this list)
  • Use @RequestScoped and default memory ids for prompts that don’t need a chat history
  • Do not share the same memory ids between @ApplicationScoped and @RequestScoped ai services
  • If using the Websocket extension, then use @SessionScoped on your ai services that require a chat history.
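
Putting the first two bullets together, a REST-based setup might look like the following sketch (the interface names are mine, not from the Quarkus docs):

import dev.langchain4j.service.MemoryId;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.context.RequestScoped;

// Keeps chat history across requests: application scoped, the caller supplies the memory id.
@RegisterAiService
@ApplicationScoped
public interface ConversationChat {
    String chat(@UserMessage String msg, @MemoryId String memoryId);
}

// One-shot prompt that needs no history: request scoped with the default memory id,
// so its memory is wiped as soon as the request ends.
@RegisterAiService
@RequestScoped
interface SummaryPrompt {
    @SystemMessage("Summarize the following text in one sentence")
    String summarize(@UserMessage String text);
}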

Chat Memory and your LLM

So, hopefully you understand how chat memory works with Quarkus Langchain4j now. Just remember:

  • Chat history is sent to your LLM with each request.
  • Limiting chat history can speed up LLM interactions and cost you less money!
  • Limiting chat history can focus your LLM.

All discussions for another blog! Cheers.

LangChain4j: Using IMMEDIATE with tools is a great performance boost


TLDR;

Reduce callbacks to your LLM by immediately returning from tool invocations. This can greatly improve performance, give you fine-grained control of your tool invocations, and even save you money. Check out the docs for more information, or read on.

Intro

In my Baldur’s Forge chat application, which I talked about in my last blog, I found I had a number of cases where I just wanted the LLM to understand a chat user message and route it to the appropriate tool method. I didn’t care what the LLM’s response was after a tool was invoked. I had a few different scenarios:

  • I wanted the LLM to just execute a task and return and not analyze any response
  • The tool itself might use my chat application’s architecture to take over rendering a response to the client, and I didn’t care about or want the LLM’s response to the tool invocation
  • I was just using the LLM to route the user to a different prompt chat conversation.

When you have one of those scenarios, the interaction with the LLM can be quite slow. Why? What’s happening? This is the flow of a tool invocation:

  1. LLM client sends the user message, chat history, and a json document describing what tools are available to the LLM.
  2. The LLM interprets the user message, chat history, and the tool description json and decides what to do. If the LLM thinks that it needs to invoke a tool, it responds to the client with a list of tools that should be invoked and the parameters to pass to the tool.
  3. The LLM client sees that the LLM wants to invoke a set of tools. It invokes them, then sends another message back to the LLM with the chat history and a json document describing the tool responses.
  4. The LLM looks at the tool responses and chat history, then decides whether to respond with more tool requests or to answer the chat.

If you don’t care what the LLM’s response is after invoking a tool, steps #3 and #4 can be quite expensive, both in latency and, if you’re using Open AI or a commercial LLM, in money. The LLM has to do a lot of work in this second exchange: it has to understand the entire chat history as well as the tool responses that the client sent it. This can be quite time consuming, and even with my little app it would add seconds to an interaction depending on how busy Open AI was that day.

@Tool(returnBehavior = ReturnBehavior.IMMEDIATE)

The @Tool annotation attribute returnBehavior can help you out with this. Let’s elaborate on what the LangChain4j IMMEDIATE docs say, with a Quarkus LangChain4j spin.

@ApplicationScoped
public class CalculatorWithImmediateReturn {
    @Tool(returnBehavior = ReturnBehavior.IMMEDIATE)
    double add(int a, int b) {
        return a + b;
    }
}

@RegisterAiService
interface Assistant {
    @ToolBox({CalculatorWithImmediateReturn.class})
    Result<String> chat(String userMessage);
}

If and only if the call to Assistant.chat() triggers only IMMEDIATE tool calls, the chat() method will return immediately after one or more tool methods are invoked. If there are any tool calls that are not IMMEDIATE then all the tool responses triggered will be sent back to the LLM for processing (Steps #3 and #4).

Here’s an example of calling our chat service:

Result<String> result = assistant.chat("What is the value of 11 + 31?");

if (result.content() != null) { // There were no tool calls.  LLM may want more info from user
    System.out.println(result.content());
    System.exit(0);
}

double val = (Double)result.toolExecutions().get(0).resultObject();

System.out.println("The return value from the Java tool method was: " + val);

String jsonVal = result.toolExecutions().get(0).result();

System.out.println("The json representation of the Java tool method response was: " + jsonVal);

If no tool calls were invoked, or the LLM decided to invoke a tool that wasn’t IMMEDIATE, the full tool chat exchange occurs (Steps #3 and #4) and result.content() will return a non-null value. If result.content() is null, then we know that there may have been tool invocations.

The result.toolExecutions() method returns a list of tool execution objects from which you can get the tool name, the Java result from the tool method call, and the tool response as a json string representation.

If you want to see a more concrete example of using this feature, check out the class MainMenuChatFrame from my demo chat application. This class invokes an AiService whose tools are all IMMEDIATE. Notice that in some instances, IMMEDIATE gives me really fine-grained control over things, and tools can do things like ask for chat memory to be cleared.

Here’s another example in EquipmentBuilder. This prompt has a mixture of IMMEDIATE and traditional tool methods that can be called. Building a piece of equipment requires full interaction with the LLM, while informational queries trigger tools that return immediately.
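
As a rough illustration of that kind of mix (hypothetical tool methods in the style of the calculator example above, not the actual EquipmentBuilder code):

@ApplicationScoped
public class EquipmentTools {

    // Informational query: we don't need the LLM to comment on the answer,
    // so return straight to the caller and skip the second LLM round trip.
    @Tool(returnBehavior = ReturnBehavior.IMMEDIATE)
    public String listKnownEquipment() {
        return "longsword, shield, chain mail";
    }

    // Building equipment needs the LLM to keep reasoning over the tool result,
    // so this one uses the default behavior and the response goes back to the LLM.
    @Tool
    public String buildEquipment(String name, String description) {
        return "created " + name;
    }
}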

NOTE: Quarkus Langchain4j will not support this feature of LangChain4j until the next release.