WORK IN PROGRESS
Preamble
This article is aimed at intermediate-level LLM users & people who are generally comfortable with what SillyTavern is and how it achieves its goals.
If you haven't already spent some time with both the software and LLMs in general then certain aspects of this article might be a bit fiddly and hard to follow. I would suggest you get a decent handle on the following topics before tackling this:
1) LLM prompting - what it is, the bare basics of how an LLM interprets the text we send, and how these aspects relate to the way SillyTavern works.
2) SillyTavern Lorebooks - many of the goals of this article can also be achieved via cunning use of the Lorebook system.
What Does This Article Get Me?
If you follow this article all the way to the end you will have a system that allows you to save an arbitrarily large number of memories against a character card and contextually pull out the most relevant ones every time you send a message to an LLM using SillyTavern.
You will need the following things to make this work:
- SillyTavern (obviously)
- Ollama
- About 1 hour of free time
This article isn't just a step-by-step guide on how to make this work. I am going to try to explain all the concepts we will be using along the way. If you want to skip directly to the tutorial you can, but I find the mechanics behind this stuff absolutely fascinating and I recommend you at least skim-read the explainers!
What Is Retrieval Augmented Generation?
Retrieval Augmented Generation (or RAG) is a technique whereby we dynamically provide the LLM with information that is contextually relevant to our most recent message, via our prompt.
Technically, the basic lorebook implementation in SillyTavern is a RAG system, since we are performing a pre-processing step on our message, selecting (via keyword matching) contextually relevant snippets of information from a data bank (our lorebook), and then stuffing that information into our prompt before we send it to the LLM.
Properly set up, this can be hugely powerful. There are many SillyTavern users who successfully use this system to provide their bots with long and short-term memory, bolster the worldbuilding of their settings, and provide depth to the personalities of their characters.
This system has a flaw, however: for an entry to be inserted into our prompt, an exact keyword match must exist between the text of the message we send and the keywords attached to that lorebook entry. For example, let's imagine we are chatting with a bot called Kay, and that Kay has a lorebook attached to her character card. This lorebook has the following entry:
Title: Kay's Preferences
Description: Kay likes the programming languages Python, JavaScript, and VB.NET
Keywords: Kay, Python, JavaScript, .NET
Let's say we then send the following message to the bot:
What programming languages do you like?
Even though the message we sent is highly semantically relevant to our lorebook entry, it won't be sent along with the prompt because none of the tokens in our message match the keywords.
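To make the failure mode concrete, here is a rough Python sketch of what keyword matching boils down to. This is just an illustration of the principle, using the entry data from our example above - it is not SillyTavern's actual implementation:

```python
# A naive keyword matcher, illustrating the principle only - this is not
# how SillyTavern implements lorebooks internally.
lorebook = [
    {
        "title": "Kay's Preferences",
        "description": "Kay likes the programming languages Python, JavaScript, and VB.NET",
        "keywords": ["Kay", "Python", "JavaScript", ".NET"],
    },
]

def match_entries(message: str) -> list[dict]:
    """Return every entry with at least one keyword appearing verbatim in the message."""
    text = message.lower()
    return [e for e in lorebook if any(k.lower() in text for k in e["keywords"])]

print(match_entries("What programming languages do you like?"))  # [] - no keyword hits
print(match_entries("Do you like Python?"))                      # matches - "python" appears
```

No matter how semantically close the question is to the entry, the matcher only sees characters, not meaning.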
You could, of course, get around this problem by adding lots and lots of keywords to your lorebook entries, but this creates a new problem: you start to get a high degree of overlap between entries. If you have a bot you've been chatting with for a long time, it could end up having many hundreds of lorebook entries as you add more and more memories to it. You could end up wasting thousands of tokens' worth of precious context on memories that aren't actually relevant to the message you have just sent.
If only there was a way to scan and retrieve information related to your bot in a way that actually took the semantic meaning of the words in your message into account...
Vectors! Embeddings!
Great news, there is!
The big-brains who laid the foundations for large language models also invented a way to map out human languages that allows us to measure how similar two pieces of text are to each other. This map is called an embeddings model, and it's what we will be setting up today.
Explaining how this works is notoriously difficult - you don't really need to understand it in order to benefit from it - but just for the sake of being thorough, I will try to explain it using the metaphor that allowed me to understand it.
Imagine you are standing in a massive warehouse.
It's 10 miles long in every direction. Every couple of minutes, I enter the warehouse and hand you a ping-pong ball. On each ball is a word written in English. Your job is to take each ball in turn and place it on the floor. I want you to place words that are similar in meaning closer together, and words that are different from each other further apart.
If I give you two balls, "cat" and "dog", you'd place these close together, because cats and dogs are both fluffy, they are both animals, they both have four legs, they are both domesticated species etc.
If I then gave you another ball, "asteroid", you'd place it miles away from the first two, because the word "asteroid" has nothing to do with either "cat" or "dog". If I gave you a fourth ball, "tree", you'd place it somewhere in between the two groups we've just made (but slightly closer to cat/dog than asteroid), because although a tree is a living thing, it's not an animal and it's not furry.
This principle - of grouping related words together - is more or less what an embeddings model does. Instead of being a massive warehouse full of balls that represent words, however, an embeddings model is a massive blob of floating point numbers that represent tokens. The math that allows this to work is also far more complex than our two-dimensional example. Where in our 2D example we place each ball on an x axis and a y axis, an embeddings model places the balls in an arbitrary number of dimensions. Some of the larger, more accurate models can have many thousands of dimensions, with each dimension representing some axis of semantic meaning.
If you train a model like this on a large enough corpus of text, what you end up with is an n-dimensional map of all the words and concepts in a given language, and how related they are to each other. This would be neat even if it didn't do anything immediately useful, but you can then pass arbitrary text through this model and figure out where it lies semantically on that map!
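Since Ollama is already on our list of requirements, you can poke at this yourself. The sketch below assumes you have Ollama running locally on its default port and have pulled an embedding model (e.g. ollama pull nomic-embed-text); it calls Ollama's local HTTP API to turn a phrase into a vector:

```python
import json
import urllib.request

def embed(text: str) -> list[float]:
    """Ask a locally running Ollama instance to vectorise a piece of text."""
    request = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=json.dumps({"model": "nomic-embed-text", "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]

vector = embed("What programming languages do you like?")
print(len(vector))  # the number of dimensions in the model's "warehouse"
print(vector[:5])   # the first few coordinates of where the phrase was placed
```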
Let's look at our example again:
What programming languages do you like?
Let's say I gave you a ping-pong ball with this phrase written on it, and you'd already placed a ping-pong ball on the floor for every possible token in the English language. Your job now is to find the most relevant place on the floor to place that phrase. Let's say you spend some amount of time manually doing this, and you decide that the best place for it is at floor coordinate X: -12, Y: 5.
Congratulations, you've just created a vector for the phrase What programming languages do you like?
If we'd previously created a vector for the lorebook example Kay likes the programming languages Python, JavaScript, and VB.NET, we'd now have two vectors that, together, contain calculable information on how closely related these two phrases are. Using these vectors, there are simple mechanisms by which we can put a score on their similarity - a single number that represents how geometrically close any two phrases are to each other in vector space.
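One common mechanism is cosine similarity, which scores how closely two vectors point in the same direction. Here's a minimal sketch using made-up 2D "warehouse" coordinates - real embeddings have hundreds or thousands of dimensions, but the math is identical:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means pointing the same way in vector space; lower means less related."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 2D "warehouse" coordinates - invented for illustration.
question = [-12.0, 5.0]   # "What programming languages do you like?"
memory   = [-11.0, 6.0]   # "Kay likes the programming languages Python..."
asteroid = [9.0, -14.0]   # something semantically unrelated

print(cosine_similarity(question, memory))    # ~0.99 - highly related
print(cosine_similarity(question, asteroid))  # much lower - unrelated
```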
If you have a large databank of "vectorised" bot memories, you can vectorise any message you write, and then perform a vector similarity search against the contents of your databank. If you rank everything within it against your input phrase, what you get is a ranked list of the most semantically relevant memories for the current message.
SillyTavern ships with a default plugin that allows us to do all of this. Every time you send a message, it will be vectorised, compared to the contents of your databank, and any snippet within it that has a vector similarity score above a configurable threshold will be pulled out. You can take one or more of these top-ranked snippets, stuff them into your prompt right before you send it to the LLM, and you've just magically given your chat bot a proper, working memory mechanism!
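Put together, the whole pipeline looks conceptually something like the sketch below. It reuses the embed() and cosine_similarity() helpers from the earlier sketches, and the sample memories, threshold, and prompt format are all invented for illustration - the actual plugin handles these details for you:

```python
# Builds on the embed() and cosine_similarity() helpers defined above.
# The databank: one embedding per saved memory, computed once up front.
memories = [
    "Kay likes the programming languages Python, JavaScript, and VB.NET",
    "Kay grew up in a small coastal town",
    "Kay is afraid of spiders",
]
databank = [(memory, embed(memory)) for memory in memories]

def retrieve(message: str, threshold: float = 0.5, top_k: int = 2) -> list[str]:
    """Rank every memory against the message and keep the best scorers."""
    query = embed(message)
    scored = sorted(
        ((cosine_similarity(query, vector), text) for text, vector in databank),
        reverse=True,
    )
    return [text for score, text in scored[:top_k] if score >= threshold]

# Stuff the winners into the prompt right before it goes to the LLM.
relevant = retrieve("What programming languages do you like?")
prompt = "[Relevant memories]\n" + "\n".join(relevant) + "\n\nUser: What programming languages do you like?"
print(prompt)
```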
So How Do We Do This?
There are three steps to making this work in SillyTavern:
1) Get an embedding model running locally
2) Set up a system for creating memories
3) Set up the SillyTavern Vector Storage plugin
I will explain how I use these systems. It's worth noting at this point that there are numerous options here. Just because I do something one way does not mean it's the best way; it's just the way I find easiest to manage. Once you have this up and running, I'd recommend you experiment with these systems and find a setup + workflow that works for you!