Add a prompt_prefix for customizing LLM chat bot output

2023-07-15 22:35:31 +02:00
commit 20273ec761
parent 2f98d81d23
4 changed files with 85 additions and 34 deletions
--- a/docs/source/Contribs/Contrib-Llm.md
+++ b/docs/source/Contribs/Contrib-Llm.md
@@ -31,7 +31,7 @@ There are many LLM servers, but they can be pretty technical to install and set
 4. Once all is loaded, stop the server with `Ctrl-C` (or `Cmd-C`) and open the file `webui.py` (it's one of the top files in the archive you unzipped). Find the text string `CMD_FLAGS = ''` near the top and change this to `CMD_FLAGS = '--api'`. Then save and close. This makes the server activate its api automatically.
 4. Now just run that server starting script (`start_linux.sh` etc) again. This is what you'll use to start the LLM server henceforth.
 5. Once the server is running, point your browser to http://127.0.0.1:7860 to see the running Text generation web ui running. If you turned on the API, you'll find it's now active on port 5000. This should not collide with default Evennia ports unless you changed something.
-6. At this point you have the server and API, but it's not actually running any Large-Language-Model (LLM) yet. In the web ui, go to the `models` tab and enter a github-style path in the `Download custom model or LoRA` field.  To test so things work, enter `facebook/opt-125m` and download. This is a relatively small model (125 million parameters) so should be possible to run on most machines using only CPU. Update the models in the drop-down on the left and select it, then load it with the `Transformers` loader. It should load pretty quickly. If you want to load this every time, you can select the `Autoload the model` checkbox; otherwise you'll need to select and load the model every time you start the LLM server.
+6. At this point you have the server and API, but it's not actually running any Large-Language-Model (LLM) yet. In the web ui, go to the `models` tab and enter a github-style path in the `Download custom model or LoRA` field. To test that things work, enter `DeepPavlov/bart-base-en-persona-chat` and download. This is a relatively small model (350 million parameters), so it should be possible to run on most machines using only CPU. Update the models in the drop-down on the left and select it, then load it with the `Transformers` loader. It should load pretty quickly. If you want to load this every time, you can select the `Autoload the model` checkbox; otherwise you'll need to select and load the model every time you start the LLM server.
 7. To experiment, you can find thousands of other open-source text-generation LLM models on [huggingface.co/models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending). Beware to not download a too huge model; your machine may not be able to load it! If you try large models, _don't_ set the `Autoload the model` checkbox, in case the model crashes your server on startup.

 For troubleshooting, you can look at the terminal output of the `text-generation-webui` server; it will show you the requests you do to it and also list any errors. See the text-generation-webui homepage for more details.
@@ -119,6 +119,8 @@ This is a simple Character class, with a few extra properties:

 ```python
    # response template on msg_contents form.
+    prompt_prefix = ("You will chat and roleplay ")
+
    response_template = "$You() $conj(say) (to $You(character)): {response}"
    thinking_timeout = 2    # how long to wait until showing thinking

--- a/evennia/contrib/rpg/llm/README.md
+++ b/evennia/contrib/rpg/llm/README.md
@@ -31,7 +31,7 @@ There are many LLM servers, but they can be pretty technical to install and set
 4. Once all is loaded, stop the server with `Ctrl-C` (or `Cmd-C`) and open the file `webui.py` (it's one of the top files in the archive you unzipped). Find the text string `CMD_FLAGS = ''` near the top and change this to `CMD_FLAGS = '--api'`. Then save and close. This makes the server activate its api automatically.
 4. Now just run that server starting script (`start_linux.sh` etc) again. This is what you'll use to start the LLM server henceforth.
 5. Once the server is running, point your browser to http://127.0.0.1:7860 to see the running Text generation web ui running. If you turned on the API, you'll find it's now active on port 5000. This should not collide with default Evennia ports unless you changed something.
-6. At this point you have the server and API, but it's not actually running any Large-Language-Model (LLM) yet. In the web ui, go to the `models` tab and enter a github-style path in the `Download custom model or LoRA` field.  To test so things work, enter `facebook/opt-125m` and download. This is a relatively small model (125 million parameters) so should be possible to run on most machines using only CPU. Update the models in the drop-down on the left and select it, then load it with the `Transformers` loader. It should load pretty quickly. If you want to load this every time, you can select the `Autoload the model` checkbox; otherwise you'll need to select and load the model every time you start the LLM server.
+6. At this point you have the server and API, but it's not actually running any Large-Language-Model (LLM) yet. In the web ui, go to the `models` tab and enter a github-style path in the `Download custom model or LoRA` field. To test that things work, enter `DeepPavlov/bart-base-en-persona-chat` and download. This is a relatively small model (350 million parameters), so it should be possible to run on most machines using only CPU. Update the models in the drop-down on the left and select it, then load it with the `Transformers` loader. It should load pretty quickly. If you want to load this every time, you can select the `Autoload the model` checkbox; otherwise you'll need to select and load the model every time you start the LLM server.
 7. To experiment, you can find thousands of other open-source text-generation LLM models on [huggingface.co/models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending). Beware to not download a too huge model; your machine may not be able to load it! If you try large models, _don't_ set the `Autoload the model` checkbox, in case the model crashes your server on startup.

 For troubleshooting, you can look at the terminal output of the `text-generation-webui` server; it will show you the requests you do to it and also list any errors. See the text-generation-webui homepage for more details.
@@ -119,6 +119,8 @@ This is a simple Character class, with a few extra properties:

 ```python
    # response template on msg_contents form.
+    prompt_prefix = ("You will chat and roleplay ")
+
    response_template = "$You() $conj(say) (to $You(character)): {response}"
    thinking_timeout = 2    # how long to wait until showing thinking

--- a/evennia/contrib/rpg/llm/llm_client.py
+++ b/evennia/contrib/rpg/llm/llm_client.py
@@ -26,6 +26,7 @@ import json

 from django.conf import settings
 from evennia import logger
+from evennia.utils.utils import make_iter
 from twisted.internet import defer, protocol, reactor
 from twisted.internet.defer import inlineCallbacks
 from twisted.web.client import Agent, HTTPConnectionPool, _HTTP11ClientFactory
@@ -37,6 +38,7 @@ DEFAULT_LLM_HOST = "http://127.0.0.1:5000"
 DEFAULT_LLM_PATH = "/api/v1/generate"
 DEFAULT_LLM_HEADERS = {"Content-Type": "application/json"}
 DEFAULT_LLM_PROMPT_KEYNAME = "prompt"
+DEFAULT_LLM_API_TYPE = ""  # or openai
 DEFAULT_LLM_REQUEST_BODY = {
    "max_new_tokens": 250,  # max number of tokens to generate
    "temperature": 0.7,  # higher = more random, lower = more predictable
@@ -105,13 +107,54 @@ class LLMClient:
        self.headers = getattr(settings, "LLM_HEADERS", DEFAULT_LLM_HEADERS)
        self.request_body = getattr(settings, "LLM_REQUEST_BODY", DEFAULT_LLM_REQUEST_BODY)

+        self.api_type = getattr(settings, "LLM_API_TYPE", DEFAULT_LLM_API_TYPE)
+
+        self.agent = Agent(reactor, pool=self._conn_pool)
+
+    def _format_request_body(self, prompt):
+        """Structure the request body for the LLM server"""
+        request_body = self.request_body.copy()
+
+        prompt = "\n".join(make_iter(prompt))
+
+        request_body[self.prompt_keyname] = prompt
+
+        return request_body
+
+    def _handle_llm_response_body(self, response):
+        """Get the response body from the response"""
+        d = defer.Deferred()
+        response.deliverBody(SimpleResponseReceiver(response.code, d))
+        return d
+
+    def _handle_llm_error(self, failure):
+        """Correctly handle server connection errors"""
+        failure.trap(Exception)
+        return (500, failure.getErrorMessage())
+
+    def _get_response_from_llm_server(self, prompt):
+        """Call the LLM server and handle the response/failure"""
+        request_body = self._format_request_body(prompt)
+
+        d = self.agent.request(
+            b"POST",
+            bytes(self.hostname + self.pathname, "utf-8"),
+            headers=Headers(self.headers),
+            bodyProducer=StringProducer(json.dumps(request_body)),
+        )
+
+        d.addCallbacks(self._handle_llm_response_body, self._handle_llm_error)
+        return d
+
    @inlineCallbacks
    def get_response(self, prompt):
        """
        Get a response from the LLM server for the given npc.

        Args:
-            prompt (str): The prompt to send to the LLM server.
+            prompt (str or list): The prompt to send to the LLM server. If a list,
+                this is assumed to be the chat history so far, and will be added to the
+                prompt in a way suitable for the api.

        Returns:
            str: The generated text response. Will return an empty string
@@ -125,31 +168,3 @@ class LLMClient:
        else:
            logger.log_err(f"LLM API error (status {status_code}): {response}")
            return ""
-
-    def _get_response_from_llm_server(self, prompt):
-        """Call and wait for response from LLM server"""
-
-        agent = Agent(reactor, pool=self._conn_pool)
-
-        request_body = self.request_body.copy()
-        request_body[self.prompt_keyname] = prompt
-
-        d = agent.request(
-            b"POST",
-            bytes(self.hostname + self.pathname, "utf-8"),
-            headers=Headers(self.headers),
-            bodyProducer=StringProducer(json.dumps(request_body)),
-        )
-
-        d.addCallbacks(self._handle_llm_response_body, self._handle_llm_error)
-        return d
-
-    def _handle_llm_response_body(self, response):
-        """Get the response body from the response"""
-        d = defer.Deferred()
-        response.deliverBody(SimpleResponseReceiver(response.code, d))
-        return d
-
-    def _handle_llm_error(self, failure):
-        failure.trap(Exception)
-        return (500, failure.getErrorMessage())
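The new `_format_request_body` step in the hunk above can be sketched in isolation. This is a minimal illustration, not the contrib's code: `as_list` is a hypothetical stand-in for `evennia.utils.utils.make_iter` (assumed here to wrap a lone string in a list and leave lists untouched), `format_request_body` is a free function rather than a method, and the default body is trimmed to the two keys visible in the diff.

```python
import json

# Values copied from the diff above; the real default body has more keys.
DEFAULT_LLM_PROMPT_KEYNAME = "prompt"
DEFAULT_LLM_REQUEST_BODY = {
    "max_new_tokens": 250,  # max number of tokens to generate
    "temperature": 0.7,  # higher = more random, lower = more predictable
}


def as_list(value):
    """Hypothetical stand-in for make_iter: wrap non-list values in a list."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def format_request_body(prompt, base_body=None, keyname=DEFAULT_LLM_PROMPT_KEYNAME):
    """Copy the base body and store the prompt (joined by newlines) under keyname."""
    body = dict(DEFAULT_LLM_REQUEST_BODY if base_body is None else base_body)
    body[keyname] = "\n".join(as_list(prompt))
    return body


# A plain string and a chat-history list both end up as one prompt string.
single = format_request_body("Hello there")
history = format_request_body(["Hello there", "How are you?"])
payload = json.dumps(history)  # what would be sent as the POST body
```

Copying the base body before inserting the prompt keeps the shared default dict unchanged between requests, which is why the original calls `.copy()` first.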
--- a/evennia/contrib/rpg/llm/llm_npc.py
+++ b/evennia/contrib/rpg/llm/llm_npc.py
@@ -12,6 +12,7 @@ echo a 'thinking...' message if the LLM server takes too long to respond.

 from random import choice

+from django.conf import settings
 from evennia import Command, DefaultCharacter
 from evennia.utils.utils import make_iter
 from twisted.internet import reactor, task
@@ -19,11 +20,21 @@ from twisted.internet.defer import inlineCallbacks

 from .llm_client import LLMClient

+# fallback if not specified anywhere else. Check order is
+# npc.db.prompt_prefix, npcClass.prompt_prefix, then settings.LLM_PROMPT_PREFIX, then this
+DEFAULT_PROMPT_PREFIX = (
+    "You are roleplaying that your name is {name}, a {desc} existing in {location}. "
+    "Roleplay a suitable response to the following input only: "
+)
+

 class LLMNPC(DefaultCharacter):
    """An NPC that uses the LLM server to generate its responses. If the server is slow, it will
    echo a thinking message to the character while it waits for a response."""

+    # use this to override the prefix per class
+    prompt_prefix = None
+
    response_template = "$You() $conj(say) (to $You(character)): {response}"
    thinking_timeout = 2  # seconds
    thinking_messages = [
@@ -35,8 +46,19 @@ class LLMNPC(DefaultCharacter):
    @property
    def llm_client(self):
-        if not hasattr(self, "_llm_client"):
+        if not self.ndb.llm_client:
-            self._llm_client = LLMClient()
-        return self._llm_client
+            self.ndb.llm_client = LLMClient()
+        return self.ndb.llm_client
+
+    @property
+    def llm_prompt_prefix(self):
+        """get prefix, first from Attribute, then from class variable,
+        then from settings, then from default"""
+        # Attribute first, then class variable, then settings, then the default
+        return self.attributes.get(
+            "prompt_prefix",
+            default=self.prompt_prefix
+            or getattr(settings, "LLM_PROMPT_PREFIX", DEFAULT_PROMPT_PREFIX),
+        )

    @inlineCallbacks
    def at_talked_to(self, speech, character):
@@ -77,8 +99,18 @@ class LLMNPC(DefaultCharacter):
        # if response takes too long, note that the NPC is thinking.
        thinking_defer = task.deferLater(reactor, self.thinking_timeout, _echo_thinking_message)

+        prompt = (
+            self.llm_prompt_prefix.format(
+                name=self.key,
+                desc=self.db.desc or "commoner",
+                location=self.location.key if self.location else "the void",
+            )
+            + " "
+            + speech
+        )
+
        # get the response from the LLM server
-        yield self.llm_client.get_response(speech).addCallback(_respond)
+        yield self.llm_client.get_response(prompt).addCallback(_respond)


 class CmdLLMTalk(Command):
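
As a rough illustration of how the prefix is chosen and turned into the final prompt, the standalone sketch below follows the lookup order stated in the `DEFAULT_PROMPT_PREFIX` comment (Attribute, class variable, setting, default). The helper names `resolve_prompt_prefix` and `build_prompt` are hypothetical and exist only for this example; in the contrib the values come from `npc.db.prompt_prefix`, the class, and `settings.LLM_PROMPT_PREFIX`. The sketch also assumes that an empty value at any level falls through to the next one.

```python
DEFAULT_PROMPT_PREFIX = (
    "You are roleplaying that your name is {name}, a {desc} existing in {location}. "
    "Roleplay a suitable response to the following input only: "
)


def resolve_prompt_prefix(attribute_value=None, class_value=None, settings_value=None):
    """Return the first non-empty prefix in the documented order (assumption)."""
    for candidate in (attribute_value, class_value, settings_value):
        if candidate:
            return candidate
    return DEFAULT_PROMPT_PREFIX


def build_prompt(prefix, name, desc, location, speech):
    """Fill in the placeholders and append the player's speech, as at_talked_to does."""
    filled = prefix.format(
        name=name,
        desc=desc or "commoner",  # same fallbacks as in the diff
        location=location or "the void",
    )
    return filled + " " + speech


# No overrides anywhere: the module default is used.
default_prompt = build_prompt(resolve_prompt_prefix(), "Tom", None, None, "Hi")

# A class-level prefix without placeholders also works, because str.format
# ignores keyword arguments the template does not use.
custom_prompt = build_prompt(
    resolve_prompt_prefix(class_value="You will chat and roleplay "),
    "Tom", "blacksmith", "the forge", "Hi",
)
```

Because the placeholders are filled with `str.format`, a custom prefix that contains literal braces would need them doubled (`{{` and `}}`).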