Date: 2023-10-06

I'm a blockchain technology lead. My passions are distributed systems, efficient algorithms, and retrocomputing. I have a PhD (Dottorato di Ricerca) in Computer Science and worked as a researcher at university. I’m Italian, which means I’m pretty opinionated about food.

A recent survey by Salesforce indicates that 23 percent of service companies are currently using chatbots, and that percentage is expected to more than double in the next eighteen months. FastChat is a platform for training, serving, and evaluating large language models that are behind many of these chatbots.

In this article, we’ll explore how to use FastChat to implement a simple AI chatbot in a JavaScript web app, which is just one of the functionalities that FastChat makes available within its unique architecture.

Setting up FastChat

FastChat is available as an open source Python library and can be installed using pip:

pip3 install "fschat[model_worker,webui]"

The above command installs fschat along with two optional modules: model_worker, to handle the different models that can be served by FastChat, and webui, to host a web interface for interacting with the chatbot model.

The FastChat platform provides access to multiple chatbot models, and each model is assigned to a worker (hence the model_worker name). The chatbot models are accessible through a fully OpenAI-compatible API server and can be used directly with the openai-python library.

This is a pretty smart design; you can test out different models with your chatbot just by changing the URL you use to address different workers. Most importantly, any pre-existing application that uses the OpenAI API can easily leverage a FastChat server without any modification in the source code.
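As a rough sketch of what "changing the URL" means in practice, the snippet below builds the same chat request against two different base URLs using only Python's standard library. The port 8000 address matches the API server used later in this article; the second address and the helper name build_chat_request are hypothetical, chosen only for illustration. Nothing is sent over the network here.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same request, two servers: only the base URL differs.
local_req = build_chat_request("http://localhost:8000", "flan-t5-large", "Hello!")
# Hypothetical second server, assumed for illustration only.
other_req = build_chat_request("http://localhost:8001/", "flan-t5-large", "Hello!")
print(local_req.full_url)
print(other_req.full_url)
```

Passing either object to urllib.request.urlopen would send it, provided a server is actually listening at that address.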

Choosing a suitable language model

The “large” in large language model (LLM) is a hint that we’ll need large amounts of memory and computational power to train a new model for generative inference. Fortunately, there’s a pretty extensive list of language models that are compatible with FastChat.

To choose the model that’s most suitable for your project’s requirements, consider availability of hardware resources (GPUs, GPU memory, and/or CPU memory) and how much money you want to invest for a cloud service.
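A back-of-the-envelope estimate can help with the hardware question. The sketch below computes the memory needed just to hold a model's weights, as parameter count times bytes per parameter. The figure of 780 million parameters is an approximate size for flan-t5-large and is an assumption here, not something measured in this article. Real memory usage will be higher, because activations and framework overhead are not counted.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


APPROX_PARAMS = 780_000_000  # assumed, approximate parameter count

fp32_gib = weight_memory_gib(APPROX_PARAMS, 4)  # 32-bit floats
fp16_gib = weight_memory_gib(APPROX_PARAMS, 2)  # 16-bit floats
print(f"fp32: {fp32_gib:.2f} GiB, fp16: {fp16_gib:.2f} GiB")
```

Under these assumptions the weights need roughly 3 GiB in 32-bit precision, which is consistent with a model that fits in a laptop's RAM.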

Also consider what kind of output the system you are designing needs to generate, such as open-ended conversation, summarization, translation, or classification.

Here’s a comparison of some commercially available LLMs:

GPT-3: This versatile language model from OpenAI excels in a wide range of natural language processing tasks. Its strength lies in generating coherent and contextually relevant text, making it suitable for tasks like content generation, text completion, chatbots, and creative writing. Its expansive knowledge base allows it to answer questions and provide explanations across various domains.

BERT (Bidirectional Encoder Representations from Transformers): This model, developed by Google, is known for its understanding of context and context-dependent word meanings. Its bidirectional training enables it to capture nuances in language, making it a great choice for tasks requiring deep comprehension. This model is particularly effective for tasks like sentiment analysis, named entity recognition, text classification, and question answering.

T5 (Text-to-Text Transfer Transformer): This model, also from Google, is designed to handle a wide range of natural language processing tasks through a unified framework of “text-to-text” conversion. It treats every task like a text generation problem, including text summarization, translation, classification, and more. This approach simplifies the model’s architecture and makes it adaptable for various tasks.

For the tutorial portion of this article, we’ll use the google/flan-t5-large model. It is based on Google’s T5 model and is powerful enough to fruitfully interact with, yet it runs comfortably on an average laptop (so it is not that large 🙂).

To start, download and serve the model using the following command:

python3 -m fastchat.serve.model_worker --model-path google/flan-t5-large

This command sets up a model_worker that will download the model weights and deploy a suitable API to interact with the specified model. The command also accepts --device cpu as an additional parameter, which lets the model run on the CPU and rely only on RAM instead of the VRAM of your GPU.

Next, use the following command to run the model interactively right in the command line:

python3 -m fastchat.serve.cli --model-path google/flan-t5-large --device cpu

Launching the FastChat controller

The controller is a centerpiece of the FastChat architecture. It orchestrates the calls toward the instances of any model_worker you have running and checks the health of those instances with a periodic heartbeat.
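To make the heartbeat idea concrete, here is a toy registry that remembers when each worker last checked in and drops the ones that have gone quiet. This is a minimal sketch of the general technique, not FastChat's actual implementation: the class name WorkerRegistry, the 90-second expiry window, and the worker addresses are all assumptions made for illustration.

```python
import time
from typing import Dict, List, Optional


class WorkerRegistry:
    """Toy controller-side registry that ignores workers with a stale heartbeat."""

    def __init__(self, expiry_seconds: float = 90.0) -> None:
        self.expiry_seconds = expiry_seconds  # assumed value, for illustration
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, worker_addr: str, now: Optional[float] = None) -> None:
        """Record that a worker checked in (registers it on the first call)."""
        self._last_seen[worker_addr] = time.time() if now is None else now

    def alive_workers(self, now: Optional[float] = None) -> List[str]:
        """Return workers whose last heartbeat falls within the expiry window."""
        current = time.time() if now is None else now
        return sorted(
            addr
            for addr, seen in self._last_seen.items()
            if current - seen <= self.expiry_seconds
        )


registry = WorkerRegistry(expiry_seconds=90.0)
registry.heartbeat("http://localhost:21002", now=0.0)   # hypothetical worker
registry.heartbeat("http://localhost:21003", now=60.0)  # hypothetical worker
print(registry.alive_workers(now=100.0))
```

At time 100 the first worker has been silent for longer than the expiry window, so only the second one is reported as alive. The timestamps are passed in explicitly to keep the example deterministic.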

This mechanism is well illustrated in the following diagram from the FastChat GitHub repository:

The web interface to FastChat workers is a Gradio server that hosts all the UI components needed to interact with the models. The controller is the component that coordinates the different workers on behalf of that web server and of the OpenAI-compatible API server, as shown in the above diagram.

The controller mechanism is agnostic toward the kind of worker you use: both on-premises and cloud workers are acceptable. Use the following command to launch the controller, which will listen for model_worker instances to connect:

python3 -m fastchat.serve.controller

The controller will listen on http://localhost:21001/ and handle the connection as soon as a new model_worker becomes available.

Creating the UI

The fastest way to build the chatbot’s UI is to use the webui module. This module uses Gradio, a library of web components designed to simplify the deployment of UI for interacting with AI pipelines and chatting with chatbots.

The webui module, available in gradio_web_server.py in the FastChat repository, provides a simple UI that will address the correct model_worker:

python3 -m fastchat.serve.gradio_web_server

The web app will be available on http://localhost:7860/ and will handle the interaction with the model.

Building a simple chatbot web app

Let’s build a simple web app, written in JavaScript, that uses the OpenAI API hosted by FastChat.

As you might expect, we will need the model_worker, as well as a controller to handle the worker. This time, we’ll use the openai_api_server to host the OpenAI API. This will be a full-fledged OpenAI API but, since we’ll host it on our own hardware with a pre-trained model, it will not require an API key.

The code for this example is available on GitHub. The three Python scripts must each be run in a different shell window.

First, launch the controller:

python3 -m fastchat.serve.controller

Then, launch the model worker(s):

python3 -m fastchat.serve.model_worker --model-path google/flan-t5-large

Finally, launch the RESTful API server:

python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
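If juggling three shell windows becomes tedious, the start order above can be captured in a small Python helper. The sketch below only builds the three command lines, using the same modules and flags shown above. The function name build_commands is hypothetical, and nothing is launched when the snippet runs.

```python
import sys
from typing import List


def build_commands(model_path: str, host: str = "localhost", port: int = 8000) -> List[List[str]]:
    """Return the controller, worker, and API server commands, in start order."""
    python = sys.executable  # the interpreter running this script
    return [
        [python, "-m", "fastchat.serve.controller"],
        [python, "-m", "fastchat.serve.model_worker", "--model-path", model_path],
        [python, "-m", "fastchat.serve.openai_api_server", "--host", host, "--port", str(port)],
    ]


commands = build_commands("google/flan-t5-large")
for command in commands:
    print(" ".join(command))
```

Each list could then be handed to subprocess.Popen, one after the other. You would probably want a pause between starts so that the controller is up before the worker tries to register. How long that pause should be is not covered here.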

Now, let’s test the API server. Once you have the OpenAI API server up and running, you’ll be able to interact with it at http://localhost:8000.

A quick curl will be enough to test it:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "google/flan-t5-large", "messages": [{"role": "user", "content": "Hello! What is your name?"}] }'
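The same check can be made from Python with the standard library. In the sketch below, ask sends one message to the endpoint used in the curl call, and extract_reply pulls the text out of the response, assuming the server returns the standard OpenAI chat completion shape (choices, then message, then content). Both function names are hypothetical, and the sample dictionary is hand-written for illustration rather than real model output.

```python
import json
import urllib.request
from typing import Any, Dict

API_URL = "http://localhost:8000/v1/chat/completions"


def extract_reply(response: Dict[str, Any]) -> str:
    """Pull the assistant's text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str, model: str = "google/flan-t5-large") -> str:
    """Send one user message to the local API server and return the reply text."""
    body = json.dumps({"model": model, "messages": [{"role": "user", "content": prompt}]})
    request = urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )  # supplying data makes this a POST request
    with urllib.request.urlopen(request) as http_response:
        return extract_reply(json.load(http_response))


# Hand-written sample in the expected response shape (not real model output).
sample = {"choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]}
print(extract_reply(sample))
```

With the three FastChat processes running, calling ask("Hello! What is your name?") should return the model's answer as a plain string.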

In the /node-webui directory, you’ll find a single index.html file containing the smallest amount of code required to interact with the OpenAI API through a user interface using simple JavaScript.

Here’s the relevant code:

const apiKey = "YOUR_OPENAI_API_KEY";
...
try {
  const response = await fetch('http://localhost:8000/v1/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${apiKey}`
    },
    body: JSON.stringify({
      model: "flan-t5-large",
      prompt: input,
      max_tokens: 50
    })
  });

The UI may not be very interesting, but the relevant part here is the interaction with the OpenAI API. As you can see, there is a simple invocation of the API by passing the standard API parameters.

In particular, the model parameter must contain the name of the model we want to address. This name is passed to the controller, which forwards the call to the relevant model_worker.

In the header there is another parameter, the Authorization field, which in the standard OpenAI API contains the API key we’d have to buy. In this demo, the API key is just a placeholder string that is not used in any way, since the model runs only on our own infrastructure (which, in my case, is my laptop).

Conclusion

FastChat is an amazing package that is designed to benchmark, interact, and experiment with a plethora of LLMs. It offers a quick way to host a chat web interface using Gradio, a standard framework for building machine learning web apps.

FastChat’s web interface is useful for testing a model, but for the more general case of hosting your own application, you can also expose a proper OpenAI-compatible API that accommodates any pre-existing code written against it.

FastChat’s architecture is also particularly efficient in terms of expandability: more LLMs can be hosted in order to test them side by side on the same input, or to have, for instance, specialized LLMs that can be orchestrated for a given task.
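A side-by-side test of this kind can be sketched in a few lines of Python. The helper below sends the same prompt to every model name it is given and collects the replies. The names compare_models and fake_ask are hypothetical, and so is the second model name, which assumes an additional worker is running. The stand-in fake_ask replaces the real API call so the sketch runs without a server.

```python
from typing import Callable, Dict, List


def compare_models(models: List[str], prompt: str, ask: Callable[[str, str], str]) -> Dict[str, str]:
    """Send the same prompt to every model and collect the replies by model name."""
    return {model: ask(model, prompt) for model in models}


def fake_ask(model: str, prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs without a server."""
    return f"[{model}] echo: {prompt}"


# "flan-t5-small" assumes a second worker has been started; it is illustrative only.
results = compare_models(["flan-t5-large", "flan-t5-small"], "What is FastChat?", fake_ask)
for model, reply in results.items():
    print(f"{model}: {reply}")
```

In a real setup, fake_ask would be swapped for a function that posts to the OpenAI-compatible API server with the given model name, leaving the comparison logic unchanged.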

