
wenda - an interesting open-source LLM platform that produces results comparable to large models

Preface#

This article briefly records my experience running wenda locally.

Wenda is an open-source LLM invocation platform that supports local knowledge bases and online search. It produces query-and-summarize output similar to AutoGPT's, and with small models it achieves results comparable to ChatGPT.


Main Content#

1. Introduction to Wenda#

Wenda is an LLM invocation platform designed to give small models knowledge-base search capabilities similar to those of large models. It supports the chatGLM-6B, chatRWKV, chatYuan, and llama series of models, and provides features such as automatic saving of conversation history, extending model capability with a knowledge base, online parameter adjustment, LAN/intranet deployment, and simultaneous use by multiple users.

2. Wenda Installation#

1. Install Wenda#

Download the project files and install the common dependencies:

git clone https://github.com/l15y/wenda.git
# Enter the project directory, then install the shared requirements
cd wenda
pip install -r requirements.txt

Install different dependency libraries according to the model to be used.

2. Download LLM Models#

It is recommended to use the ChatGLM-6B-int4 and RWKV-4-Raven-7B-v10 models. ChatGLM-6B-int4 can run with 6GB of VRAM or RAM, while RWKV's requirement depends on the loading strategy.

  • ChatGLM

ChatGLM-6B is an open source dialogue language model that supports both Chinese and English. It is based on the General Language Model (GLM) architecture and has 6.2 billion parameters. Users can deploy it locally on consumer-grade graphics cards (minimum 6GB VRAM required at INT4 quantization level) and customize their own application scenarios using the efficient parameter fine-tuning method P-Tuning v2. It should be noted that the model currently has certain limitations, such as the possibility of generating harmful/biased content.

The official project provides a "lazy" one-click package for download, which contains the main program and multiple models. You can also find further-optimized versions of the model on Hugging Face.
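To sanity-check a downloaded ChatGLM model outside of wenda, the standard usage from the ChatGLM-6B project works; a minimal sketch (this pulls from the Hugging Face hub, substitute a local path if you have one):

from transformers import AutoTokenizer, AutoModel

# trust_remote_code is required because ChatGLM ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

# Single-turn chat; `history` carries context across turns.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)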

If you want to run on an Apple Silicon GPU, you need to change the GLM6B strategy item in the config.xml file to mps fp16 and add two lines to the plugins/llm_glm6b.py file:

(Figure 4-glm6bformac.png: the two lines added to plugins/llm_glm6b.py for Apple Silicon)
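The exact lines are in the figure above; as a rough, hypothetical idea of what such a change usually looks like (my guess, not the figure's contents), it amounts to moving the half-precision model onto the Metal backend:

# Hypothetical sketch; the actual lines in plugins/llm_glm6b.py may differ.
model = model.half().to("mps")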

  • RWKV

ChatRWKV is a chat tool similar to ChatGPT, but driven by the RWKV (100% RNN) language model. It offers scalability and quality comparable to transformers while being faster and using less VRAM. It also provides a v2 version with streaming and layer-splitting strategies as well as INT8 quantization. When building a ChatRWKV chatbot, check the text of the state to prevent errors, and use the recommended chat format.

Currently, RWKV has a large number of models corresponding to various scenarios and languages:

  • Raven Model: Suited to direct chat and the +i command. Versions exist for many languages, so be sure to choose the correct one. Good for chatting, completing tasks, and writing code. It can be tasked with writing documents, outlines, stories, poems, etc., though its prose is not as good as the testNovel series models.
  • Novel-ChnEng Model: A Chinese-English novel model that can generate world settings with +gen (with good prompts you can control the plot and characters) and can write science fiction and fantasy. Not suited to chatting or the +i command.
  • Novel-Chn Model: A pure-Chinese web-novel model that can only continue web novels with +gen (it cannot generate world settings, etc.), but its writing is better (and more novice-friendly, suitable for both male- and female-oriented genres). Not suited to chatting or the +i command.
  • Novel-ChnEng-ChnPro Model: Novel-ChnEng fine-tuned on high-quality works (classics, science fiction, fantasy, classical literature, translations, etc.).

Wenda recommends using RWKV-4-Raven-7B-v10-Eng49%-Chn50%-Other1%-20230420-ctx4096, which can be downloaded from Hugging Face.

RWKV has the following model strategies to choose from:

(Figure 4-RWKVstrategy.jpg: table of RWKV loading strategies)

The RWKV-4-Raven-7B-v10 model is 13GB. Loading it with cuda fp16 requires 15GB+ of VRAM, which is unfriendly to consumer-grade graphics cards, so you can choose a strategy like cuda fp16i8 *20+. If VRAM is still insufficient, reduce the number of resident layers, e.g. *18+. This streams layers through the GPU, which is slower than loading everything at once. Some people even push the limits and run it on 2GB of VRAM with a cuda fp16i8 *0+ -> cpu fp32 *1 strategy; you can refer to this article for details.

If there is enough VRAM, it is still recommended to load everything with cuda. At the same parameter count, RWKV has a significant speed advantage over ChatGLM.
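For reference, a minimal sketch of loading the model with a chosen strategy via the rwkv pip package (the paths are assumptions; the 20B_tokenizer.json file comes from the ChatRWKV repository):

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Path to the .pth weights, given without the extension; keep 20 layers
# resident on the GPU in int8 and stream the rest.
model = RWKV(model="models/RWKV-4-Raven-7B-v10-Eng49%-Chn50%-Other1%-20230420-ctx4096",
             strategy="cuda fp16i8 *20+")
pipeline = PIPELINE(model, "20B_tokenizer.json")

out = pipeline.generate("Question: What is RWKV?\n\nAnswer:",
                        token_count=100,
                        args=PIPELINE_ARGS(temperature=1.0, top_p=0.7))
print(out)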

Compile RWKV's cuda_kernel#

In RWKV there is an environment variable, RWKV_CUDA_ON, which compiles RWKV's CUDA kernel to further accelerate inference (although it is already very fast without it 😀). It requires gcc 5+ and a configured CUDA_HOME. In my test environment (PyTorch 1.12, CUDA 11.3), compiling the kernel failed with #error You need C++14 to compile PyTorch. In that case, modify model.py in the rwkv library, changing the compile option -std=c++17 to -std=c++14, and the compilation completes.
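The switch itself is just an environment variable, which must be set before the rwkv library is imported; a minimal sketch (the CUDA_HOME path is an assumption, adjust it to your installation):

import os

# Must be set before `from rwkv.model import RWKV`; triggers on-the-fly
# compilation of the CUDA kernel.
os.environ["RWKV_CUDA_ON"] = "1"
os.environ["CUDA_HOME"] = "/usr/local/cuda"

from rwkv.model import RWKV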

  • LLaMa

There are many models in the llama series. vicuna-13B is recommended; choose the ggml int4-quantized version, which you can download here.

Note that compiling llama-cpp-python may fail on some machines; building it with gcc 11 succeeds. Here are the steps to compile gcc 11 and then build llama-cpp-python with it:

Compile and install gcc 11

git clone --branch releases/gcc-11.1.0 https://github.com/gcc-mirror/gcc.git
# Enter the source directory (the clone is named gcc)
cd gcc
./contrib/download_prerequisites
./configure --prefix=/usr/local/gcc-11.1.0 --enable-bootstrap --enable-languages=c,c++ --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib
# Install the dependencies required for compiling 32-bit C/C++ (needed by --enable-multilib)
yum install glibc-devel.i686 libstdc++-devel.i686
# Start compilation
make -j$(nproc) && make install

Compile and install llama-cpp-python

export CC=/usr/local/gcc-11.1.0/bin/gcc
export CXX=/usr/local/gcc-11.1.0/bin/g++
pip install llama-cpp-python
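
After installation, a quick smoke test confirms the native extension built correctly (the model path is a placeholder for wherever you saved the ggml file):

from llama_cpp import Llama

# If the C++ extension failed to build, this import raises immediately.
llm = Llama(model_path="./models/ggml-vicuna-13b-q4_0.bin")
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])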
  • OpenAI API

In wenda, you can also use the OpenAI API directly; combined with the knowledge base function, it achieves even better results.
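Independent of wenda, the underlying call is a standard chat completion; a minimal sketch with the openai Python package (the key and model name are placeholders):

import openai

openai.api_key = "sk-..."  # your API key

# A knowledge-base platform such as wenda prepends retrieved passages to the prompt.
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the following notes: ..."}],
)
print(resp.choices[0].message.content)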


3. Using Wenda#

After installing the main project and downloading the models, configure the config.xml file to set the model loading path and the knowledge-base mode parameters.

Knowledge Base Mode#

The knowledge base works by generating prompt information from retrieved text and inserting it into the conversation.
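Conceptually (a simplified illustration, not wenda's actual code), the retrieved passages are stitched into the prompt ahead of the user's question:

def build_prompt(question: str, passages: list) -> str:
    # Hypothetical sketch of knowledge-base prompt injection.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using the following reference material:\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )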

The project has the following modes:

  1. bing mode: cn.bing search; only available in mainland China
  2. bingxs mode: cn.bing academic search; only available in mainland China
  3. bingsite mode: Bing site search; requires setting the target website address
  4. st mode: sentence_transformers + faiss indexing of a local corpus (see the sketch after this list)
  5. mix mode: a fusion of the above
  6. fess mode: a locally deployed fess search engine plus keyword extraction
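
For st mode, a minimal sketch of what sentence_transformers + faiss indexing involves (the encoder model and corpus here are assumptions for illustration):

import faiss
from sentence_transformers import SentenceTransformer

corpus = ["wenda supports local knowledge bases.", "RWKV is a pure-RNN language model."]
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Encode the corpus and build a flat L2 index over the embeddings.
embeddings = encoder.encode(corpus)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Retrieve the passage closest to a query.
_, ids = index.search(encoder.encode(["What does wenda support?"]), 1)
print(corpus[ids[0][0]])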

These modes fall into two categories: online modes use Bing search to look up information on the web, while local modes supplement the model with a personal knowledge base.

For more details, please refer to the official mode introduction.

After everything is configured, run python wenda.py and open the printed URL to start using it.

The following are the test results when using the RWKV model:

(Figure 4-rwkv-1.png: RWKV test result)

When writing a paper, it supports automatic completion of sub-goals with question and answer:

(Figure 4-rwkv-2.png: automatic completion of sub-goals with question and answer)

Results for another topic:

(Figure 4-rwkv-3.png: results for another topic)

Test results using ChatGLM-6B-int4:

(Figure 4-chatglm6b-1.png: ChatGLM-6B-int4 test result)

(Figure 4-chatglm6b-2.png: ChatGLM-6B-int4 test result)

It seems that ChatGLM-6B performs better, and there doesn't seem to be much difference in speed.

As for the llama series, the test results are very poor, so it is better not to use them.

4. Conclusion#

The overall experience of using wenda is good. I have tried models including ChatGLM, RWKV, LLaMa, gpt4all, and LLaVa. Used directly, these models perform slightly worse than ChatGPT, but the test results through wenda surprised me with a significant improvement. Among them, ChatGLM-6B performs best, and using the OpenAI API achieves even better results.

However, even with the knowledge base function, accuracy is still hard to guarantee; ChatGPT has the same problem of occasionally generating nonsense. Careful discernment of the content's validity is therefore still required.

Overall, when using wenda, different models can provide a natural and smooth conversational experience. It is a great project for those who want to deploy locally.

Finally#

Reference articles:

wenda official project

wenda integration package

ChatGLM6B official project

RWKV official project

OpenAI API


Disclaimer#

This article is for personal learning purposes only.

This article is synchronized with HBlog.
