Training a Large Language Model to Connect with APIs
In May 2023, a paper titled “Gorilla: Large Language Model Connected with Massive APIs” emerged, introducing an LLM that has been fine-tuned to address the challenge of connecting LLMs with APIs. This breakthrough allows the LLM to perform various tasks, including data calculations and even self-improvement.
The Gorilla authors claim that their model hallucinates less in generated code than GPT-4 or Claude. Our analysis suggests that this is due to Gorilla’s dataset, which is sourced from the latest information on existing APIs, while the data used for GPT-4 or Claude may have last been updated in 2021. This up-to-date dataset likely enables Gorilla to generate more accurate and reliable code by drawing on the most current API knowledge available.
Training
Basically, Gorilla fine-tunes the LLaMA 7B model on appropriately formatted data. The hardware required is roughly 125 GB to 250 GB of RAM, or approximately four GPUs with 25 GB of memory each. These requirements can potentially be reduced by offloading model parameters to an NVMe SSD or by training with the LoRA method and 8-bit quantization.
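As a rough check on the figures above, the lower bound is close to what a common back-of-the-envelope accounting for mixed-precision Adam training gives: about 16 bytes of model state per parameter. The sketch below assumes a nominal 7 billion parameters and ignores activations and temporary buffers, so treat the results as estimates only.

```python
# Back-of-the-envelope memory estimate for fine-tuning; all figures are approximations.

def model_state_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model states only; activations and buffers come on top."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 7e9  # nominal size of a "7B" model (assumption; the exact count differs slightly)

# Mixed-precision Adam: fp16 weights (2) + fp16 gradients (2)
# + fp32 master weights, momentum, and variance (4 + 4 + 4) = 16 bytes per parameter.
full_finetune_gb = model_state_gb(N_PARAMS, 16)

# Frozen base weights stored in 8 bits: 1 byte per parameter.
int8_weights_gb = model_state_gb(N_PARAMS, 1)

print(f"full fine-tuning model states: ~{full_finetune_gb:.0f} GB")
print(f"8-bit frozen base weights:     ~{int8_weights_gb:.0f} GB")
```

The gap between the two numbers is why offloading and quantization make such a difference to the hardware you need.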
By employing NVMe SSD for offloading model parameters, the memory requirements can be reduced as the parameters are stored externally, freeing up RAM for other tasks. This can help optimize the hardware specifications and potentially lower the RAM needed for running the model.
Another approach is training with the LoRA method on a base model quantized to 8 bits. Quantization reduces the precision of the stored weights from 32 (or 16) bits to 8 bits, which significantly shrinks the memory footprint, usually with only a small effect on model quality. LoRA then keeps the quantized base weights frozen and trains only a small set of low-rank adapter weights, so gradients and optimizer states are needed for a tiny fraction of the parameters.
Both of these approaches can be explored to bring the hardware requirements for fine-tuning down to a more manageable level.
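To get a feel for how few weights LoRA actually trains, you can count the adapter parameters. This is a minimal sketch, assuming rank 8 and adapters on only the query and value projections of each attention layer; both are illustrative choices, not settings taken from the Gorilla paper.

```python
# Count LoRA's trainable parameters for a LLaMA-7B-sized model.
HIDDEN_SIZE = 4096     # hidden dimension of LLaMA 7B
NUM_LAYERS = 32        # transformer blocks in LLaMA 7B
RANK = 8               # LoRA rank (illustrative assumption)
ADAPTED_PER_LAYER = 2  # assume adapters on the q and v projections only

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)

trainable = NUM_LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
print(f"trainable LoRA parameters: {trainable:,}")
```

Under these assumptions only a few million parameters need gradients and optimizer states, compared with billions for full fine-tuning.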
This walkthrough uses the FastChat training script from lm-sys. FastChat is an open platform designed for training, serving, and evaluating chatbots based on large language models. Its core features include:
- The weights, training code, and evaluation code for state-of-the-art models (e.g., Vicuna, FastChat-T5).
- A distributed multi-model serving system with web UI and OpenAI-compatible RESTful APIs.
First, clone the Git repository and install the dependencies:
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install -e .
Second, format the data according to the example provided by FastChat, as shown below. The response you put in the dataset must be accurate and machine-executable. For instance, to generate REST API calls from user prompts, you would format each record roughly as follows:
{
"api_name": "name of API",
"api_call": "localhost:8000/get_data",
"api_methods": "GET",
"api_params": ["id"],
"prompt_params": [{
"id": 2100
}]
}
Then, the data needs to be reformatted into instruction and output pairs, as shown below:
[
{
"id": "identity_0",
"conversations": [
{
"from": "human",
"value": "Who are you?"
},
{
"from": "gpt",
"value": "{\"api_name\": \"name of API\", \"api_call\": \"localhost:8000/get_data\", \"api_methods\": \"GET\", \"api_params\": [\"id\"], \"prompt_params\": [{\"id\": 2100}]}"
}
]
}
]
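The conversion from an API record to a conversation entry can be sketched in a few lines of Python. The helper name `to_conversation` and the sample prompt are hypothetical; only the field names come from the examples above.

```python
import json

def to_conversation(index: int, user_prompt: str, api_record: dict) -> dict:
    """Wrap one prompt/API-record pair in the conversation format shown above."""
    return {
        "id": f"identity_{index}",
        "conversations": [
            {"from": "human", "value": user_prompt},
            # The assistant turn is the API record serialized as a JSON string.
            {"from": "gpt", "value": json.dumps(api_record)},
        ],
    }

api_record = {
    "api_name": "name of API",
    "api_call": "localhost:8000/get_data",
    "api_methods": "GET",
    "api_params": ["id"],
    "prompt_params": [{"id": 2100}],
}

# Hypothetical user prompt that should map to the record above.
dataset = [to_conversation(0, "Get the data for id 2100", api_record)]

# Write this string to the file you later pass as --data_path.
serialized = json.dumps(dataset, indent=2)
print(serialized)
```

Serializing with `json.dumps` handles the quote escaping for you and guarantees the file is valid JSON.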
You can refer to the more comprehensive dummy data format provided by FastChat at the following link:
The next step is to proceed with the training. The example script provided by FastChat at https://github.com/lm-sys/FastChat/blob/main/scripts/test_train.sh launches training with torchrun. On its own, torchrun only starts the distributed processes; running the training through DeepSpeed adds memory optimizations such as ZeRO partitioning and CPU offloading, which help when GPU memory is limited. To install DeepSpeed, run the following command:
pip install transformers[deepspeed]
For more detailed information about DeepSpeed, you can refer to the following page:
Next, you need to navigate to the FastChat script directory and create a DeepSpeed configuration. Below is an example configuration, but you need to adjust it according to the hardware components you have:
{
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"steps_per_print": 50,
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu"
},
"contiguous_gradients": true,
"overlap_comm": true
},
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": true
},
"wall_clock_breakdown": false
}
After that, you need to modify the test_train.sh script to make it compatible with DeepSpeed. Here is an example of the modified script:
deepspeed \
--master_port=20001 ../fastchat/train/train.py \
--save_total_limit 2 \
--model_name_or_path /path/to/model/llama-7b \
--data_path /path/to/data.json \
--fp16 True \
--output_dir gorilla-model/ \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "steps" \
--eval_steps 6 \
--save_strategy "steps" \
--save_steps 6 \
--logging_steps 6 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to "none" \
--deepspeed ds_config.json
To execute the script and start the fine-tuning process, run the command bash test_train.sh. Once the fine-tuning process is completed, you can serve the model using FastChat or perform inference directly in the command-line interface (CLI) using the following command:
python -m fastchat.serve.cli --model-path gorilla-model
Make sure to replace gorilla-model with the actual path to your fine-tuned model. Running the above command starts an interactive session in which the model responds to the input text you provide. Alternatively, you can run the model behind an API or a Gradio web UI by following the instructions provided by FastChat.
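For the API route, a request to FastChat's OpenAI-compatible server might look like the sketch below. It assumes the server is already running and listening on localhost port 8000, and that the model is registered under the name gorilla-model; adjust both to your setup. The helper names are hypothetical.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # assumed host and port of the API server
MODEL_NAME = "gorilla-model"           # assumed name the model is served under

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(ask("Get the data for id 2100"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.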
Conclusion
The concept provided by Gorilla is crucial in expanding the ability of LLMs to execute specific tasks. However, the model's output is only a description of an API call. It still has to be executed by the surrounding application, rather than being treated as the final result.
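As one possible follow-up step, the JSON the model emits can be parsed and turned into a concrete request. The sketch below assumes the output follows the record format used in the training data, that the API is reached over plain HTTP, and that parameters are passed in the query string; the function name `build_call` is hypothetical.

```python
import json
from urllib.parse import urlencode

def build_call(model_output: str) -> tuple[str, str]:
    """Turn the model's JSON output into an (HTTP method, URL) pair."""
    spec = json.loads(model_output)
    values = spec["prompt_params"][0]
    # Keep only the parameters the API declares.
    query = {name: values[name] for name in spec["api_params"]}
    url = f"http://{spec['api_call']}"  # assumes plain HTTP
    if query:
        url += "?" + urlencode(query)
    return spec["api_methods"], url

model_output = (
    '{"api_name": "name of API", "api_call": "localhost:8000/get_data", '
    '"api_methods": "GET", "api_params": ["id"], "prompt_params": [{"id": 2100}]}'
)
method, url = build_call(model_output)
print(method, url)
```

In practice you would validate the parsed output against a list of allowed endpoints before sending anything, since the model can still produce malformed or unexpected calls.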
References
Gorilla Paper: https://arxiv.org/abs/2305.15334