The team: Shishir G. Patil and Tianjun Zhang are PhD students in Computer Science at UC Berkeley. Xin Wang is a senior researcher at Microsoft Research. Prof. Joseph E. Gonzalez is a professor in EECS at UC Berkeley and the lead faculty on the Gorilla project.
For anyone learning about Gorilla LLM for the first time, can you give a short description of the project? How did this idea come about and/or what problems were you trying to solve?
Joseph Gonzalez: Gorilla is designed to connect large language models with a wide range of services and applications exposed through APIs. Imagine if ChatGPT could interact with thousands of services, ranging from Instagram and Doordash to tools like Google Calendar and Stripe, to help you accomplish tasks. You could ask it to book a meeting with your collaborators, order their favorite foods, and pay for it. This may be how we interact with computers and even the web in the future.
Gorilla is an LLM trained with a technique we call retriever-aware training: given a task specified in natural language, it can pick the right API to perform it. We also introduce an Abstract Syntax Tree (AST) based sub-tree matching algorithm, which for the first time lets us measure hallucination in LLMs!
It looks like Gorilla is a fine-tuned LLaMA-based model. Today, there are many open source options to choose from when selecting a base model. Why was LLaMA chosen and not another model? Were multiple models fine-tuned and tested?
Shishir G. Patil: We chose LLaMA to start off with as it was considered the workhorse of open-source LLMs; many other models were derivatives of it for specific applications. Of course, we benchmarked Gorilla against GPT-4, GPT-3.5, and Claude-v1, which are considered the state of the art. However, this has changed since then - it's only been a week, but it feels like forever :) We have released two more Gorilla models, based on MPT-7B and Falcon-7B, because many of our users wanted to try it out commercially. We now have an Apache 2.0 license, which means Gorilla can be used commercially with no obligations!
Gorilla seems to be a single component within a multi-component system. For instance, there is an API database that is used to fetch the relevant context for the LLM. For some, it might not be obvious why this approach was taken. Can you explain the overall architecture and why certain approaches like this were taken?
Shishir G. Patil: Gorilla is an LLM that can be used in one of two modes. The first (and most popular) is the zero-shot mode: Gorilla takes the user’s query in natural language and returns the right API to call. But APIs often evolve over time - through versioning, changed end-points, re-ordered arguments, or deprecation. To make our system robust to this, we introduce a second mode of using Gorilla: retriever-aware. In this case, a retriever picks the most relevant API documentation, which is then appended to the user’s prompt. This keeps Gorilla cognizant of changes in APIs.
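As a rough illustration (not the project's actual code), the retriever-aware mode can be thought of as appending the retrieved API documentation to the user's query - using the "Use this API documentation for reference:" instruction described later in this interview - before the prompt reaches the model. The API doc below is a made-up snippet.

```python
# Minimal sketch of the retriever-aware prompt construction.
# Illustrative only; the API doc string here is a hypothetical example.

def build_retriever_aware_prompt(user_query: str, api_doc_json: str) -> str:
    """Append the retrieved API doc to the query, mirroring the wording
    described in the interview."""
    return f"{user_query}\nUse this API documentation for reference: {api_doc_json}"

doc = '{"api_call": "pipeline(\'automatic-speech-recognition\')", "provider": "HuggingFace"}'
print(build_retriever_aware_prompt("Transcribe this audio file to text.", doc))
```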
Gorilla is fine-tuned on a set of API documentation, but these APIs are often changing, with many variations (e.g., versioning, deprecation). Gorilla can be extended to retrieve updated API documentation before making its inference (retriever-aware), but can you explain whether there will be a time when continuous fine-tuning on every update is possible? Or will Gorilla aim to do periodic fine-tuning on a monthly/yearly schedule? Today the complexity and cost of doing this don't make sense.
Tianjun Zhang: The cost of fine-tuning is high, but there are many exciting proposals to bring it down. However, I don't know if people will like a model that changes every so often - people like consistency. So the goal would be to fine-tune it periodically, with a cadence that matches the development cycles of APIs, but not too frequently.
Latency is one of the biggest pain points when working with LLMs. What is Gorilla's average inference latency (considering varying hardware) and are there techniques you have found to improve overall latency?
Shishir G. Patil: Latency is indeed an open problem, which is challenging given the auto-regressive nature of generation in LLMs. For now, we haven’t really measured it, but in our hosted Colab, where we let users try Gorilla for free, we observe an average HTTPS request, including generating the API call, to take around 6.4s - which is not something users have complained about. Of course this can be optimized, but we were focused on other aspects - like releasing an Apache 2.0 model - and haven’t really looked at optimizing this. That’s a to-do for us, and we welcome contributions.
Note: This is likely very important research if we want LLMs to be commercially viable while maintaining the existing user experience of software applications (e.g., perceived user latency), rather than just throwing hardware at the problem.
It seems like Gorilla suffers from hallucinations from "zero-shot" inference (without retrieval), similar to other LLMs when invoked directly without constraints and context. Can you explain why hallucinations increase in this situation?
Tianjun Zhang: Actually, Gorilla already does a much better job of reducing hallucinations compared to even GPT-4. But like other LLMs, when zero-shot prompted, Gorilla answers purely from its internal knowledge. It is typically hard, even for humans, to answer such a question without checking any external sources or citations. Another difficulty comes from the brittleness of API calls: misspell even one character or word and it counts as a hallucination. This makes the task even harder compared to hallucinations in normal chat.
How much time and what hardware did it take to train Gorilla and were there any special techniques used during training?
Shishir G. Patil: We used an 8×A100 (40GB) GPU node to train and evaluate all the models. The time it takes varies a lot depending on the model and the API dataset. The smallest runs were around 10 GPU-hours in total, while the longer ones were around 120 GPU-hours. We used state-of-the-art techniques for compute (efficient attention mechanisms) and memory optimizations (sharding, checkpointing, and mixed-precision training). We did not use LoRA; all Gorilla models are end-to-end fine-tuned.
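For readers curious what such a recipe can look like, here is a rough sketch of a full (non-LoRA) fine-tune with sharding, gradient checkpointing, and mixed precision using Hugging Face Transformers. This is my own illustration, not the team's training script; the base model name and the dataset are placeholders.

```python
# Illustrative only - not the Gorilla team's training code. A full (non-LoRA)
# fine-tune with FSDP sharding, activation checkpointing, and bf16 mixed
# precision. `instruction_dataset` is a placeholder for a tokenized dataset
# of (instruction, API call) pairs.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "huggyllama/llama-7b"            # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="gorilla-style-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,                                # mixed-precision training
    gradient_checkpointing=True,              # trade compute for memory
    fsdp="full_shard auto_wrap",              # shard params/grads/optimizer state
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=instruction_dataset,        # placeholder dataset
    tokenizer=tokenizer,
)
trainer.train()
```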
During training, it looks like you are appending to the user prompt to reduce hallucinations ("Use this API documentation for reference: <retrieved_API_doc_JSON>"). In your testing, how did you come up with this prompt and were there any other prompts that could have worked? Why does this work so well?
Shishir G. Patil: We really didn’t put much thought into it, as it was a relatively simple prompt. We also tried a few variations of the above, and they seemed to work just fine. Not using the prompt doesn’t work, though - the model then gets confused about what the user’s query is and what the API documentation is.
Gorilla claims to perform better than GPT-4 for the same instructions. In my own testing with GPT-4, with system prompt instructions, it performs pretty well and hallucinates less when compared to other models (not fine-tuned). Can you provide more details on the prompt used to test against GPT-4? Was there a test done to add detailed prompt constraints and instructions with GPT-4 to compare against?
Shishir G. Patil: Great question! And this highlights the challenge - today we hear a lot of comments like “Oh, GPT-4 hallucinates a lot”, and another school that says “GPT-4 hallucinates less”. And when we ask by how much, we usually get a shrug of the shoulders. With Gorilla, we introduce Abstract Syntax Tree (AST) sub-tree matching - a concept from programming languages research - to quantify for the first time, albeit in a restricted domain, how much these LLMs hallucinate. We are able, for the first time, to measure and report on hallucinations.
Can you provide a prompt example that was used to test GPT-4? Assuming it wasn't a "zero-shot" test and there were some system prompt instructions included in the test.
Shishir G. Patil: We kept the prompt simple as that didn't seem to matter much, given we wanted brief and crisp answers. Here is an example for a given user "question":
question + "\nWrite a python program in 1 to 2 lines to call API in " + api_name + ".\n\nThe answer should follow the format: <<<domain>>> $DOMAIN, <<<api_call>>>: $API_CALL, <<<api_provider>>>: $API_PROVIDER, <<<explanation>>>: $EXPLANATION, <<<code>>>: $CODE}. Here are the requirements:\n" + domains + "\n2. The $API_CALL should have only 1 line of code that calls api.\n3. The $API_PROVIDER should be the programming framework used.\n4. $EXPLANATION should be a step-by-step explanation.\n5. The $CODE is the python code.\n6. Do not repeat the format in your answer.
For API document retrieval, what is the search algorithm being used to get the top-1 match to the user's request? How does this compare to the design of vector storage for text embeddings and semantic retrieval for prompt injection?
Shishir G. Patil: We have three retrievers. One is an oracle retriever - in this setting, we give the model the correct API to use - to benchmark how good the systems would be if retrieval were “solved”. For the other two settings: with BM25, we use a simple cosine similarity search to get the top-1, and with GPT-Index, we use OpenAI’s Davinci-001 to generate the embeddings for the top-1 search. Given we had an oracle retriever to benchmark against and a zero-shot setting at the other end, we think all other retrievers would occupy a point in this continuous design space. We expect this space to heat up even more as we look at multi-modal data and documents with contexts 100X bigger than what we can handle today.
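As a point of reference, here is a minimal sketch of top-1 retrieval over API documentation with BM25. It assumes the rank_bm25 Python package and toy documents; it is not necessarily the retriever implementation the team used.

```python
# Minimal sketch of top-1 API doc retrieval with BM25 (toy corpus).
from rank_bm25 import BM25Okapi

api_docs = [
    "huggingface pipeline for automatic speech recognition, transcribe audio to text",
    "huggingface pipeline for translation between languages",
    "torchvision model for image classification on ImageNet",
]

tokenized_docs = [doc.lower().split() for doc in api_docs]
bm25 = BM25Okapi(tokenized_docs)

query = "I want to transcribe an audio recording into text".lower().split()
top1 = bm25.get_top_n(query, api_docs, n=1)[0]   # highest-scoring API doc
print(top1)
```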
Are there any downsides to AST sub-tree matching for mapping Gorilla's response to the appropriate API (e.g., leaf-node matching for optional parameters)? Were there any other methods tested, and do you see alternatives in the future?
Shishir G. Patil: AST sub-tree matching has been a popular choice in the programming languages (PL) research domain. It only works well when you can build an exhaustive tree to compare against. As you can tell, this is hard to generalize beyond structured data - what do you do for multimodal outputs? While we took a first step towards it, I do think quantifying hallucination is still an open problem in the wider domain.
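To make the idea concrete, here is a small sketch of AST sub-tree matching using Python's built-in ast module. This is my own illustration, not the project's evaluation code: a generated program counts as correct only if the reference API call appears as a sub-tree of its AST, so a hallucinated or misspelled call fails the check.

```python
# Illustrative AST sub-tree matching (not the Gorilla evaluation code):
# the generated code passes only if the reference API call appears as a
# sub-tree of its AST.
import ast

def nodes_match(candidate: ast.AST, reference: ast.AST) -> bool:
    """Recursively compare a candidate node against the reference node."""
    if type(candidate) is not type(reference):
        return False
    for field, ref_value in ast.iter_fields(reference):
        cand_value = getattr(candidate, field, None)
        if isinstance(ref_value, ast.AST):
            if not isinstance(cand_value, ast.AST) or not nodes_match(cand_value, ref_value):
                return False
        elif isinstance(ref_value, list):
            if not isinstance(cand_value, list) or len(cand_value) != len(ref_value):
                return False
            for c, r in zip(cand_value, ref_value):
                if isinstance(r, ast.AST):
                    if not isinstance(c, ast.AST) or not nodes_match(c, r):
                        return False
                elif c != r:
                    return False
        elif cand_value != ref_value:
            return False
    return True

def contains_subtree(code: str, reference_call: str) -> bool:
    """True if the reference API call appears anywhere in the code's AST."""
    ref = ast.parse(reference_call, mode="eval").body   # a Call node
    tree = ast.parse(code)
    return any(nodes_match(node, ref) for node in ast.walk(tree))

# The second output hallucinates the call's arguments, so it fails the check.
print(contains_subtree("model = pipeline(task='translation')",
                       "pipeline(task='translation')"))   # True
print(contains_subtree("model = pipeline('asr')",
                       "pipeline(task='translation')"))   # False
```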
How did you decide Gorilla's response format and were other formats considered? Today, you notice a lot of experiments forcing LLMs to respond in JSON for easier parsing. Was that an option?
Shishir G. Patil: Our initial prototype tried to force every API doc in our APIBench, every question, and every response to be fully qualified JSON. However, we found that enforcing this was beside the point, especially when you start including code in your responses - you can see how the double and single quotes in code start messing with the JSON format. To overcome this, we just introduced some markers to flag when a field starts or ends. This might be a hot take, but I think the community should just stop forcing LLMs to respond in JSON. It’s hard even for humans to come up with, and the websites that host JSON parsers also fail in the presence of complicated nested fields. So why force the LLMs?
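For illustration, here is a small sketch (my own, not the project's parser) of how such marker-delimited fields - in the <<<domain>>>/<<<api_call>>> style shown in the prompt above - can be extracted with a regular expression, sidestepping JSON parsing entirely.

```python
# Hypothetical parser for marker-delimited model output; illustrative only.
import re

def parse_marked_fields(response: str) -> dict:
    """Split a response on <<<field>>>: markers and return {field: text}."""
    parts = re.split(r"<<<(\w+)>>>:?", response)
    # re.split yields [preamble, field1, text1, field2, text2, ...]
    return {name: text.strip().strip(",").strip()
            for name, text in zip(parts[1::2], parts[2::2])}

example = ("<<<domain>>>: Audio Transcription, "
           "<<<api_call>>>: pipeline('automatic-speech-recognition'), "
           "<<<explanation>>>: Loads an ASR pipeline.")
print(parse_marked_fields(example)["api_call"])
```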
In your opinion, aside from Gorilla's API use case, is raw text output (unstructured) always the optimal response format? Or is there another parsable (structured) format for LLMs that works better than JSON?
Shishir G. Patil: After our experience, I am less and less convinced that we could generalize a single template for all LLM outputs. The successful models of the future are the ones that adapt well to responding in the format requested by the user. This can be enforced through prompts or through fine-tuning - either is fine. But making a bet one way or the other would be a mistake.
There are some concerns with allowing LLMs to execute code or even API calls. LLMs, including Gorilla, are just providing the instruction and it is up to the enclosed system (or user) to execute the instructions. Are there any safety checks or constraints that you recommend before executing Gorilla's instructions? How do you see safety and security shaping in this industry?
Joseph Gonzalez: This is one of the more interesting and challenging aspects of the research agenda. Currently, the model cannot actually invoke the API directly but instead gives the API call to a human. In the next iteration, we would like to allow the system to ask the human operator if it can run the call on the human’s behalf and make any necessary corrections in the presence of errors. While delegating approval to a human operator may appear to provide increased safety, the reality is that the LLM may still craft a malicious API call that the human operator doesn’t fully understand. The next avenue of research in this area is to build LLM technology to assess and communicate the consequences of API actions. For example, an API call to book a flight might note that “The LLM is about to book a non-refundable flight on your behalf.” This is also an area where the APIs (and more specifically their creators) could help to communicate consequences of their invocations.
How do you see Gorilla evolving in the future? Will there be additional narrow focused LLMs that you will experiment with? If so, in what areas?
Joseph Gonzalez: I would really like the work on Gorilla to expand to a broader set of tools that we use daily and enable the composition of these tools to complete more interesting tasks. In addition to addressing some of the safety challenges around tool use, I also would like to explore how Gorilla communicates its plans to human operators to help them be more informed about what the AI is doing on their behalf.
If autonomy is the goal, I can see Gorilla as a single LLM within a network of LLMs all tasked to perform towards a given objective. Is this the vision for the team as well? Why or why not?
Joseph Gonzalez: The future of LLMs will likely involve general LLMs coordinating task-specific LLMs. In some sense, Gorilla is an exploration of task-specific LLMs for API invocation. I imagine a future where general-purpose planning LLMs help users assemble the steps in some broader task (e.g., booking a vacation) and then delegate specific tasks to agent LLMs that may invoke multiple APIs to collect information, get additional guidance from the human operator, and ultimately complete the plan.
What are your thoughts on autonomous LLM agents and how do you see this space advancing?
Shishir G. Patil: I am super excited about the autonomous agents space. Chatbots in themselves are a great demonstration of the technology and what it enables, but in the long run I expect LLMs to be another set of tools in our toolbox - albeit a powerful one - that we will use to complete particular tasks, and agents will let us get there. Keep in mind, by autonomous agents I mean autonomy in terms of reach; they will still be assistants to, and of use to, humans.
In this fast-paced environment of LLM research and development, Gorilla LLM has introduced some novel approaches to today’s challenges, notably generating executable API calls at inference time and using AST sub-tree matching to measure accuracy. This will definitely lead to fresh ideas, potentially redefining the landscape of software application development.
Thanks to the team at UC Berkeley for their time and valuable insights in this interview!
Looking to interview startups or engineers building with LLMs. Reach out: samir@sudoapps.com.