How to Use Open-Source LLMs to Power a Reliable Team of AI Agents?

Artem Goncharov
10 min read · Jun 2, 2024

A practical, reliable demonstration of AI agents working on a task with open-source LLMs

In this article, I’m going to:

  1. Provide foundational information about AI agents, open-source LLMs, frameworks, and tooling around them.
  2. Create a simple but powerful working example with a team of agents powered by open-source LLMs (no need to pay for any API).
  3. Share some nuances and tricks that help agents work more consistently.

Understanding AI Agents and Tools

Think of agents as autonomous programs that excel at solving specific tasks using tools and an LLM as a universal solver. Tools are programs that enable agents to interact with the external world (like performing internet searches or checking new emails). When an agent needs to act, it sends a complex prompt to the LLM. Different frameworks use various prompting techniques, with ReAct being one of the most popular.
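
To make this concrete, a ReAct-style prompt roughly follows the structure below (a simplified illustration of the pattern, not CrewAI's exact template):

react_prompt = """You have access to the following tools: {tool_descriptions}

Use the following format:

Thought: reason about what to do next
Action: the tool to use, one of [{tool_names}]
Action Input: the input to the tool as JSON
Observation: the result of the tool call
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the answer to the original task

Begin!

Question: {task_description}
"""

The framework parses the Action and Action Input from the LLM output, runs the requested tool, appends the result as an Observation, and calls the LLM again until the model emits a Final Answer.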

In the CrewAI framework, they use the ReAct pattern and provide two additional tools: “ask co-worker” and “delegate task to co-worker.” These tools are handy for organizing communication between agents, allowing them to solve tasks together using different sets of tools. In the AutoGen framework, something similar can be achieved with GroupChat, but it requires a chat manager that decides which agent speaks next.
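
For comparison, here is roughly what a GroupChat setup looks like in AutoGen (a sketch based on the pyautogen API; the agent names and llm_config here are illustrative):

import autogen

# Point AutoGen at any OpenAI-compatible endpoint (illustrative local config)
llm_config = {"config_list": [{"model": "local-model", "base_url": "http://localhost:1234/v1", "api_key": "NA"}]}

writer = autogen.AssistantAgent(name="writer", llm_config=llm_config)
researcher = autogen.AssistantAgent(name="researcher", llm_config=llm_config)
user = autogen.UserProxyAgent(name="user", human_input_mode="NEVER", code_execution_config=False)

# The chat manager decides which agent speaks next on every round
group_chat = autogen.GroupChat(agents=[user, writer, researcher], messages=[], max_round=10)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

user.initiate_chat(manager, message="Write a post about AI in Interior Design")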

In this article, I’ll be using CrewAI as the main framework because it allows the use of tools with almost any open-source LLM.

Programming model

The programming model for using agents usually involves declaratively defining a graph of agents, tools, and tasks, plus the interactions/connections between them, and then running this graph until the main task is completed.

The image demonstrates one possible graph with two agents assigned three tasks. The first agent can use tools 1 and 3 and can delegate work to the second agent, which can use tools 1 and 2. I didn’t include memory and other agent attributes, as they are not required for most tasks and would only complicate the article.

The execution of the graph is driven by the sequence of tasks (the workflow definition), which are usually done sequentially (though there is an option to run them in parallel or even as a Directed Acyclic Graph, or DAG). The agent assigned to the task is responsible for completing it using calls to the LLM, tools, or by delegating part of the task to other agents. Once an agent decides the task is finished, the next task starts immediately, and this process repeats until all tasks are completed.
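
Conceptually, the driver for the sequential case is a simple loop (pseudocode only, not CrewAI's actual implementation):

# Pseudocode: sequential execution of the task list
context = ""
for task in tasks:
    # the assigned agent runs its own loop of LLM calls, tool calls,
    # and (optionally) delegation until it considers the task done
    context = task.agent.execute(task, context=context)
print(context)  # the output of the last task is the final result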

Infrastructure

At the module level, there are four main components:

  1. Graph Description: This includes agents, tools, and tasks, along with any custom tools you may want to add for additional functionality. In its simplest form, it’s just one Python file with the graph description.
  2. Agents Team Runner: This is the agents framework where the graph runs, including calls to the LLM and invoking tools. It may have a layered design, as seen with CrewAI, which depends on the Langchain framework.
  3. LLM Server: This server handles inference using the chosen LLM. It can be a remote server, like OpenAI’s, or a local one, like ollama, LM Studio, vllm, or even a custom solution using llama.cpp or MLX (for Apple Silicon). Some servers offer more features than others, such as hosting multiple models simultaneously or supporting OpenAI functions/tools protocol.
  4. LLM (Large Language Model): This can be hidden behind remote APIs like OpenAI’s or downloaded from Huggingface and chosen explicitly. Some LLM servers allow you to choose and download models automatically using their UI or command line. Typically, LLMs are stored on Huggingface as multiple files containing the model’s weights. If running locally, you may want to choose a model that suits your PC hardware.

Additionally, there are multiple services you can use through tools. For instance, Serper, which allows for internet searches.
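
Such a tool is typically just a thin wrapper around an HTTP API. As an illustration (a sketch only; the search_internet tool in the example repo may be implemented differently), a Serper call looks roughly like this:

import json
import os

import requests

def search_internet(query: str) -> str:
    """Search the internet via Serper and return the top results as plain text."""
    response = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"], "Content-Type": "application/json"},
        data=json.dumps({"q": query}),
    )
    results = response.json().get("organic", [])
    return "\n".join(f"{r['title']}: {r['snippet']} ({r['link']})" for r in results[:5])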

Finally, the real example

Let’s set a task: We want to create a team of agents to write posts about a given topic.

There will be two agents:

  • A writer who will write the post.
  • A researcher who will find the relevant information.

We’ll create only one task: to write a post about a given topic (in this example, “The most interesting cases of using AI in Interior Design in 2024”). In the default example from the CrewAI readme file on GitHub, they created two tasks — one for the researcher and one for the writer. However, I believe there should be one task, with the writer agent pulling information from the researcher. This approach demonstrates delegation effectively, and allows the writer agent to ask the researcher agent for clarifications, leading to a more nuanced final post if the team decides to focus on a specific aspect of the initial topic.

Next, we’ll add an internet search tool to the researcher agent so it can perform search queries.

Finally, we’ll enable delegation for the writer agent so it can delegate tasks to the researcher agent.

The GitHub repository with the full example is here. Instructions on how to run it are in the readme.

Please note that I use my own fork of the CrewAI repository in this example because I fixed some issues related to using open-source models and am waiting for my PRs to be merged. Once they are merged, I’ll update the example to use the main repo and remove this note from the article.

Code walkthrough

Let me briefly walk you through the code.

Firstly, we create an LLM client, which will send requests to the LLM server and receive LLM responses. We use the OpenAI client because it’s a standard that most servers support, allowing you to switch local servers without changing your code (for OpenAI, you need to use a real API key and the proper base URL).

from langchain_openai import ChatOpenAI  # Langchain's OpenAI-compatible chat client

llm = ChatOpenAI(
    model=model_name,  # use the full name of the model with the .gguf extension for LM Studio
    base_url=api_base,  # "http://localhost:1234/v1/"
    api_key=api_key,  # doesn't matter for LM Studio
    temperature=0.01,
)

Then we define agents:

researcher = Agent(
    role='Senior Research Analyst',
    goal='Uncover cutting-edge developments in AI and data science related to Interior Design',
    backstory="""You work at a leading tech think tank.
Your expertise lies in identifying emerging trends.
You have a knack for dissecting complex data and presenting actionable insights.
You can use internet search tool for search but you CAN'T use any tools for reading articles,
so please use ONLY the information from internet search!
""",
    verbose=True,
    allow_delegation=False,
    tools=[SearchTools.search_internet],
    llm=llm,
)

writer = Agent(
    role='Tech Content Strategist in Interior Design sphere',
    goal='Craft compelling content on tech advancements',
    backstory="""You are a renowned Content Strategist, known for your insightful and engaging articles.
You transform complex concepts into compelling narratives.
""",
    verbose=True,
    allow_delegation=True,
    llm=llm,
    max_iter=3,
)

It’s pretty straightforward — just add the role, which serves both as a name and a description of the agent’s function, and write a goal and backstory. We updated the backstory of the researcher agent compared to the default example by instructing it to use only internet search, as it sometimes tried to use a non-existent tool to open URLs. If you’d like, you can add such a tool from here.
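
For reference, a minimal version of such a tool might look like the sketch below (hypothetical code; CrewAI tools here are plain Langchain tools, and the version in the linked examples is more robust):

import requests
from langchain.tools import tool

class BrowserTools:

    @tool("Open a web page")
    def open_url(url: str) -> str:
        """Fetch the page at the given URL and return its raw text content."""
        response = requests.get(url, timeout=30)
        return response.text[:4000]  # truncate so the model's context window isn't flooded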

We enabled allow_delegation for the writer and disabled it for the researcher to prevent the researcher from calling the writer back, thus introducing a working hierarchy. We assigned the same LLM client to each agent (though you can create multiple LLM clients and assign different clients to different agents) and set max_iter to 3 to avoid endless cycling in case of severe LLM hallucinations.

Finally, this is how we define a task and a crew:

task = Task(
    description="""Develop an engaging blog
post that highlights the most interesting cases of using AI in Interior Design in 2024.
Your post should be informative yet accessible, catering to a tech-savvy audience.
Make it sound cool, avoid complex words so it doesn't sound like AI.
If you need some actual information ask your colleagues.""",
    expected_output="Full blog post of at least 4 paragraphs",
    agent=writer,
    max_iter=3,
)

# Instantiate your crew with a sequential process
crew = Crew(
    agents=[researcher, writer],
    tasks=[task],
    verbose=2,
)

This is a very simple definition of the task and the crew. We set the verbose property to 2 so we can see logs and understand what happens.

The last piece related to coding is to create a .env file and add environment variables:

OPENAI_API_BASE = "http://localhost:1234/v1/"
OPENAI_API_KEY = "NA" # don't need it for LM Studio
OPENAI_MODEL_NAME = "model name" # add the full model name here
SERPER_API_KEY = "api key"

I assume you will be using a local LLM server, so I added the local URL. The only two things you need to change are the model name and the Serper API key (you can register at https://serper.dev and, at the time of writing, get 2,500 free searches). This key is used by the search_internet tool, which calls that service.
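
The example script picks these variables up at startup; if you wire things up yourself, loading the .env file takes a couple of lines with python-dotenv (a minimal sketch, assuming python-dotenv is installed):

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ

api_base = os.environ["OPENAI_API_BASE"]
api_key = os.environ["OPENAI_API_KEY"]
model_name = os.environ["OPENAI_MODEL_NAME"]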

That’s it for the coding part.

Infrastructure Setup

Let’s talk about infrastructure. There are several LLM servers you can run locally, such as ollama, LM Studio, or vllm. As a Mac user, I prefer LM Studio because it has an intuitive UI, lets you download numerous models from Huggingface, and, most importantly for us, lets you run a local server and watch requests from your agents in real time, which is extremely useful for debugging.

Once you’ve installed the LLM server, you need to choose a model. I have successfully used models specifically trained to support function calls, like Mistral Instruct 0.3 and Phi3 Medium Instruct. Llama 3 Instruct is also good, but I noticed it requires more iterations to finish a task because it often makes mistakes in tool parameters (keys).

I recommend setting the default temperature to 0, or something very close to zero, so the LLM will follow your prompts more strictly. Additionally, it’s better to set the maximum number of output tokens to around 1000 (or whatever suits you). This protects against models that get stuck in endless, cycling responses.
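
You can set both in the LM Studio server settings, or enforce them on the client side when creating the LLM client (a sketch; max_tokens is a standard parameter of the OpenAI-compatible client):

llm = ChatOpenAI(
    model=model_name,
    base_url=api_base,
    api_key=api_key,
    temperature=0,    # follow the prompts strictly
    max_tokens=1000,  # hard cap on output length to avoid endless responses
)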

With your LLM server installed and a proper model loaded, it’s time to get the crew to work. Use the following commands in the terminal to set up dependencies, activate the Python environment, and run the script:

poetry install 
poetry shell
python example_crew.py

You’re going to see something like this:

> Entering new CrewAgentExecutor chain...

Action: Delegate work to coworker
Action Input: {"task": "Research AI applications in Interior Design", "context": "I need information about how AI is being used in Interior Design this year. Can you provide me with some examples of interesting cases?", "coworker": "Senior Research Analyst"}

> Entering new CrewAgentExecutor chain...

Thought: What are some AI applications in Interior Design?
Action: Search internet
Action Input: {'query': 'AI applications in Interior Design'}

This means the writer agent started working and immediately delegated the research to the researcher agent, which in turn invoked the Search internet tool.

The researcher will perform a search and return the results to the writer, who will then produce something like this:

The Future of Interior Design: How AI is Revolutionizing the Industry

As we move into 2024, it's clear that Artificial Intelligence (AI) is having a profound impact on various industries, including interior design. With its ability to analyze large amounts of data and generate realistic renderings of spaces, AI is revolutionizing the way designers work.

One of the most exciting applications of AI in interior design is the use of software that can create realistic renderings of spaces based on inputs such as room size, style, and materials. This allows designers to generate design ideas and layouts quickly and efficiently, without having to spend hours sketching out designs by hand.

Another area where AI is making a big impact is in the analysis of large volumes of data. By analyzing trends, customer preferences, and information about materials and products, AI can help designers make informed decisions about their designs. This includes everything from choosing the right colors and fabrics to selecting the perfect furniture pieces.

Some of the most interesting cases of AI applications in interior design this year include:

* VisualizeAI: A software that uses AI to create realistic renderings of spaces and allows designers to generate design ideas and layouts based on inputs such as room size, style, and materials.
* Homevisualizer AI: A tool that uses machine learning algorithms to analyze large volumes of data, such as design trends, customer preferences, and information about materials and products, which allows designers to make informed decisions.
* REimagine Home AI: A software that uses AI to generate realistic renderings of spaces and allows designers to create custom designs based on inputs such as room size, style, and materials.

These are just a few examples of the many AI applications in interior design this year. As the technology continues to evolve, it's clear that AI will play an increasingly important role in the industry.

I’m not going to judge the quality of the post; instead, let’s discuss another important topic: debugging.

Debugging

When using OpenAI models and their API, everything usually works fine. However, when we switch to open-source models, we need to be prepared for some small issues, especially with tool calls and work delegation. What can we use to investigate why the LLM stubbornly returns the wrong key in a tool call or refuses to follow our prompt?

We already saw the verbose setting in agents and in the crew, which I highly recommend leaving enabled for at least a few initial runs.

Even with all that verbosity, requests and responses from the LLM server are not displayed in the terminal. To see agent requests to the LLM, you can check real-time logs in LM Studio or use LLM callbacks in code. In LM Studio, you will also be able to see additional information, like stop words, which can be very helpful in some cases. To see raw LLM responses (and requests), we need to add some code. The LLM client has three useful callbacks:

  • on_llm_start: When the request to the LLM server is about to be sent.
  • on_llm_new_token: When a new token comes from the server if streaming is enabled.
  • on_llm_end: When the response from the LLM server is received.

We can add an additional callback handler to print the raw response (and request if needed). There is a response_logger file in my example project, and we can plug it in like this:

llm.callbacks = []
llm.callbacks.append(ResponseLogger())

The code is already there — just commented out.
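
If you prefer to write your own handler rather than use the one from the repo, the core of it is just a Langchain callback handler (a minimal sketch; the actual response_logger in the example project may differ in detail):

from langchain_core.callbacks import BaseCallbackHandler

class ResponseLogger(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        # the raw prompt(s) about to be sent to the LLM server
        print("=== LLM request ===")
        for prompt in prompts:
            print(prompt)

    def on_llm_end(self, response, **kwargs):
        # the raw response received from the LLM server
        print("=== LLM response ===")
        print(response.generations[0][0].text)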

That’s it. I hope this will be helpful in your projects with AI Agents!

In the next article, I’ll be creating a crew of agents that will help me with some chores. I started coding and faced a dilemma — should I trust the agent more than my code or vice versa :)

Stay tuned!
