Fine-Tuning LLMs for Multi-Agent Collaboration

Can you fine-tune a language model to be a better collaborator? That was the central question behind our multi-agent LLM project at Northeastern — and the answer turns out to be yes, definitively.

We built a grid-world environment with 7 tasks requiring multi-agent coordination, then fine-tuned LLaMA-3.1 8B with LoRA on successful collaboration trajectories. The fine-tuned 8B model matched or exceeded the performance of the base 70B model across most tasks.

The setup

Multiple LLM-powered agents operate in a shared grid environment. They take turns, can move in four directions, pick up and drop items. Crucially, agents don’t know each other’s positions — they can only communicate through an inbox messaging system, sending one message per turn.

We designed 7 tasks of increasing complexity:

Single-agent navigation — baseline, navigate to a goal
Corner multi-agent navigation — multiple agents must reach assigned corners
Alphabetical line task — agents must arrange themselves in name order
Random points navigation — coordinate to cover dynamically assigned targets
Single-agent pick up item — navigate to an item, grab it, deliver it
Multi-agent pick up items — coordinate item collection across agents
Multi-agent pick up with permissions — only authorized agents can grab specific items

Agent architecture

Each agent outputs a structured JSON response with fields for reflection, rationale, action, message, and memory. The reflection mechanism lets agents evaluate their own past actions before deciding what to do next. The message field enables inter-agent communication — agents discuss task assignments, share goals, resolve conflicts, and confirm task completion.

The communication protocol maps onto the standard LLM role structure: system role provides environment state and rules, user role provides current observations and inbox messages, assistant role is the agent’s response.

Training

We generated training data by running simulations and recording all agent interactions. Trajectories scoring below 50/100 were pruned, leaving 5.3M tokens for training and 1.6M for validation. We fine-tuned LLaMA-3.1 8B using LoRA (rank 8, alpha 16, dropout 0.2) for 6 epochs with a learning rate of 5e-5. Training loss dropped from 0.46 to 0.35.

Results

The results tell a clear story about what fine-tuning buys you:

Single-agent navigation: Base 8B scored ~60, 70B scored ~90, fine-tuned 8B scored 100
Corner navigation: Base 8B scored ~30, 70B scored ~65, fine-tuned 8B scored ~70
Alphabetical ordering: Base 8B scored ~10, 70B scored ~60, fine-tuned 8B scored ~70
Multi-agent pick up with permissions: Base 8B scored ~10, 70B scored ~90, fine-tuned 8B scored ~85

The pattern is consistent: the fine-tuned 8B model rivals or beats the 70B model across tasks, despite being 9x smaller. The largest gains come on the hardest coordination tasks — exactly where the base 8B model struggled most.

Takeaway

Collaboration isn’t just an emergent property of scale. It’s a trainable dimension. A small model with targeted fine-tuning on collaborative trajectories can match a model nearly an order of magnitude larger. The implication for building multi-agent systems: you don’t need massive models if you train specifically for coordination.

Read the full report (PDF) | GitHub