Paper review — Communicative Agents for Software Development

A detailed review of the “ChatDev” AI Agent paper

Screenshot of ChatDev paper cover

After reading and reviewing the Generative Agents paper, I decided to explore the world of AI coding Agents. My next stop on this journey was the paper titled “Communicative Agents for Software Development”, also known as ChatDev.

ChatDev presents an innovative approach to software development: it uses large language models to streamline the entire development process through nothing more than natural language communication between a human user and AI Agents.

As you can guess, this is an ambitious undertaking, which made the paper an equally exciting read.

At its core, ChatDev is a virtual, chat-powered software development company that brings together software agents to code, design, test, and produce a given application.

In this post we will explain the motivation behind this work, and then dive into the architecture of ChatDev. At the end we will present the findings from this paper, and share our own thoughts on this work. Let’s go!

Why do we want AI to build software applications?

To many, software is the magic of our world. Just like in mystical realms where wizards are able to cast spells to create physical objects, in our reality software engineers are able to create all sorts of programs that augment, automate, and enhance our lives.

Yet, building software is not trivial. It requires hard skills, team work, experience, intuition, and taste. It is also expensive.

These elements make it difficult to automate the creation of software.

Many individuals and businesses across the world would like to create software programs for profit and fun, but have neither the skills nor the capital to do so. This leaves us with huge unrealised potential and unmet opportunities that could improve people’s lives and enrich economies.

However, recent advances in artificial intelligence — specifically deep learning and large language models — now enable us to approach this challenge with moderate levels of success.

In ChatDev, the researchers set out the ambitious task of generating entire software programs by leveraging the power of large language models.

ChatDev architecture

ChatDev is a virtual, chat-powered software development company that mirrors the established waterfall model for building software. It does so by meticulously dividing the development process into four distinct chronological phases: designing, coding, testing, and documenting.

Each phase starts with the recruitment of a team of specialised software agents. For instance one phase could involve the recruitment of the CTO, Programmer, and Designer Agents.

Screenshot — Phase + Chat-Chains — from ChatDev paper

Each phase is further subdivided into atomic chats, which are linked together into what the paper calls a chat-chain. A chat-chain represents a sequence of intermediate task-solving chats, each between two agents. Each chat is designed to accomplish a specific goal, which counts towards the overarching objective of building the desired application.

The chats are chained sequentially so that the outcome of a previous chat between two AI Agents is propagated to a subsequent chat involving two other AI Agents.
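To make this structure concrete, here is a minimal Python sketch of a chat-chain. It is my own illustration, not ChatDev's actual code: the Chat and Phase classes, the run_chain function, and the lambda "solvers" standing in for LLM dialogues are all hypothetical, and the role pairings are only examples.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only; these are not ChatDev's real classes.

@dataclass
class Chat:
    """One atomic chat between exactly two roles, with a single goal."""
    goal: str
    instructor: str
    assistant: str
    # Stand-in for the LLM dialogue: maps the shared context to this chat's outcome.
    solve: Callable[[dict], str]

@dataclass
class Phase:
    name: str
    chats: list[Chat] = field(default_factory=list)

def run_chain(phases: list[Phase], task: str) -> dict:
    """Run every chat in order, passing each outcome on to the chats that follow."""
    context = {"task": task}
    for phase in phases:
        for chat in phase.chats:
            context[chat.goal] = chat.solve(context)
    return context

phases = [
    Phase("designing", [
        Chat("modality", "CEO", "CPO", lambda ctx: "Desktop Application"),
        Chat("language", "CEO", "CTO", lambda ctx: "Python"),
    ]),
    Phase("coding", [
        Chat("code", "CTO", "Programmer",
             lambda ctx: f"# {ctx['language']} source for a {ctx['modality']}"),
    ]),
]

result = run_chain(phases, "Build a Gomoku game")
print(result["code"])
```

The point of the sketch is the shared context dictionary: the coding chat can only do its job because the two designing chats ran first and left their conclusions behind.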

Addressing code hallucinations

One of the key challenges tackled by ChatDev is the issue of code hallucinations, which can arise when directly generating entire software systems using LLMs.

These hallucinations may include incomplete function implementations, missing dependencies, and undiscovered bugs. The researchers attribute this phenomenon to two primary reasons:

1. Lack of granularity and specificity: Attempting to generate all code at once, rather than breaking down the objective into phases such as language selection and requirements analysis, can lead to confusion for LLMs.

2. Absence of cross-examination and self-reflection: lack of adequate and targeted feedback on the work conducted by a given agent results in incorrect code generation which is not addressed by the LLM.

To address these challenges, ChatDev employs a novel approach that decomposes the development process into sequential atomic subtasks, each involving collaborative interaction and cross-examination between two roles.

This is an efficient framework that enables strong collaboration among agents, which leads to better quality control overall of the target software that the agents set out to build.

Anatomy of ChatDev phases

Each phase begins with a role specialisation step, where the appropriate agents are recruited for the phase and given the role they must assume.

Each chat in the chain involves two agents, each of which takes on one of the following roles:

  • instructor agent: initiates the conversation, and guides the dialogue towards completion of the task.
  • assistant agent: follows the instructions given by the instructor agent, and works towards completing the task.

The instructor and assistant cooperate via multi-turn dialogues until they agree that they have successfully completed the task.
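The instructor/assistant loop can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: run_chat, scripted, and the is_done callback are hypothetical names, and the "agents" simply replay canned lines where the real system would call an LLM.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)
# Hypothetical stand-in for an LLM call: sees the dialogue so far, returns the next message.
Agent = Callable[[List[Message]], str]

def run_chat(instructor: Agent, assistant: Agent,
             is_done: Callable[[str], bool], max_turns: int = 10) -> List[Message]:
    """Alternate instructor and assistant turns until the task is declared complete."""
    history: List[Message] = []
    for _ in range(max_turns):
        # The instructor always speaks first and steers the dialogue.
        history.append(("instructor", instructor(history)))
        reply = assistant(history)
        history.append(("assistant", reply))
        if is_done(reply):
            break
    return history

def scripted(lines: List[str]) -> Agent:
    """Build a fake agent that replays canned lines, one per turn."""
    remaining = iter(lines)
    return lambda history: next(remaining)

instructor = scripted(["Which language should we use?", "Agreed, please confirm."])
assistant = scripted(["I suggest Python.", "Done: Python"])

history = run_chat(instructor, assistant, is_done=lambda text: text.startswith("Done:"))
for role, text in history:
    print(f"{role}: {text}")
```

The max_turns cap is my addition; some upper bound is a sensible guard against two agents that never converge.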

Phase 1 — Designing

This phase involves the CEO, CTO, and CPO agents.

In this initial phase, role specialisation is achieved via inception prompting. Inception prompting is a technique from the CAMEL paper that aims to expand an original statement into a more specific prompt with clear objectives for the instructor and assistant agents to work on completing.

Screenshot — Example of Inception Prompting — from CAMEL paper

As in Generative Agents, a Memory Stream is also used in ChatDev. The Memory Stream contains the history of conversations for each phase, and for each specific chain.

Unlike the Memory Stream from Generative Agents, the ChatDev version neither employs a retrieval module nor implements memory reflection. This is probably due to the sequential nature of the phases and chains, which makes the flow of information from previous steps predictable and easy to access.
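Without retrieval or reflection, such a memory can be little more than an ordered log. Here is a minimal sketch of that idea; the MemoryStream class and its method names are my own assumptions, not taken from the ChatDev codebase.

```python
class MemoryStream:
    """Append-only conversation log keyed by (phase, chat). No retrieval, no reflection."""

    def __init__(self):
        self._log = {}

    def record(self, phase, chat, role, message):
        # Messages are stored in the order they were uttered.
        self._log.setdefault((phase, chat), []).append((role, message))

    def history(self, phase, chat):
        # Return a copy of everything said so far in this chat, oldest first.
        return list(self._log.get((phase, chat), []))

memory = MemoryStream()
memory.record("designing", "modality", "CEO", "What form should the product take?")
memory.record("designing", "modality", "CPO", "<MODALITY>: Desktop Application")
print(memory.history("designing", "modality")[-1])
```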

To complete a chat, the instructor and assistant signal agreement at the end of a multi-turn conversation by each uttering the same message in a fixed format, e.g. “<MODALITY>: Desktop Application”.

A self-reflection mechanism is used when both agents have reached consensus without using the expected string to end their chat. In this case the system creates a pseudo-self of the assistant and initiates a fresh chat between the two (see the image below for more details).

Screenshot — Steps in Designing phase — from ChatDev paper

In this chat the pseudo-self asks the assistant to summarise the conversation history between the assistant and the instructor so that it can extract the conclusive information from the dialogue.
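Here is how I picture the termination check and its self-reflection fallback, as a rough Python sketch. The regular expression, the conclude function, and the fake_pseudo_self summariser are all assumptions of mine for illustration; in the real system the summary would come from an LLM chat, not a hard-coded function.

```python
import re
from typing import Callable, List, Optional, Tuple

# Matches a conclusion such as "<MODALITY>: Desktop Application".
CONCLUSION = re.compile(r"^<([A-Z_]+)>:\s*(.+)$")

def parse_conclusion(message: str) -> Optional[Tuple[str, str]]:
    """Return (tag, value) if the message is in the expected format, else None."""
    match = CONCLUSION.match(message.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).strip()

def conclude(instructor_msg: str, assistant_msg: str, history: List[str],
             summarise: Callable[[List[str]], str]) -> str:
    """Use the agreed conclusion if both agents uttered it, else fall back to a summary."""
    a = parse_conclusion(instructor_msg)
    b = parse_conclusion(assistant_msg)
    if a is not None and a == b:
        return a[1]
    # Self-reflection fallback: ask a "pseudo-self" to extract the conclusion.
    return summarise(history)

summaries_requested = []

def fake_pseudo_self(history: List[str]) -> str:
    # Hypothetical stand-in for the pseudo-self chat; a real one would query an LLM.
    summaries_requested.append(history)
    return "Desktop Application"

agreed = conclude("<MODALITY>: Desktop Application",
                  "<MODALITY>: Desktop Application", [], fake_pseudo_self)

chat = ["A desktop app seems best.", "Yes, I agree."]
reflected = conclude(chat[0], chat[1], chat, fake_pseudo_self)
print(agreed, "|", reflected, "|", len(summaries_requested))
```

In the first call the two messages match, so the summariser is never invoked. In the second, the agents agree in substance but not in format, so the fallback runs exactly once.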

Phase 2 — Coding

This phase involves the CTO, Programmer and Designer agents.

The coding phase is further decomposed into the following chats:

Generate complete codes: the CTO instructs the Programmer to write code based on the specifications that came out of the designing phase. These specifications include the programming language of choice (e.g. Python) and of course the type of application to build. The Programmer dutifully generates the code.

Devise graphical user interface: the Programmer instructs the Designer to come up with the relevant UI. The Designer in turn proposes a friendly graphical user interface with icons for user interactions using text-to-image tools (e.g. diffusion models like Stable Diffusion or OpenAI’s DALL·E). The Programmer then incorporates those visual assets into the application.

ChatDev generates code using object-oriented programming languages like Python because of their strong encapsulation and reuse through inheritance. Additionally, the system only shows agents the latest version of the code, and removes previous incarnations of the codebase from the memory stream in order to reduce hallucinations.
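The "latest version only" rule is easy to picture as a store that overwrites files by name. The sketch below is an assumption about how one might implement it; Codebase, update, and as_prompt are names I made up, not ChatDev's.

```python
class Codebase:
    """Keeps only the newest content of each file; older versions are discarded."""

    def __init__(self):
        self.files = {}

    def update(self, changes):
        # Overwrite by filename, so a previous version never lingers.
        self.files.update(changes)

    def as_prompt(self):
        # What an agent would be shown: current files only, in a stable order.
        return "\n\n".join(f"{name}\n{body}" for name, body in sorted(self.files.items()))

code = Codebase()
code.update({"main.py": "print('v1')"})
code.update({"main.py": "print('v2')", "board.py": "class Board: ..."})
prompt = code.as_prompt()
print(prompt)
```

After the second update, nothing from the first version of main.py can reach the prompt, so an agent cannot be confused by stale code.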

Screenshot — Thought Instruction — from ChatDev paper

To further combat hallucinations, thought instruction is employed. In thought instruction, roles between agents are temporarily swapped. For instance the CTO and the Programmer are swapped for a moment. In this case the CTO inquires about unimplemented methods, allowing the programmer to focus on specific portions of the codebase.

Essentially, with thought instruction, a single big task (e.g. implementing all non-implemented methods) is broken down into smaller ones (e.g. implement method 1, then implement method 2, etc.). Thought instruction is itself derived from chain-of-thought prompting.
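To illustrate the decomposition, the sketch below finds stub methods in a piece of source code and turns each into its own small instruction. Note the substitution: in ChatDev the agents discover unimplemented methods through dialogue, whereas I use Python's ast module as a deterministic stand-in. The function names and the sample Board class are hypothetical.

```python
import ast

SOURCE = '''
class Board:
    def __init__(self, size):
        self.size = size

    def place(self, row, col):
        pass

    def has_winner(self):
        raise NotImplementedError
'''

def _is_stub(func: ast.FunctionDef) -> bool:
    """True if the body is only `pass` or `raise NotImplementedError` (docstring allowed)."""
    body = func.body
    first = body[0] if body else None
    if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)):
        body = body[1:]  # skip the docstring
    if not body:
        return True
    if len(body) != 1:
        return False
    stmt = body[0]
    if isinstance(stmt, ast.Pass):
        return True
    if isinstance(stmt, ast.Raise) and stmt.exc is not None:
        target = stmt.exc.func if isinstance(stmt.exc, ast.Call) else stmt.exc
        return isinstance(target, ast.Name) and target.id == "NotImplementedError"
    return False

def unimplemented_methods(source: str) -> list[str]:
    """Return 'Class.method' names for every stub method found in the source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and _is_stub(item):
                    names.append(f"{node.name}.{item.name}")
    return names

# One small, focused instruction per stub instead of one big "implement everything".
instructions = [f"Implement {name}." for name in unimplemented_methods(SOURCE)]
print(instructions)
```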

Phase 3 — Testing

The testing phase involves integrating all components into a system and using feedback messages from an interpreter for debugging. This phase engages three roles: the Programmer, the Reviewer, and the Tester.

The following chats are involved:

Peer review: the Reviewer agent examines the source code to identify potential issues without running it (static debugging). The Reviewer agent attempts to spot obvious errors, omissions, and code that could be better written.

System testing: the Tester agent verifies the software execution through tests conducted by the Programmer agent using an interpreter (dynamic debugging), focusing on evaluating application performance through black-box testing.

Here again, thought instruction is employed to debug specific parts of the program, where the Tester analyses bugs, proposes modifications, and instructs the Programmer accordingly.
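The dynamic debugging step boils down to running the program and feeding the interpreter's error output back into the conversation. Below is a minimal sketch of that feedback capture using only the standard library. The run_and_collect_feedback function is my own hypothetical helper, and a subprocess with a timeout is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_and_collect_feedback(source: str, timeout: float = 10.0) -> str:
    """Run the source in a fresh interpreter; return '' on success, else the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "main.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return f"Timed out after {timeout} seconds"
    return "" if proc.returncode == 0 else proc.stderr.strip()

feedback = run_and_collect_feedback("import not_a_real_module_xyz\n")
# The last traceback line is usually the most useful summary to hand to an agent.
print(feedback.splitlines()[-1])
```

An empty string means the program exited cleanly; anything else is the traceback that a Tester-like agent could analyse before instructing the Programmer.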

Additionally, ChatDev allows human clients to provide feedback and suggestions in natural language, which are incorporated into the review and testing processes.

Phase 4 — Documenting

The documenting phase comprises the generation of environment specifications and user manuals for the software system. This phase engages four agents: the CEO, the CPO, the CTO, and the Programmer.

Using few-shot prompting with in-context examples, the agents generate various documentation files.

The CTO instructs the Programmer to provide configuration instructions and dependency requirements (e.g., requirements.txt for Python), while the CEO communicates requirements and system design to the CPO, who generates a user manual.

Screenshot — Steps in Documenting Phase — from ChatDev paper

Large language models are used to generate the documentation based on the prompts and examples provided, resulting in a comprehensive set of documentation files to support the deployment and usage of the software system.

Evaluation and Observations

In an evaluation with 70 software tasks, ChatDev demonstrated impressive results:

  • It generated an average of 17.04 files per software, including code files, asset files created by the designer, and documentation files.
  • The generated software typically ranged from 39 to 359 lines of code, with an average of 131.61 lines, partly due to code reuse through object-oriented programming.
  • Discussions between reviewers and programmers led to the identification and modification of nearly 20 types of code vulnerabilities, such as “module not found,” “attribute error,” and “unknown option” errors.
  • Interactions between testers and programmers resulted in the identification and resolution of more than 10 types of potential bugs, with the most common being execution failures due to token length limits or external dependency issues.
  • The average software development cost with ChatDev was $0.2967, significantly lower than traditional custom software development companies’ expenses.
  • It took 409.84 seconds on average to develop small-sized software. This of course compares favourably against the weeks (or months) expected to build a similar application with a human software company.

Screenshot — Analysis of the time ChatDev takes to produce software — from ChatDev paper

Limitations acknowledged by researchers

While these results are encouraging, the researchers acknowledged several limitations.

Even using a low temperature (e.g. 0.2), the researchers still observed randomness in the generated code output. This means the code for the same application may vary between runs. The researchers thus admitted that at this stage ChatDev is best used to brainstorm or for creative work.

Sometimes, the software doesn’t meet the user needs due to poor UX or misunderstood requirements.

Moreover, the lack of visual and style consistency from the Designer agent can be jarring. This occurs because it still remains difficult to generate visual assets that are consistent with a given style or brand across runs (this may be addressed with LoRAs now).

Screenshot — Example Gomoku game generated by ChatDev — from ChatDev paper

The researchers also highlighted current biases in LLMs, which lead to the generation of code that does not look like anything a human developer would write.

Finally, the researchers remarked that it is difficult to fully assess the software produced by ChatDev using their resources. A true evaluation of the produced applications would require the participation of humans, ranging from:

  • software engineers
  • designers/UX experts
  • testers
  • users

Personal critique

Personally, I would also like to express my reservations about some of this work, despite it being a very exciting development.

Firstly, most software teams these days operate under the Agile development method, which allows for more flexibility in the face of changing user requirements for instance. Waterfall, while used for some projects, is not the norm these days. It would be interesting to see how ChatDev could be iterated on to embrace a more dynamic software development lifecycle.

I would recommend we replace inception prompting with a more direct and refined prompt that comes directly from the user. Inception prompting could make up requirements or not fully capture the intent of the end user.

The model used at the time (GPT-3.5 Turbo) only had a 16K-token context window, which severely limits the scope and complexity of applications that could be built using ChatDev.

It seems the code produced by ChatDev is not executed inside a sandbox, but directly on the user’s machine. This poses many security risks that should be addressed in the future.

Animation — ChatDev Visualisation Tool — from ChatDev source code

ChatDev didn’t really work for me. When I tried to run it to generate a chess game, it certainly produced some code, but upon running it I just saw a blank desktop application. This could have been because I was on Python 3.12, whereas Python 3.8 is used in the paper.

Closing Thoughts

ChatDev represents an exciting step towards realising the vision of building agentic AI systems for software development. By using a multi-phase process that leverages large language models with memory and reflection capabilities, ChatDev demonstrates the potential for efficient and cost-effective software generation.

While there are still challenges to overcome, such as addressing the underlying language model’s biases and ensuring systematic robustness evaluations, the ChatDev paradigm represents a glimpse into the exciting possibilities that lie ahead as we continue to push the boundaries of what AI can achieve.

If you’re curious about AI Agents and would like to explore this field further, I highly recommend giving the ChatDev paper a read. You can access it here.

Additionally, the researchers have open-sourced a diverse dataset named SRDD (Software Requirement Description Dataset) to facilitate research in the creation of software using natural language. You can find the dataset here.

As for me, I will continue my exploration of AI Agents: tinkering with my own Python AI Agent library, reading more papers, and sharing my thoughts and discoveries through daily posts on Twitter/X.

Feel free to follow me there to join the conversation and stay updated on the latest developments in this exciting field!