Building an Academic Research Assistant with OpenAI's Agents SDK

Learn how to build a sophisticated AI assistant that helps researchers access, organize, and analyze academic literature using OpenAI's new Agents SDK. This multi-agent system streamlines the research process through specialized components.

In the rapidly evolving landscape of artificial intelligence, OpenAI has recently introduced the Agents SDK, a powerful tool that enables developers to build agentic AI applications with minimal abstractions. This SDK represents a significant advancement in creating AI systems that can reason, plan, and take actions to accomplish complex tasks.

In this article, we'll explore how to build a sophisticated Academic Research Assistant using OpenAI's Agents SDK. This agent will help researchers access, organize, and analyze academic literature, saving valuable time and enhancing the research process.

Understanding OpenAI's Agents SDK

Before we dive into building our Research Assistant, let's understand the core concepts of OpenAI's Agents SDK:

Core Components

  1. Agents: LLMs equipped with specific instructions and tools. They represent AI models configured with specialized capabilities, knowledge, and behaviors.

  2. Handoffs: Allow agents to delegate tasks to other specialized agents. This creates a modular system where each agent excels at particular tasks.

  3. Guardrails: Validate agent inputs and outputs, ensuring agents operate within defined boundaries. They can include "tripwires" that halt execution when triggered.

  4. Tracing: Built-in capabilities to visualize and debug agent flows, essential for monitoring behavior during development and production.

The Agents SDK is designed with simplicity and flexibility in mind, offering enough features to be valuable while keeping primitives minimal for quick learning.

The Academic Research Assistant: Use Case Overview

Our Academic Research Assistant will help researchers:

  1. Search for relevant literature using semantic search across academic databases
  2. Summarize papers to quickly grasp key findings
  3. Organize research into structured notes and bibliographies
  4. Answer questions about specific topics using the latest research
  5. Generate research insights by analyzing patterns across multiple papers

This assistant will be particularly valuable for researchers dealing with the overwhelming volume of academic literature published daily, helping them stay current in their field without sacrificing depth of understanding.

Project Structure

For this project, we'll organize our code in a modular structure. Here's how our research-assistant-agent folder is structured. The complete code is available on GitHub at github.com/aubreyzulu/portifolio/tree/main/portifolio/research-assistant-agent.

research-assistant-agent/
├── requirements.txt          # Dependencies for the project
├── research_assistant.py     # Core implementation with agents and tools
├── run.py                    # Command-line interface launcher
├── web_interface.py          # Flask-based web UI implementation
├── example.py                # Example usage scenarios
└── README.md                 # Project documentation

Each file has a specific purpose:

  1. requirements.txt: Lists all dependencies, including openai-agents, scholarly, pymupdf, urllib3, pandas, and Flask.

  2. research_assistant.py: Contains the main implementation of our research assistant, including:

    • Tool implementations (search_papers, extract_paper_text, etc.)
    • Agent definitions (search_agent, summary_agent, etc.)
    • Memory management for contextual conversations
    • Guardrails for input validation
    • Core functionality to run the research assistant
  3. run.py: Provides a command-line interface to:

    • Check dependencies and install if missing
    • Verify the OpenAI API key is set
    • Launch either the CLI or web interface
    • Handle user interaction in command-line mode
  4. web_interface.py: Implements a Flask-based web interface for:

    • A user-friendly way to interact with the research assistant
    • Handling research queries asynchronously
    • Displaying formatted results
  5. example.py: Demonstrates usage scenarios with sample research queries.

Project Setup

Let's start by setting up our project. We'll need to install the OpenAI Agents SDK and other dependencies:

# File: /research-assistant-agent/setup.sh
# Create a virtual environment
python -m venv research-assistant-env
source research-assistant-env/bin/activate  # On Windows: research-assistant-env\Scripts\activate

# Install dependencies
pip install openai-agents
pip install scholarly  # For accessing Google Scholar
pip install pymupdf  # For PDF processing
pip install urllib3  # For HTTP requests
pip install pandas  # For data organization

Don't forget to set up your OpenAI API key, which the SDK reads from the OPENAI_API_KEY environment variable:

# File: /research-assistant-agent/.env
OPENAI_API_KEY=your-api-key-here
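
A .env file only takes effect once something loads it into the process environment. The following is a minimal sketch of doing that with the standard library alone. The helper names parse_env and load_env_file are hypothetical (they are not part of the SDK or the project), and the parser only handles simple KEY=value lines; the python-dotenv package is a more complete alternative.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path: str = ".env") -> None:
    """Copy values from the file into os.environ without overriding existing ones."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for key, value in parse_env(env_path.read_text()).items():
        os.environ.setdefault(key, value)
```

Calling load_env_file() at the top of the entry script, before any agent runs, would make the key visible to the SDK.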

Let's now dive into building the research assistant components:

Building the Research Assistant Agent System

Now, let's build our Academic Research Assistant using a multi-agent approach. While this article shows the components in separate files for clarity, the actual project implementation consolidates most of them in research_assistant.py for simplicity.

1. Define Agent Specializations

We'll create four specialized agents:

  1. Search Agent: Responsible for finding relevant papers
  2. Summary Agent: Creates concise summaries of academic papers
  3. Organization Agent: Structures research findings
  4. Analysis Agent: Identifies patterns and generates insights

2. Implementation of the Agents

Let's implement our main agent system:

# File: /research-assistant-agent/tools.py
from agents import function_tool
from scholarly import scholarly  # the package exposes its client object as `scholarly.scholarly`
import fitz  # PyMuPDF
import urllib.request
import os

# Tool to search for academic papers
@function_tool
def search_papers(query: str, num_results: int = 5):
    """Search for academic papers on a given topic."""
    search_query = scholarly.search_pubs(query)
    results = []
    
    for i in range(min(num_results, 10)):  # Limit to max 10 papers
        try:
            paper = next(search_query)
            results.append({
                "title": paper.get("bib", {}).get("title", "Unknown Title"),
                "authors": ", ".join([author for author in paper.get("bib", {}).get("author", [])]),
                "year": paper.get("bib", {}).get("pub_year", "Unknown Year"),
                "abstract": paper.get("bib", {}).get("abstract", "No abstract available"),
                "url": paper.get("pub_url", "No URL available"),
                "citations": paper.get("num_citations", 0)
            })
        except StopIteration:
            break
    
    return results

# Tool to download and extract text from PDF
@function_tool
def extract_paper_text(pdf_url: str):
    """Download a paper PDF and extract its text content."""
    if not pdf_url.endswith('.pdf'):
        return "The URL does not point to a PDF file."
    
    try:
        # Create a temp directory if it doesn't exist
        os.makedirs("temp", exist_ok=True)
        
        # Download the PDF
        local_file = os.path.join("temp", "paper.pdf")
        urllib.request.urlretrieve(pdf_url, local_file)
        
        # Extract text
        text = ""
        with fitz.open(local_file) as doc:
            for page in doc:
                text += page.get_text()
        
        return text
    except Exception as e:
        return f"Error extracting text: {str(e)}"

# Tool to organize research notes
@function_tool
# key_findings is typed as list[str] so the generated tool schema knows the item type
def organize_notes(title: str, authors: str, year: str, summary: str,
                   key_findings: list[str], bibliography_format: str = "APA"):
    """Organize research notes into a structured format."""
    notes = {
        "title": title,
        "authors": authors,
        "year": year,
        "summary": summary,
        "key_findings": key_findings,
    }
    
    # Generate citation
    if bibliography_format.upper() == "APA":
        citation = f"{authors} ({year}). {title}."
    elif bibliography_format.upper() == "MLA":
        citation = f"{authors}. \"{title}.\" {year}."
    else:
        citation = f"{authors}, {title}, {year}."
    
    notes["citation"] = citation
    
    return notes
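
The citation built in organize_notes pastes the authors string in as-is. As a rough illustration of how author names could be normalized first, here is a sketch that approximates an APA-style author list. The function names are hypothetical, and the name handling is deliberately naive: it assumes the last token of each name is the surname, which fails for multi-word surnames and suffixes.

```python
def apa_author(name: str) -> str:
    """Turn 'Ada M. Lovelace' into 'Lovelace, A. M.' (last token treated as surname)."""
    parts = name.split()
    if len(parts) < 2:
        return name
    surname, given = parts[-1], parts[:-1]
    initials = " ".join(f"{g[0].upper()}." for g in given)
    return f"{surname}, {initials}"


def apa_authors(names: list[str]) -> str:
    """Join formatted authors, placing '&' before the final one."""
    formatted = [apa_author(n) for n in names]
    if len(formatted) <= 1:
        return "".join(formatted)
    return ", ".join(formatted[:-1]) + ", & " + formatted[-1]


def apa_citation(names: list[str], year: str, title: str) -> str:
    """Build the same '<authors> (<year>). <title>.' shape used by organize_notes."""
    return f"{apa_authors(names)} ({year}). {title}."
```

Generated citations should still be checked by hand, since journal styles differ in details this sketch ignores.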
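
# File: /research-assistant-agent/research_agents.py
# (a local file named agents.py would shadow the SDK's `agents` package and break the import below)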
from agents import Agent
from tools import search_papers, extract_paper_text, organize_notes

# Create specialized agents
search_agent = Agent(
    name="Literature Search Specialist",
    instructions="""You are an expert at finding relevant academic papers.
    Your task is to search for papers based on the user's query and return the most relevant results.
    Prioritize recent papers, highly cited works, and those from reputable sources.
    Be thorough in your search and provide comprehensive information about each paper found.""",
    tools=[search_papers],
)

summary_agent = Agent(
    name="Paper Summarization Expert",
    instructions="""You are an expert at summarizing academic papers.
    Your task is to create concise yet comprehensive summaries of academic papers.
    Focus on extracting key findings, methodology, results, and conclusions.
    Maintain academic rigor while making the content accessible.
    Use clear language and organize the summary logically.""",
    tools=[extract_paper_text],
)

organization_agent = Agent(
    name="Research Organization Specialist",
    instructions="""You are an expert at organizing research materials.
    Your task is to structure research notes, create bibliographies, and organize information.
    Follow academic standards and ensure all information is properly cited.
    Create clear, logical structures for organizing complex research information.""",
    tools=[organize_notes],
)

analysis_agent = Agent(
    name="Research Analysis Expert",
    instructions="""You are an expert at analyzing research findings and generating insights.
    Your task is to identify patterns, connections, and contradictions across multiple papers.
    Generate potential research questions based on gaps in the literature.
    Provide critical analysis of methodologies and conclusions in papers.
    Help connect findings to broader theoretical frameworks.""",
    tools=[],  # Using built-in reasoning capabilities
)

# Create main research assistant agent with handoffs to specialized agents
research_assistant = Agent(
    name="Academic Research Assistant",
    instructions="""You are an academic research assistant that helps researchers access, 
    organize, and analyze academic literature. You can search for papers, summarize them,
    organize research notes, and analyze findings to generate insights.
    
    When a user asks a research question, first understand what they're looking for,
    then delegate to the appropriate specialized agent. Maintain a helpful, professional tone
    and provide accurate academic information.
    
    For literature searches, delegate to the Literature Search Specialist.
    For paper summaries, delegate to the Paper Summarization Expert.
    For organizing research, delegate to the Research Organization Specialist.
    For analysis and insights, delegate to the Research Analysis Expert.
    
    Ensure all responses maintain academic standards and provide proper citations.
    """,
    handoffs=[search_agent, summary_agent, organization_agent, analysis_agent],
)
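
The search specialist is instructed to prioritize recent and highly cited papers, but that ordering is left entirely to the model. One way to make it deterministic is to sort the tool results before returning them. The sketch below assumes result dictionaries shaped like those returned by search_papers (title, year, citations). The function rank_papers and its citations-per-year score are illustrative choices, not something the SDK or the original project prescribes.

```python
from datetime import date


def rank_papers(papers, current_year=None):
    """Drop duplicate titles, then sort by citations per year since publication."""
    current_year = current_year or date.today().year

    # Deduplicate on a case-insensitive, whitespace-trimmed title
    seen, unique = set(), []
    for paper in papers:
        key = paper.get("title", "").strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(paper)

    def score(paper):
        try:
            year = int(paper.get("year"))
        except (TypeError, ValueError):
            return 0.0  # unknown or missing year sorts last
        age = max(current_year - year, 0) + 1
        return paper.get("citations", 0) / age

    return sorted(unique, key=score, reverse=True)
```

Dividing by age keeps a ten-year-old classic from always outranking a strong paper published last year.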

3. Setting Up Guardrails

Guardrails help ensure our agent operates within defined parameters and handles input appropriately:

# File: /research-assistant-agent/guardrails.py
from agents import Agent, GuardrailFunctionOutput, InputGuardrail
# Local module with the agent definitions. It must not be named agents.py,
# or it would shadow the SDK's `agents` package.
from research_agents import research_assistant

SENSITIVE_TOPICS = ["classified", "confidential", "proprietary", "plagiarism", "write my paper for me"]

# Guardrail functions receive the run context, the agent, and the input,
# and signal a rejection by setting tripwire_triggered=True.
def validate_research_query(ctx, agent, input):
    # The input may be a plain string or a list of message items
    input_text = input if isinstance(input, str) else str(input)

    # Check if query is too vague
    if len(input_text.split()) < 3:
        return GuardrailFunctionOutput(
            output_info="Query is too vague. Please provide a more specific research question.",
            tripwire_triggered=True,
        )

    # Check for potentially sensitive topics
    for topic in SENSITIVE_TOPICS:
        if topic in input_text.lower():
            return GuardrailFunctionOutput(
                output_info=f"Your query contains sensitive content ('{topic}'). Please reformulate your question.",
                tripwire_triggered=True,
            )

    return GuardrailFunctionOutput(output_info=None, tripwire_triggered=False)

# Create guardrail
input_guardrail = InputGuardrail(guardrail_function=validate_research_query, name="Research Query Validator")

# Apply guardrail to research assistant
research_assistant_with_guardrails = Agent(
    name="Academic Research Assistant",
    instructions=research_assistant.instructions,
    handoffs=research_assistant.handoffs,
    input_guardrails=[input_guardrail],
)
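
Keeping the validation rule in a plain function makes it possible to unit-test it without calling a model. The sketch below does that, and it also swaps the substring check for whole-word matching so that a query about "declassified archives" is not rejected for containing "classified". The name check_query, the constants, and the shortened phrase list are hypothetical. Whole-word matching is a design change from the guardrail above, not a description of it.

```python
import re

MIN_WORDS = 3
BLOCKED_PHRASES = ["classified", "confidential", "proprietary", "write my paper for me"]


def check_query(text: str):
    """Return a failure reason, or None when the query passes."""
    if len(text.split()) < MIN_WORDS:
        return "Query is too vague."
    lowered = text.lower()
    for phrase in BLOCKED_PHRASES:
        # \b anchors the phrase to word boundaries, so substrings of longer words do not match
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            return f"Query contains blocked phrase: {phrase!r}"
    return None
```

A guardrail function could call check_query and set its tripwire whenever the result is not None.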

4. Running the Research Assistant

Now let's create a simple interface to interact with our research assistant:

# File: /research-assistant-agent/main.py
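from agents import InputGuardrailTripwireTriggered, Runner
from guardrails import research_assistant_with_guardrails

def run_research_assistant(query):
    """Run the research assistant with a given query."""
    try:
        result = Runner.run_sync(
            starting_agent=research_assistant_with_guardrails,
            input=query
        )
    except InputGuardrailTripwireTriggered as exc:
        # A tripped input guardrail aborts the run by raising this exception
        return f"Query rejected: {exc.guardrail_result.output.output_info}"
    return result.final_output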

# Example usage
if __name__ == "__main__":
    print("Academic Research Assistant")
    print("---------------------------")
    print("Enter your research query or type 'exit' to quit.")
    
    while True:
        query = input("\nResearch query: ")
        if query.lower() == 'exit':
            break
        
        result = run_research_assistant(query)
        print("\nResearch Assistant Response:")
        print(result)

Enhancing the Agent with Advanced Features

1. Implementing Tracing for Performance Monitoring

Tracing helps us visualize and debug the agent's flow:

# File: /research-assistant-agent/tracing.py
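from agents.tracing import TracingProcessor, add_trace_processor

class LoggingTraceProcessor(TracingProcessor):
    """Print every finished span; the remaining hooks are no-ops."""
    def on_trace_start(self, trace): pass
    def on_trace_end(self, trace): pass
    def on_span_start(self, span): pass

    def on_span_end(self, span):
        # Log the span data
        print(f"[TRACE] {span.span_data.type}: {span.span_data.export()}")

    def shutdown(self): pass
    def force_flush(self): pass

# Register the processor alongside the SDK's default trace exporter
add_trace_processor(LoggingTraceProcessor())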

2. Adding Memory for Contextual Conversations

Let's add a simple in-memory storage to retain context between interactions:

# File: /research-assistant-agent/memory.py
import json

class ResearchMemory:
    def __init__(self):
        self.papers_found = []
        self.summaries = {}
        self.notes = {}
        self.current_topic = None
    
    def add_paper(self, paper):
        self.papers_found.append(paper)
    
    def add_summary(self, paper_title, summary):
        self.summaries[paper_title] = summary
    
    def add_notes(self, paper_title, notes):
        self.notes[paper_title] = notes
    
    def set_topic(self, topic):
        self.current_topic = topic
    
    def get_papers(self):
        return self.papers_found
    
    def get_summary(self, paper_title):
        return self.summaries.get(paper_title, "No summary available")
    
    def get_notes(self, paper_title):
        return self.notes.get(paper_title, "No notes available")
    
    def get_context(self):
        return {
            "current_topic": self.current_topic,
            "papers_found": len(self.papers_found),
            "papers_summarized": len(self.summaries),
            "notes_created": len(self.notes)
        }

# Initialize memory
research_memory = ResearchMemory()

# File: /research-assistant-agent/main_with_memory.py
from agents import Runner
from guardrails import research_assistant_with_guardrails
from memory import research_memory
import json

# Modify the run function to use memory
def run_research_assistant_with_memory(query):
    # Update memory with current topic
    if research_memory.current_topic is None:
        research_memory.set_topic(query)
    
    # Add context from memory
    context = f"Previous research context: {json.dumps(research_memory.get_context())}"
    full_query = f"{context}\n\nNew query: {query}"
    
    result = Runner.run_sync(
        starting_agent=research_assistant_with_guardrails,
        input=full_query
    )
    
    # Here you would parse the result to update memory
    # This is simplified for the example
    
    return result.final_output
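
Because get_context reports only counts, the model never learns which papers were found earlier. A slightly richer context could include a few recent titles. The following sketch shows one possible shape. The function build_context and its max_titles parameter are hypothetical, and it takes a plain list of paper dictionaries rather than the ResearchMemory object so that it stays self-contained.

```python
import json


def build_context(topic, papers, max_titles=5):
    """Summarize stored research as a short JSON string the model can read."""
    # Collect unique titles in the order they were found
    titles = []
    for paper in papers:
        title = paper.get("title")
        if title and title not in titles:
            titles.append(title)

    context = {
        "current_topic": topic,
        "papers_found": len(titles),
        "recent_titles": titles[-max_titles:],  # cap the prompt size
    }
    return json.dumps(context, ensure_ascii=False)
```

Capping the number of titles keeps the prefix added to each query from growing without bound over a long session.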

Advanced Use Cases

Our Academic Research Assistant can be extended for several specialized research tasks:

1. Literature Review Automation

# File: /research-assistant-agent/advanced_tools.py
from agents import function_tool

@function_tool
def generate_literature_review(topic: str, papers: list):
    """Generate a structured literature review from a list of papers."""
    # Implementation would organize papers by themes, identify gaps,
    # and create a coherent narrative of the research landscape
    return "Structured literature review" # Simplified for this example
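
As a starting point for the "organize papers by themes" step, here is a naive keyword baseline, not the project's implementation. It assumes paper dictionaries with title and abstract keys, as produced by search_papers, and a caller-supplied list of theme keywords. The function name group_by_theme is hypothetical.

```python
from collections import defaultdict


def group_by_theme(papers, themes):
    """Map each theme keyword to the titles whose title or abstract mentions it."""
    groups = defaultdict(list)
    for paper in papers:
        text = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
        matched = [theme for theme in themes if theme.lower() in text]
        # Papers that match no theme are kept under a catch-all bucket
        for theme in matched or ["uncategorized"]:
            groups[theme].append(paper.get("title", "Unknown Title"))
    return dict(groups)
```

A real implementation would more likely ask the analysis agent to propose the themes, or cluster embeddings of the abstracts, instead of relying on fixed keywords.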

2. Research Gap Identification

# File: /research-assistant-agent/advanced_tools.py
@function_tool
def identify_research_gaps(papers: list):
    """Analyze a collection of papers to identify gaps in the literature."""
    # Implementation would compare methodologies, findings, and study populations
    # to highlight understudied areas
    return "Identified research gaps" # Simplified for this example
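
One simple way to approximate "highlight understudied areas" is to count how many papers mention each facet of interest (a population, method, or setting) and report the rarely mentioned ones. The sketch below is a keyword-counting baseline under that assumption. The name find_understudied, the facets argument, and the threshold are hypothetical.

```python
def find_understudied(papers, facets, threshold=1):
    """Return facets mentioned in at most `threshold` papers, with their counts."""
    counts = {facet: 0 for facet in facets}
    for paper in papers:
        text = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
        for facet in facets:
            if facet.lower() in text:
                counts[facet] += 1  # count each paper once per facet
    return {facet: n for facet, n in counts.items() if n <= threshold}
```

A low count only suggests a gap within the retrieved sample, so any candidate should be confirmed with a broader search before it is treated as a real gap in the literature.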

3. Cross-Discipline Connection

# File: /research-assistant-agent/advanced_tools.py
@function_tool
def find_cross_discipline_connections(primary_field: str, papers: list):
    """Identify connections between the primary research field and other disciplines."""
    # Implementation would analyze papers for methodologies or findings
    # that could apply to or benefit from other disciplines
    return "Cross-discipline connections" # Simplified for this example

Best Practices and Limitations

While building your Research Assistant, keep these best practices in mind:

  1. Respect Copyright: Ensure your agent respects copyright laws when accessing and processing academic papers.

  2. Citation Accuracy: Double-check citations generated by the agent, as accuracy is crucial in academic contexts.

  3. Human Verification: Researchers should verify the agent's output before using it in their own work.

  4. Transparency: Be clear about when content is AI-generated vs. human-authored.

  5. Privacy Considerations: Handle research data with appropriate privacy controls.

Limitations

Be aware of these limitations:

  1. The agent can only access papers that are publicly available or to which the researcher has legitimate access.

  2. The quality of summaries depends on the clarity of the original text and the agent's understanding.

  3. The agent may not fully grasp highly specialized terminology in niche academic fields.

  4. Citation formats might require manual adjustment for specific academic journals.

Conclusion

Building an Academic Research Assistant using OpenAI's Agents SDK demonstrates the powerful capabilities of agentic AI applications. By combining specialized agents with appropriate tools, guardrails, and tracing, we've created a system that can significantly enhance the research process.

This multi-agent approach showcases how complex workflows can be broken down into specialized tasks, each handled by an expert agent. The handoff mechanism allows for seamless collaboration between these agents, creating a comprehensive research assistant that adapts to the user's needs.

As AI technology continues to evolve, these types of agentic applications will become increasingly valuable in academic contexts, helping researchers navigate the vast landscape of published literature and accelerate the pace of scientific discovery.

Remember that while AI can be an incredibly powerful research tool, it works best as an assistant to human researchers rather than a replacement. The combination of human expertise with AI capabilities offers the most promising path forward for academic research.

Further Resources

The complete code for this project is available in the GitHub repository. The implementation includes a fully functional CLI, a web interface, and example usage scenarios to help you get started with your own research assistant.

To run the project locally:

# Clone the repository
git clone https://github.com/aubreyzulu/portifolio.git
cd portifolio/portifolio/research-assistant-agent

# Install dependencies
pip install -r requirements.txt

# Run the application
python run.py

This will launch the application and allow you to choose between the CLI, web interface, or example usage scenarios.