
Primitive Obsession in RAG Pipelines: A Refactoring Journey

Break Free from Primitive Obsession: Clean Code Starts Here

At numberz.ai, we believe in crafting clean, expressive, and testable code to ensure robust pipelines, especially in complex systems like Retrieval-Augmented Generation (RAG). As we tackle code smells in various stages of development, one of the most common yet subtle offenders is Primitive Obsession.

Primitive Obsession might not stand out at first glance, but it degrades code quality and hinders maintainability over time. Drawing inspiration from best practices like those in Martin Fowler’s Refactoring and Andrew Hunt and David Thomas’s The Pragmatic Programmer, we are constantly looking for ways to refactor our code to make it more readable, maintainable, and scalable.

What is Primitive Obsession?

Primitive Obsession occurs when basic data types like strings, integers, or lists are used to represent more complex concepts in a system, rather than encapsulating them in custom types. This leads to overly generic, hard-to-read code where meaning and validation logic are scattered everywhere. Instead of using meaningful abstractions, developers lean on primitives, making the code less expressive and more prone to errors.

Example:

Imagine you are building a RAG pipeline. Each step—document parsing, chunking, classification, and retrieval—relies on various inputs and outputs. If you use raw strings, lists, or integers to represent different parts of the pipeline, it quickly becomes unclear what those values represent, leading to possible errors and repeated logic across different stages.

def process_pipeline(raw_text: str):
    # chunk_text, classify_chunks, retrieve_documents, and
    # generate_response are assumed pipeline helpers
    chunks = chunk_text(raw_text)
    classified_chunks = classify_chunks(chunks)
    results = retrieve_documents(classified_chunks)
    return generate_response(results)

Here, raw_text is a bare string and chunks is presumably a list of strings. Nothing tells the reader what chunk_text() actually returns, or what behavior and validation belong to classified_chunks.
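
To see why this matters, consider a minimal sketch with hypothetical helper implementations. Because every stage speaks in bare strings and lists, the stages can be wired together incorrectly and nothing complains:

def chunk_text(raw_text: str) -> list:
    return raw_text.split("\n\n")  # naive paragraph-based chunking

def classify_chunks(chunks: list) -> list:
    return [("general", chunk) for chunk in chunks]  # placeholder classifier

raw_text = "Intro paragraph.\n\nDetails paragraph."

# Intended usage:
classified = classify_chunks(chunk_text(raw_text))

# Also "works" -- the unchunked document is classified as one big chunk,
# and nothing in the code flags the mistake:
mistake = classify_chunks([raw_text])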

The Cost of Primitive Obsession in RAG Pipelines

Using primitives in complex pipelines like RAG leads to several issues:

  • Lack of Clarity: The meaning behind each value is obscured, making it difficult to follow what the code is doing at each stage.
  • Redundant Logic: Without encapsulating validation or transformation logic in custom types, these operations get repeated across different parts of the pipeline.
  • Fragility: Any change in the pipeline’s logic can lead to errors, as primitives don’t offer the flexibility and safety of custom types.

In a RAG pipeline, this could result in improper chunking of documents, failed classifications, or even wrong document retrievals due to misinterpreted values.
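
The redundancy cost is especially easy to reproduce. In a hypothetical sketch, because a bare string carries no guarantees, every stage defensively re-checks the same invariant:

def chunk_text(raw_text: str) -> list:
    # A bare str guarantees nothing, so this stage validates...
    if not raw_text.strip():
        raise ValueError("Document content cannot be empty")
    return raw_text.split("\n\n")

def classify_chunks(chunks: list) -> list:
    # ...and so does the next one, duplicating the same check.
    if any(not chunk.strip() for chunk in chunks):
        raise ValueError("Chunk content cannot be empty")
    return [("general", chunk) for chunk in chunks]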

How to Refactor: Embrace Domain-Specific Types

The antidote to Primitive Obsession is creating domain-specific types. These types encapsulate both data and logic, providing clear intent and behavior at each stage of the pipeline.

Refactored Example:

Instead of using strings or lists, create meaningful abstractions that represent key concepts in your pipeline, such as RawDocument, Chunk, ClassifiedChunk, and Query.

class RawDocument:
    """An unprocessed source document, validated on construction."""
    def __init__(self, content: str):
        self.content = content
        self.validate()

    def validate(self):
        if not self.content:
            raise ValueError("Document content cannot be empty")

class Chunk:
    """A single piece of a document produced by the chunking stage."""
    def __init__(self, content: str):
        self.content = content

class ClassifiedChunk:
    """A chunk paired with the label assigned by the classification stage."""
    def __init__(self, chunk: Chunk, classification: str):
        self.chunk = chunk
        self.classification = classification
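
The Query type mentioned above can follow the same pattern. Here is one possible sketch, along with a demonstration that invalid state now fails fast at construction time rather than deep inside the pipeline:

class Query:
    """A user query for the retrieval stage, validated on construction."""
    def __init__(self, text: str):
        if not text.strip():
            raise ValueError("Query text cannot be empty")
        self.text = text

doc = RawDocument("Quarterly revenue grew 12% year over year.")
RawDocument("")  # raises ValueError here, not deep in the pipeline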

Now, each step in the RAG pipeline is more structured:

from typing import List

def process_pipeline(document: RawDocument):
    chunks = chunk_document(document)
    classified_chunks = classify_chunks(chunks)
    results = retrieve_documents(classified_chunks)
    return generate_response(results)

def chunk_document(document: RawDocument) -> List[Chunk]:
    # split_text is an assumed helper that breaks the text into parts
    return [Chunk(part) for part in split_text(document.content)]

By creating custom classes, you ensure each step handles only its domain-specific object, making the code much clearer and safer to use.
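
The classification stage can be typed the same way. A minimal sketch follows; the "general" label is a placeholder where a real pipeline would call a classifier:

def classify_chunks(chunks: List[Chunk]) -> List[ClassifiedChunk]:
    # Placeholder classification; a real implementation might call a model
    return [ClassifiedChunk(chunk, "general") for chunk in chunks]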

The Impact of Refactoring

Refactoring the pipeline to eliminate Primitive Obsession has numerous benefits:

  • Clarity: Domain-specific types make the pipeline more readable and self-documenting. Instead of generic strings or lists, you know exactly what a Chunk or ClassifiedChunk is and how it should be used.
  • Consistency: Instead of scattered validation logic, custom types handle validation internally, ensuring consistency across the pipeline.
  • Extensibility: If you need to modify the behavior of, say, how chunks are classified, you can easily do so in the ClassifiedChunk class, rather than updating scattered code (see the sketch after this list).
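
As a sketch of that kind of extension, imagine adding a hypothetical confidence score to ClassifiedChunk. The new field and its validation live in one place, and every call site benefits:

class ClassifiedChunk:
    """A chunk with its label and a hypothetical confidence score."""
    def __init__(self, chunk: Chunk, classification: str, confidence: float = 1.0):
        # Validation for the new field lives here, not at every call site
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        self.chunk = chunk
        self.classification = classification
        self.confidence = confidence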

Primitive Obsession in Complex RAG Pipelines: A Real-World Example

Let’s return to the earlier RAG pipeline. Imagine you are processing multiple stages, from text chunking to document retrieval, and you use raw types for these operations. Over time, as your pipeline grows and business requirements change, you’re forced to scatter fixes and validation logic everywhere.

With domain-specific types in place, however, your pipeline remains manageable. If you need to change the chunking logic or add new classification rules, you do so by refining the respective classes, maintaining clarity and consistency throughout.
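
For instance, switching to a hypothetical fixed-size chunking strategy is a change confined to chunk_document; the rest of the pipeline still consumes List[Chunk] unchanged:

def chunk_document(document: RawDocument, size: int = 500) -> List[Chunk]:
    # Fixed-size chunking replaces the old strategy; only this function
    # changes, since downstream stages depend on Chunk, not on raw strings
    text = document.content
    return [Chunk(text[i:i + size]) for i in range(0, len(text), size)]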

Conclusion

Primitive Obsession may seem harmless, but in complex systems like a RAG pipeline, it can quickly lead to confusing, error-prone code. By refactoring with domain-specific types, you not only clarify the meaning behind your data but also make your code easier to maintain and scale over time.

At numberz.ai, we embrace merciless refactoring to keep our code lean, expressive, and testable. By eliminating Primitive Obsession, we ensure that every stage of our RAG pipelines is robust, scalable, and easy to understand. The journey doesn’t end here—stay tuned for future insights into other code smells and how to refactor them for better system design.

Next in the Series

In the next post, we will dive into another common pitfall—Data Clumps—and explore how breaking up tightly bound data structures can lead to more modular and testable pipelines.
