Introducing VerifAI's MultiLLM open-source framework

Harnessing the Power of Multiple LLMs

VerifAI's open-source Python framework, MultiLLM, calls LLMs in parallel and ranks their outputs to find the best result (the one closest to the ground truth).
The first use case is comparing code produced by GPT-3.5 and Google Bard. MultiLLM can be extended to support new LLMs and custom ranking functions that evaluate a variety of LLM outputs.

The world of natural language processing has witnessed remarkable advancements in recent years, thanks to the proliferation of large language models (LLMs). These LLMs can understand context, generate human-like text, and perform various language-related tasks with astonishing accuracy. However, LLMs can produce incorrect results and hallucinate. That's where the MultiLLM class steps in, offering a solution to concurrently invoke and manage multiple LLMs while efficiently ranking their output to achieve the ground truth.

Unveiling VerifAI's open-source MultiLLM Framework

VerifAI's open-source MultiLLM framework provides a powerful and efficient solution for invoking multiple large language models (LLMs) concurrently and ranking their outputs to get the results closest to the ground truth. By leveraging the capabilities of several LLMs together, developers and researchers can address complex tasks more effectively than with a single model.

VerifAI MultiLLM Architecture - Invokes multiple LLMs and ranks the output

Quick Start

Getting started with the MultiLLM class is straightforward.

Download and install requirements.txt

Install requirements.txt and the multillm package by executing:

pip3 install -r requirements.txt
pip3 install multillm

Download and edit the config.json file

Change each "credentials" entry to the path of your own Google or OpenAI credentials JSON file.

In config.json, edit "credentials": "<add-your-path>/credentials.json"

config.json File

{
  "Config": {
    "MultiLLM": {
      "rank_callback_file": "example_rank_callback.py",
      "llms": [
        {
          "file": "models/bard.py",
          "class_name": "BARD",
          "model": "chat-bison@001",
          "credentials": "application_default_credentials.json"
        },
        {
          "file": "models/GPT.py",
          "class_name": "GPT",
          "model": "gpt-3.5-turbo",
          "credentials": "/openai/key.json"
        }
      ]
    }
  }
}
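As a minimal sketch of how a file like this might be inspected before running the tool, the snippet below parses the example configuration, lists the configured models, and warns about credentials paths that do not exist on disk. The helper names `list_llms` and `missing_credentials` are illustrative assumptions and are not part of the multillm package.

```python
import json
import os

# The example configuration from above, embedded so the sketch is self-contained.
EXAMPLE_CONFIG = """
{
  "Config": {
    "MultiLLM": {
      "rank_callback_file": "example_rank_callback.py",
      "llms": [
        {"file": "models/bard.py", "class_name": "BARD",
         "model": "chat-bison@001",
         "credentials": "application_default_credentials.json"},
        {"file": "models/GPT.py", "class_name": "GPT",
         "model": "gpt-3.5-turbo",
         "credentials": "/openai/key.json"}
      ]
    }
  }
}
"""


def list_llms(config):
    """Return the list of LLM entries (hypothetical helper)."""
    return config["Config"]["MultiLLM"]["llms"]


def missing_credentials(config):
    """Return (class_name, path) pairs whose credentials file is not on disk."""
    return [
        (llm["class_name"], llm["credentials"])
        for llm in list_llms(config)
        if not os.path.isfile(llm["credentials"])
    ]


config = json.loads(EXAMPLE_CONFIG)
for llm in list_llms(config):
    print(f'{llm["class_name"]}: model={llm["model"]}, file={llm["file"]}')
for name, path in missing_credentials(config):
    print(f"warning: credentials for {name} not found at {path}")
```

Running a check like this first makes a mistyped credentials path show up as a clear warning rather than as an authentication failure later.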

Example google-app-credentials.json

{
  "client_id": "123489-6qr4p.apps.googleusercontent.com",
  "client_secret": "fx-d3456-tryf0g9f9",
  "quota_project_id": "my-llm-training",
  "refresh_token": "1-34GFH89KLwe-eft",
  "type": "authorized_user"
}

Example openai-credentials.json

{
  "organization": "org-jc8901FDLI0267",
  "api_key": "rt-067FGDiTL834"
}

Execute the following command to run a basic example:

multillm -c config.json -prompt "write a python function to sort an array of a billion integers"

Example output of the multillm command above

multillm response Ranking:
1. GPT
2. BARD

Explanation:

"GPT's implementation of sorting an array is simpler and more concise. It uses the built-in sorted function in Python, which has a time complexity of O(n log n). On the other hand, BARD's implementation uses the merge sort algorithm, which has a time complexity of O(n log n) as well. However, BARD's implementation is longer and more complex, requiring additional helper functions. Therefore, GPT's implementation is ranked higher"

This example showcases the core functionality of the MultiLLM class: invoking language models concurrently and efficiently processing their responses.

Example 1: Newton's Method

In the example below, we ask MultiLLM to generate code from BARD and GPT and to compare the results:

multillm -c config.json -prompt "write a python function to find a root of the function f using Newton's method"

Output of Example 1

loading module bard...
finished loading module bard
registered model BARD <bard.BARD object at 0x10e4e5b70>
loading module GPT...
finished loading module GPT
registered model GPT <GPT.GPT object at 0x10e4e5d20>
loaded llms: {'bard': <module 'bard' from '/Users/Verifai/models/bard.py'>, 'GPT': <module 'GPT' from '/Users/Verifai/models/GPT.py'>}
calling model: BARD
calling model: GPT
model chat-bison@001

BARD Response:

def newton(f, df, x0, tol=1e-6, maxiter=100):
    """
    Find a root of the function f using Newton's method.

    Args:
        f: The function to find a root of.
        df: The derivative of f.
        x0: The initial guess for the root.
        tol: The tolerance for convergence.
        maxiter: The maximum number of iterations to perform.
        

    Returns:
        The root of f, or None if no root was found.
    """

    for i in range(maxiter):
        x1 = x0 - f(x0) / df(x0)
        if abs(x1 - x0) < tol:
            return x1
        x0 = x1
    return None

GPT Response:

def newton_method(f, f_prime, initial_guess, tol=1e-6, max_iter=100):
    """
    Newton's method for finding the root of a function.

    Parameters:
        f (function): The function for which the root is to be found.
        f_prime (function): The derivative of f.
        initial_guess (float): The initial guess for the root.
        tol (float): The desired tolerance (default 1e-6).
        max_iter (int): The maximum number of iterations (default 100).

    Returns:
        float: The root found by Newton's method, or None if no root is found.
    """
    x = initial_guess
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        fpx = f_prime(x)
        if fpx == 0:
            return None
        x -= fx / fpx
    return None
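Both responses above can be exercised on a simple function to see how they behave. The sketch below copies the two functions (docstrings shortened) and runs them on f(x) = x^2 - 2, first from a normal starting point and then from x0 = 0, where the derivative is zero. The test function and starting points are choices made for this illustration only.

```python
def newton(f, df, x0, tol=1e-6, maxiter=100):
    """BARD's response, copied from above."""
    for i in range(maxiter):
        x1 = x0 - f(x0) / df(x0)
        if abs(x1 - x0) < tol:
            return x1
        x0 = x1
    return None


def newton_method(f, f_prime, initial_guess, tol=1e-6, max_iter=100):
    """GPT's response, copied from above."""
    x = initial_guess
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        fpx = f_prime(x)
        if fpx == 0:
            return None
        x -= fx / fpx
    return None


def f(x):
    """Test function with a root at sqrt(2)."""
    return x * x - 2.0


def df(x):
    """Derivative of the test function."""
    return 2.0 * x


# Normal case: both versions converge to sqrt(2) from x0 = 1.
bard_root = newton(f, df, 1.0)
gpt_root = newton_method(f, df, 1.0)
print("BARD:", bard_root, "GPT:", gpt_root)

# Zero-derivative case: df(0) == 0, so BARD's version divides by zero,
# while GPT's version detects the zero derivative and returns None.
try:
    newton(f, df, 0.0)
    bard_zero = "returned"
except ZeroDivisionError:
    bard_zero = "ZeroDivisionError"
gpt_zero = newton_method(f, df, 0.0)
print("BARD at x0=0:", bard_zero, "| GPT at x0=0:", gpt_zero)
```

The second case shows concretely what a missing zero-derivative check means in practice.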

MultiLLM Explanation and Ranking for Example 1:

"The GPT implementation of the Newton's method function (newton_method) is well-documented with clear parameter explanations and return format. It also includes appropriate input validation checks and handles the case when the derivative is 0. Overall, it is a comprehensive and robust implementation."
"The BARD implementation of the Newton's method function (newton) is also well-documented and accepts the necessary parameters. However, it lacks input validation checks, such as when the derivative is 0, and does not provide an explicit return value when no root is found. It can be improved by addressing these limitations."

"Based on these factors, I would rank GPT higher than BARD in terms of the clarity, completeness, and robustness of the implementation."

Example 2: Primes

We ask MultiLLM to write a function to detect prime numbers, and to compare results from BARD and GPT:

multillm -c config.json -prompt "Write me a function to determine whether a number is prime"

Output of Example 2

loading module bard...
finished loading module bard
registered model BARD <bard.BARD object at 0x11e881b70>
loading module GPT...
finished loading module GPT
registered model GPT <GPT.GPT object at 0x11e881d20>
loaded llms: {'bard': <module 'bard' from '/Users/frank/Verifai/repo/verifai-ai/src/verifai/llm/multi_llm/models/bard.py'>, 'GPT': <module 'GPT' from '/Users/frank/Verifai/repo/verifai-ai/src/verifai/llm/multi_llm/models/GPT.py'>}
calling model: BARD
calling model: GPT
model chat-bison@001

BARD Response:

def is_prime(n):
  """Returns True if n is prime, False otherwise."""
  if n < 2:
    return False
  for i in range(2, int(n ** 0.5) + 1):
    if n % i == 0:
      return False
  return True

GPT Response:

def is_prime(number):
    if number < 2:
        return False
    for i in range(2, int(number**0.5) + 1):
        if number % i == 0:
            return False
    return True
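Since both responses use the same trial-division algorithm, one quick check covers them. The sketch below copies the function and lists the primes below 30; the range is an arbitrary choice for illustration.

```python
def is_prime(n):
    """Trial division up to sqrt(n), as in both responses above."""
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True


primes_below_30 = [n for n in range(30) if is_prime(n)]
print(primes_below_30)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```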

MultiLLM Explanation and Ranking for Example 2:

"BARD's implementation of the isprime function is identical to GPT's implementation. Both functions check if a number is prime using the same algorithm. Therefore, both LLMs share the same ranking."
"GPT's implementation of the isprime function is correct and follows the standard algorithm for checking prime numbers. While the implementation is identical to BARD's, it still demonstrates proficiency in coding logic and knowledge of basic mathematical operations."

Example 3: Binary Sort

We ask MultiLLM to write a binary sort function, and to compare results from BARD and GPT:

multillm -c config.json -prompt "write a python function to do a binary sort"

Output of Example 3

loading module bard...
finished loading module bard
registered model BARD <bard.BARD object at 0x1113b1b70>
loading module GPT...
finished loading module GPT
registered model GPT <GPT.GPT object at 0x1113b1d20>
loaded llms: {'bard': <module 'bard' from '/Users/frank/Verifai/repo/verifai-ai/src/verifai/llm/multi_llm/models/bard.py'>, 'GPT': <module 'GPT' from '/Users/frank/Verifai/repo/verifai-ai/src/verifai/llm/multi_llm/models/GPT.py'>}
calling model: BARD
calling model: GPT
model chat-bison@001

BARD Response:

def binary_search(array, target):
    left = 0
    right = len(array) - 1

    while left <= right:
        mid = (left + right) // 2

        if array[mid] == target:
            return mid
        elif array[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return -1

GPT Response:

def binary_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left_half = arr[:mid]
    right_half = arr[mid:]

    left_half = binary_sort(left_half)
    right_half = binary_sort(right_half)

    return merge(left_half, right_half)
 
def merge(left, right):
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    while i < len(left):
        merged.append(left[i])
        i += 1
    while j < len(right):
        merged.append(right[j])
        j += 1

    return merged
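The two responses above solve different problems: BARD's function searches a sorted array, while GPT's function sorts an array with merge sort. The sketch below copies both (with consistent indentation) and uses them together on a small list chosen for illustration, sorting first and then searching the sorted result.

```python
def binary_search(array, target):
    """BARD's response: return the index of target in a sorted array, or -1."""
    left = 0
    right = len(array) - 1
    while left <= right:
        mid = (left + right) // 2
        if array[mid] == target:
            return mid
        elif array[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # append whatever remains on either side
    merged.extend(right[j:])
    return merged


def binary_sort(arr):
    """GPT's response: a recursive merge sort."""
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    return merge(binary_sort(arr[:mid]), binary_sort(arr[mid:]))


data = [7, 2, 9, 4, 1]
sorted_data = binary_sort(data)
print(sorted_data)                    # [1, 2, 4, 7, 9]
print(binary_search(sorted_data, 7))  # 3
print(binary_search(sorted_data, 5))  # -1
```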

MultiLLM Explanation and Ranking for Example 3:

"The GPT solution implements a recursive approach to merge sort. It splits the array into smaller halves, sorts them recursively, and then merges the sorted halves. This approach has a time complexity of O(n log n) and is efficient for large datasets."
"The BARD solution implements a binary search algorithm. It searches for a target value in a sorted array by repeatedly dividing the search space in half. This algorithm has a time complexity of O(log n) and is efficient for finding a specific element in a sorted array. The implementation is correct and returns the index of the target if found, or -1 if not found. However, it does not involve ranking or sorting other LLMs, which is the task at hand."

Exploring Use Cases

The MultiLLM class proves its worth across a variety of use cases, enabling users to harness the combined capabilities of multiple LLMs in diverse scenarios. Here are a few scenarios where the application shines:

Complex Query Resolution: When dealing with intricate queries or prompts, using a single LLM might not yield the most accurate or comprehensive results. By leveraging multiple LLMs simultaneously, the MultiLLM class can enhance the quality of responses by aggregating insights from different models.

Action Chains: Often, processing raw LLM outputs directly might not yield the desired results. Action chains provide a mechanism to preprocess LLM responses using a sequence of actions. This empowers users to refine and enhance the output, resulting in more polished and relevant content.

Ranking Aggregation: Combining the outputs of multiple LLMs can be a daunting task. The MultiLLM class includes a Rank class that allows users to modify and rank the combined LLM outputs, making it easier to identify the most relevant information.

Architecture and Components

The architecture of the MultiLLM class is designed for flexibility, modularity, and efficiency. Here's a brief overview of its key components:

The MultiLLM class invokes multiple LLMs and ranks the results

MultiLLM Class

At the heart of the application is the MultiLLM class, responsible for orchestrating the concurrent execution of multiple language models. It allows developers to define and load LLMs from a configuration file, run them in parallel, and process the outputs efficiently.
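The idea of running several models in parallel can be sketched with Python's standard `concurrent.futures` module. This is an illustration under stated assumptions, not the framework's actual implementation: `fake_bard`, `fake_gpt`, and `run_in_parallel` are hypothetical names, and the model calls are replaced by local stand-ins so the example runs without credentials.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_bard(prompt):
    """Stand-in for a real model call (assumption: no network access here)."""
    return f"BARD answer to: {prompt}"


def fake_gpt(prompt):
    """Stand-in for a real model call."""
    return f"GPT answer to: {prompt}"


def run_in_parallel(models, prompt):
    """Send the same prompt to every model and collect responses by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        # Submit all calls first so they run concurrently, then wait for each.
        futures = {name: pool.submit(call, prompt) for name, call in models.items()}
        return {name: future.result() for name, future in futures.items()}


responses = run_in_parallel({"BARD": fake_bard, "GPT": fake_gpt}, "sort a list")
for name, text in responses.items():
    print(name, "->", text)
```

Threads suit this kind of workload because calls to hosted models spend most of their time waiting on the network.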

BaseLLM

The BaseLLM class provides a structured foundation for implementing specific language model classes. It includes essential attributes and methods necessary for interfacing with language models, making it easier to develop and integrate new LLM implementations.
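To make the idea of a shared base class concrete, here is a minimal sketch of what such a foundation could look like. The class name `SketchBaseLLM`, its attributes, and the `get_response` method are assumptions made for illustration; the real BaseLLM interface may differ, so consult the source in the models directory before extending it.

```python
from abc import ABC, abstractmethod


class SketchBaseLLM(ABC):
    """Hypothetical base class; the real BaseLLM interface may differ."""

    def __init__(self, model, credentials):
        self.model = model              # model identifier, e.g. "gpt-3.5-turbo"
        self.credentials = credentials  # path to a credentials file

    @abstractmethod
    def get_response(self, prompt):
        """Return the model's text response for the given prompt."""


class EchoLLM(SketchBaseLLM):
    """Toy subclass that echoes the prompt instead of calling a real API."""

    def get_response(self, prompt):
        return f"[{self.model}] {prompt}"


llm = EchoLLM(model="echo-1", credentials="unused.json")
print(llm.get_response("hello"))  # [echo-1] hello
```

Declaring the method abstract means a subclass that forgets to implement it fails at construction time rather than in the middle of a run.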

Action Class

The Action class provides a framework for defining actions that can be applied to modify data in a serial manner. These actions can be used to preprocess LLM outputs, such as refining responses, extracting specific information, or applying transformations.
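A serial chain of actions can be sketched as a list of functions applied one after another, each receiving the previous result. The function names below (`strip_whitespace`, `extract_code`, `run_chain`) are hypothetical and do not reflect the Action class's actual API.

```python
def strip_whitespace(text):
    """Remove leading and trailing whitespace."""
    return text.strip()


def extract_code(text):
    """Keep only the text from the first 'def ' onward, if present."""
    start = text.find("def ")
    return text[start:] if start != -1 else text


def run_chain(actions, data):
    """Apply each action in order, feeding its output to the next."""
    for action in actions:
        data = action(data)
    return data


raw = "  Sure! Here is the function:\ndef add(a, b):\n    return a + b\n  "
cleaned = run_chain([strip_whitespace, extract_code], raw)
print(cleaned)
```

Here the chain turns a chatty response into just the function definition, which is the kind of preprocessing that makes outputs easier to compare.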

Rank Class

The Rank class, similar to the Action class, allows developers to define actions that operate on the final combined output of multiple LLMs. This class is particularly useful for ranking and aggregating the results from different models.
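As a rough illustration of a custom ranking step, the sketch below orders code responses with a simple heuristic: responses that parse as valid Python come first, and shorter ones are preferred among those. Both the heuristic and the `rank_responses` signature are assumptions for this example. The interface expected by the framework's rank callback is not shown in this post, and the explanations in the examples above suggest its default ranking is more sophisticated than this.

```python
import ast


def parses_as_python(code):
    """Return True if the text is syntactically valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def rank_responses(responses):
    """Return model names best-first: valid code first, then shorter code."""
    return sorted(
        responses,
        key=lambda name: (not parses_as_python(responses[name]), len(responses[name])),
    )


responses = {
    "BARD": "def add(a, b):\n    result = a + b\n    return result\n",
    "GPT": "def add(a, b):\n    return a + b\n",
    "BROKEN": "def add(a, b) return a + b",
}
ranking = rank_responses(responses)
print(ranking)  # ['GPT', 'BARD', 'BROKEN']
```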

Contributing to MultiLLM

The MultiLLM class and application are open source, and the VerifAI team welcomes contributions from the community. If you're interested in extending the application's capabilities, adding new language models, or enhancing its existing features, you can do so by extending the models provided in the models directory. Check out the BaseLLM section for guidance on creating your own custom LLM implementations.

Conclusion

The MultiLLM application reflects the VerifAI team's dedication to optimizing the usage of large language models. By bringing together the power of multiple LLMs, developers can tackle complex language processing tasks with greater efficiency and precision. The application's modular architecture, coupled with the ability to define action chains and rank results, makes it a versatile tool for various use cases. As natural language processing continues to evolve, the MultiLLM class and application are poised to play a pivotal role in driving advancements in the field.
