Cognitive Core

The Cognitive Core is the central component of a VIRTUAL. It leverages a Large Language Model (LLM) for task execution and embodies the VIRTUAL's unique personality and central intelligence.

About the Large Language Models (LLMs)

The Cognitive Core currently leverages open-source models. Each Virtual's personality and central intelligence are incorporated using the approaches below:

  • Personality Development

    The backstory, lore, personality traits, and characteristics of each Virtual are developed using the Retrieval-Augmented Generation (RAG) method. This approach combines the generative capabilities of a language model with a retrieval mechanism, allowing the AI to pull relevant information from a knowledge base to enrich its responses. The technique is particularly effective for creating a Virtual's unique and engaging personality, as it can draw on a wide range of data to make the character's interactions more diverse and lifelike.

  • Central Intelligence

    For Virtuals with substantial datasets, direct fine-tuning of an open-source model is employed. This process adjusts the model's parameters on the large dataset, enhancing its ability to respond accurately and effectively within the Virtual's designated domain. Instruction-based fine-tuning is applied as necessary: the model is trained to follow specific instructions or guidelines, further refining its responses and actions according to predefined rules or objectives. If the dataset is smaller, the information is instead stored in a vector database and fed into the model using the RAG method, allowing the AI to access this more limited set of information efficiently.
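
The RAG path for smaller datasets can be sketched as follows. This is a minimal, illustrative stand-in: the bag-of-words "embedding", the in-memory knowledge base, and the prompt layout are all assumptions, not the production components.

```python
# Minimal RAG sketch: retrieve the lore snippets most relevant to the user's
# message, then prepend them to the LLM prompt as background context.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a neural encoder.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Illustrative knowledge base of backstory/lore snippets.
KNOWLEDGE_BASE = [
    "The Virtual grew up in a coastal town and loves sailing.",
    "The Virtual's favourite subject is astronomy.",
    "The Virtual speaks in a calm, encouraging tone.",
]

def build_prompt(user_message: str, top_k: int = 2) -> str:
    # Rank knowledge-base entries by similarity to the user message.
    query = embed(user_message)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Background:\n{context}\n\nUser: {user_message}\nVirtual:"

print(build_prompt("Tell me about astronomy"))
```

The key design point is that the model's weights stay fixed; only the retrieved context changes per request, which is why this path suits Virtuals whose datasets are too small to justify fine-tuning.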

Data Pre-processing

In today's diverse data landscape, relevant datasets come in various formats: text (from textbooks, forums, wikis), video, and audio. Currently, the central character core predominantly relies on text-based Large Language Models (LLMs) and, as such, primarily incorporates text-based training data. Consequently, if training data exists in non-textual formats such as video or audio, it must be transcribed into text before model training. Standard data-processing rules are then applied prior to training.

  • Data Cleaning: In this step, datasets are cleaned to remove any noise and nullities. Data rules are applied to maintain data integrity and improve data quality.

  • Data Transformation: Datasets undergo transformation and standardization to become interpretable and usable for model training.
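
The two steps above can be sketched as a small pipeline. The specific rules shown (dropping nulls, stripping markup, normalising whitespace and casing) are illustrative examples of cleaning and transformation, not the actual production rule set.

```python
# Hypothetical sketch of the pre-processing steps: cleaning removes noise
# and nullities, transformation standardises the surviving records.
import re

raw_records = [
    "  Virtuals run on an LLM core.  ",
    None,                              # nullity -> dropped
    "",                                # empty -> dropped
    "<p>RAG enriches responses</p>",   # HTML noise -> tags stripped
]

def clean(records):
    # Data Cleaning: drop nulls/empties and strip markup noise.
    cleaned = []
    for r in records:
        if not r or not r.strip():
            continue
        cleaned.append(re.sub(r"<[^>]+>", "", r))
    return cleaned

def transform(records):
    # Data Transformation: standardise whitespace and casing.
    return [" ".join(r.split()).lower() for r in records]

print(transform(clean(raw_records)))
# -> ['virtuals run on an llm core.', 'rag enriches responses']
```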

Remembering user conversations for a better user experience

The Virtual is engineered with a persistent memory system, aiming to closely mimic human-like memory capabilities and facilitate personalized interactions with users. To accomplish this, the system addresses two primary challenges:

  1. User and Conversation Identification and Recall:

    • The system is designed to reliably identify each user and their respective conversations, ensuring the ability to remember and reference these interactions accurately.

  2. Long Conversation Storage and Memory Processing:

    • Managing and storing extended conversations presents a challenge in terms of memory processing. The system is tailored to handle these long dialogues efficiently.

Unique Identifier

Each user engaging with a Virtual is assigned a unique identifier. This identifier is pivotal for maintaining conversation continuity and user specificity.
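
One simple way to realise this is shown below. The use of uuid4 and the lookup-table shape are assumptions for illustration; the actual ID scheme a Virtual uses is not specified here.

```python
# Hedged sketch of per-user unique identifiers: assign an ID on first
# contact, then return the same ID on every later interaction.
import uuid

_user_ids: dict[str, str] = {}  # handle -> assigned identifier

def get_user_id(handle: str) -> str:
    if handle not in _user_ids:
        _user_ids[handle] = uuid.uuid4().hex  # 32 hex chars, fits VARCHAR(32)
    return _user_ids[handle]

print(get_user_id("alice") == get_user_id("alice"))  # continuity: same user, same ID
```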

A Sample Database Table

A sample table for storing messages is defined as follows (the Messages table name is illustrative):

    CREATE TABLE Messages (
        message_id VARCHAR(32) NOT NULL PRIMARY KEY,
        conversation_id VARCHAR(32) NOT NULL,
        user_id VARCHAR(32) NOT NULL,
        prompt TEXT NOT NULL,
        timestamp DATETIME NOT NULL,
        response TEXT,
        FOREIGN KEY (conversation_id) REFERENCES Conversations(conversation_id)
    );

Vector Database

Messages are vectorised using embedding techniques. This vectorisation process transforms textual messages into numerical vector formats, suitable for efficient storage and retrieval.
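
A minimal sketch of this vectorisation-and-storage step is shown below. The hash-based embedding and the in-memory list standing in for the vector database are illustrative assumptions; production systems would use a neural embedding model and a dedicated vector store.

```python
# Toy sketch: map each message to a fixed-length numeric vector and store
# it alongside the user's identifier for later retrieval.
import hashlib

def embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic toy embedding: bucket character trigrams by hash.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        trigram = text[i:i + 3]
        h = int(hashlib.md5(trigram.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

vector_db: list[dict] = []  # in-memory stand-in for a vector database

def store_message(user_id: str, message: str) -> None:
    vector_db.append({"user_id": user_id, "text": message, "vector": embed(message)})

store_message("u1", "I love talking about sailing")
print(vector_db[0]["vector"])
```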

When the getPrompt('identifier', 'context', 'params') function is called, the system uses the user identifier to retrieve all associated messages from the vector database. It employs a retrieval method to process these vectors within the Large Language Model (LLM), enabling the LLM to understand the conversation's context without additional context inputs from the dApp. The LLM generates responses based on the retrieved conversation history. This approach ensures that responses are both contextually relevant and personalized to each user's ongoing conversation thread.
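
The retrieval flow described above can be sketched as follows. Only the getPrompt name and its three arguments come from the text; the vector-store shape, the cosine ranking, the context_vector and top_k parameters, and the prompt layout are illustrative assumptions.

```python
# Hedged sketch of the getPrompt flow: fetch the user's stored messages,
# rank them against the current context, and assemble the LLM prompt.
import math

# Toy vector store: user identifier -> list of (text, vector) pairs.
VECTOR_DB = {
    "user-1": [
        ("I adopted a cat last week", [1.0, 0.0]),
        ("My favourite sport is tennis", [0.0, 1.0]),
    ]
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def getPrompt(identifier, context, params):
    # Retrieve this user's history and keep the most relevant messages.
    history = VECTOR_DB.get(identifier, [])
    query_vec = params["context_vector"]  # embedding of `context`, assumed precomputed
    ranked = sorted(history, key=lambda m: cosine(m[1], query_vec), reverse=True)
    relevant = "\n".join(text for text, _ in ranked[: params.get("top_k", 3)])
    return f"Known about this user:\n{relevant}\n\nUser: {context}\nVirtual:"

print(getPrompt("user-1", "How is my pet doing?",
                {"context_vector": [1.0, 0.0], "top_k": 1}))
```

Because the history is retrieved per identifier, the dApp never has to resend prior context; the prompt is rebuilt from the vector store on every call.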

Learn more about contributing to Cognitive Core

Contribute to Cognitive Core
