Okay, I don’t know how this actually fits in with agentic chatbots, but I want to look at using a vector database to store document chunks and embeddings. After this I may also take a look at graph databases, though the latter is presently seriously iffy.

We previously chunked a document then stored the embeddings in an in-memory vector store. I thought it would be interesting to take a look at doing this with a vector database. After looking around a bit, I am going to try to use ChromaDB.

Embeddings, vector search, document storage, full-text search, metadata filtering, and multi-modal. All in one place. Retrieval that just works. As it should be.
Chroma

Install Chroma

Fortunately it is available on conda-forge. Unfortunately, a large number of files were installed and/or updated.

(agnt-3.12) PS R:\learn\ds_agent> conda install -c conda-forge chromadb
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

... ...

Proceed ([y]/n)? y

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Test Installation

The heading says ‘Test’ but this first bit is most likely playing around with Chroma on a small sample of documents. Well 2 very short documents. The content is taken from the Canadian Speech from the Throne for the 44th and 45th Parliaments. In both cases it is the section titled ‘Opening’.

This did take a bit of fooling around. The simple collection creation in the Chroma docs was not what I wanted. It did not chunk the docs nor generate embeddings for the chunks.
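To make the chunking idea concrete, here is a minimal sketch of fixed-size chunks with overlap. It is a simplification: LangChain's RecursiveCharacterTextSplitter splits on separators such as paragraphs and newlines first, so its chunks will not match these. The function name chunk_text and the sample string are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int = 300, chunk_overlap: int = 100) -> list[str]:
    """Cut text into pieces of at most chunk_size characters,
    each one starting chunk_overlap characters before the previous one ended."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 5  # 50 characters of throwaway text
pieces = chunk_text(sample, chunk_size=20, chunk_overlap=5)
for piece in pieces:
    print(len(piece), piece)
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk (as long as it is shorter than the overlap).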

So back to basics.

# chat_bot_5.py: 
# version: 0.1.0: 2025.06.23, rek, look at using vector database, ChromaDB

from dotenv import load_dotenv

import chromadb
from langchain.chains import VectorDBQA
from langchain_core.prompts import ChatPromptTemplate
# from langchain.embeddings import OpenAIEmbeddings
# from langchain_community.embeddings import OpenAIEmbeddings
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_openai import ChatOpenAI, OpenAI, OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.vectorstores.chroma import Chroma
from langchain_community.vectorstores import Chroma

load_dotenv()

llm = ChatOpenAI(model="gpt-4o-mini")
embddg = OpenAIEmbeddings()

tst_1 = """## Opening
    When my dear late mother, Queen Elizabeth II, opened a new Canadian Parliament in 1957, the Second World War remained a fresh, painful memory. The Cold War was intensifying. Freedom and democracy were under threat. Canada was emerging as a growing economic power and a force for peace in the world. In the decades since, history has been punctuated by epoch-making events: the Vietnam War, the fall of the Berlin Wall, and the start of the War on Terror. Today, Canada faces another critical moment. Democracy, pluralism, the rule of law, self-determination, and freedom are values which Canadians hold dear, and ones which the Government is determined to protect.

    The system of open global trade that, while not perfect, has helped to deliver prosperity for Canadians for decades, is changing. Canada’s relationships with partners are also changing.

    We must be clear-eyed: the world is a more dangerous and uncertain place than at any point since the Second World War. Canada is facing challenges that are unprecedented in our lifetimes.

    Many Canadians are feeling anxious and worried about the drastically changing world around them. Fundamental change is always unsettling. Yet this moment is also an incredible opportunity. An opportunity for renewal. An opportunity to think big and to act bigger. An opportunity for Canada to embark on the largest transformation of its economy since the Second World War. A confident Canada, which has welcomed new Canadians, including from some of the most tragic global conflict zones, can seize this opportunity by recognising that all Canadians can give themselves far more than any foreign power on any continent can ever take away. And that by staying true to Canadian values, Canada can build new alliances and a new economy that serves all Canadians.
"""

tst_2 = """## Opening
    As we speak, British Columbians are facing immeasurable challenges as their homes, their communities, and their wellbeing are impacted by terrible flooding.
    But in a time of crisis, we know how Canadians respond. We step up and we are there for each other.
    And the Government will continue to be there for the people of British Columbia.

    In 2020, Canadians did not know they would face the crisis of a once-in-a-century pandemic. But, as always, no one should be surprised by how Canadians responded.
    We adapted. We helped one another. And we stayed true to our values.
    Values like compassion, courage, and determination.
    Values like democracy.
    And in this difficult time, Canadians made a democratic choice.

    Their direction is clear: not only do they want Parliamentarians to work together to put this pandemic behind us, they also want bold, concrete solutions to meet the other challenges we face.
    Growing an economy that works for everyone.
    Fighting climate change.
    Moving forward on the path of reconciliation.
    Making sure our communities are safe, healthy, and inclusive.
  
    Yes, the decade got off to an incredibly difficult start, but this is the time to rebuild.
    This is the moment for Parliamentarians to work together to get big things done, and shape a better future for our kids.
"""

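# split docs into chunks for embedding
txt_spltr = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=100)
# create_documents() splits the raw strings and wraps each chunk in a
# LangChain Document (Chroma.from_documents() needs Documents, not strings)
tst_docs = txt_spltr.create_documents(
  texts=[tst_1, tst_2],
  metadatas=[{"source": "parlinfo45"}, {"source": "parlinfo44"}],
)
# tst_docs is already chunked, so this second pass should leave it unchanged
chunks = txt_spltr.split_documents(tst_docs)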

# create and persist the Chroma collection for our 2 documents
sftt_coll = Chroma.from_documents(
  documents=chunks,
  embedding=embddg,
  persist_directory="rek",
  collection_name="sftthrone"
)

# Retrieving the context from the DB using similarity search
query_text = "What was said about the pandemic?"
results = sftt_coll.similarity_search_with_relevance_scores(query_text, k=3)
for tdoc, tscr in results:
  print(f"\nscore: {tscr}\npage_content: {tdoc.page_content}\nmetadata: {tdoc.metadata}")

And in the terminal, the above code produced the following.

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_5.py

score: 0.7307933852148415
page_content: In 2020, Canadians did not know they would face the crisis of a once-in-a-century pandemic. But, as always, no one should be surprised by how Canadians responded.
    We adapted. We helped one another. And we stayed true to our values.
    Values like compassion, courage, and determination.
metadata: {'source': 'parlinfo44'}

score: 0.6953513068860417
page_content: We must be clear-eyed: the world is a more dangerous and uncertain place than at any point since the Second World War. Canada is facing challenges that are unprecedented in our lifetimes.
metadata: {'source': 'parlinfo45'}

score: 0.6784103995557622
page_content: Their direction is clear: not only do they want Parliamentarians to work together to put this pandemic behind us, they also want bold, concrete solutions to meet the other challenges we face.
    Growing an economy that works for everyone.
    Fighting climate change.
metadata: {'source': 'parlinfo44'}
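A note on those scores: similarity_search_with_relevance_scores is meant to return values between 0 and 1, with higher meaning more similar. As far as I can tell, for a collection using the l2 space (Chroma's default) LangChain converts the distance Chroma reports into a score with roughly 1 - distance / sqrt(2). Treat that formula, the use of squared distance, and the helper names below as assumptions for illustration rather than documented behaviour. A minimal sketch with toy 2-D vectors:

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors.
    (Assumption: this is the distance reported for the 'l2' space.)"""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def relevance_from_distance(distance):
    """Assumed mapping from a distance to a relevance score (1.0 = identical)."""
    return 1.0 - distance / math.sqrt(2)


query = [1.0, 0.0]
same = [1.0, 0.0]    # identical to the query
other = [0.6, 0.8]   # also unit length, but pointing elsewhere

score_same = relevance_from_distance(l2_squared(query, same))
score_other = relevance_from_distance(l2_squared(query, other))
print(score_same, score_other)
```

With a mapping like this, an exact match scores 1.0 and the score drops as the vectors move apart, which fits the ordering of the three results above.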

Test Loading Collection from Persisted Datastore

Okay, let’s modify the code to load the collection from the persisted data. A new boolean to control code flow. And a new if/else block.

... ...
load_dotenv()

do_newDB = False

llm = ChatOpenAI(model="gpt-4o-mini")
embddg = OpenAIEmbeddings()

if do_newDB:
  tst_1 = """## Opening
    When my dear late mother, Queen Elizabeth II, opened a new Canadian Parliament in 1957, the Second World War remained a fresh, painful memory. The Cold War was intensifying. Freedom and democracy were under threat.

... ...

  """

  sftt_coll = Chroma.from_documents(
    documents=chunks,
    embedding=embddg,
    persist_directory="rek",
    collection_name="sftthrone"
  )
else:
  client = chromadb.PersistentClient(path='rek')
  print(f"colletcions: {client.list_collections()}")

  sftt_coll = Chroma(
    embedding_function=embddg,
    persist_directory="rek",
    collection_name="sftthrone"
  )

# Retrieving the context from the DB using similarity search
query_text = "What was said about the pandemic?"
results = sftt_coll.similarity_search_with_relevance_scores(query_text, k=3)
for tdoc, tscr in results:
  print(f"\nscore: {tscr}\npage_content: {tdoc.page_content}\nmetadata: {tdoc.metadata}")

When running the above I got the following warning in the terminal (along with the expected output).

R:\learn\ds_agent\chat_bot_5.py:100: LangChainDeprecationWarning: The class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-chroma package and should be used instead. To use it run `pip install -U :class:`~langchain-chroma` and import as `from :class:`~langchain_chroma import Chroma``.
  sftt_coll = Chroma(

So I figured I would get that ‘new’ package and go from there before showing you the output. However, during the first attempt, the libmamba solver failed, so I tried the classic solver.

Turns out the issue was likely not with the solver. My conda config had the following:

channel_priority: flexible
channels:
  - conda-forge
  - defaults

So, it was apparently trying the defaults channel before the conda-forge channel, even though I specified the conda-forge channel in the command. After waiting long enough I got the following.

Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: E:\appDev\Miniconda3\envs\agnt-3.12

  added / updated specs:
    - langchain-chroma

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    langchain-chroma-0.2.3     |     pyhd8ed1ab_0          17 KB  conda-forge
    ------------------------------------------------------------

The following NEW packages will be INSTALLED:

  langchain-chroma   conda-forge/noarch::langchain-chroma-0.2.3-pyhd8ed1ab_0

Proceed ([y]/n)? y

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

And after updating the import to from langchain_chroma import Chroma, I got the same output as in the first “test”, without the warning message (see following).

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_5.py
colletcions: ['sftthrone']

score: 0.7307933852148415
page_content: In 2020, Canadians did not know they would face the crisis of a once-in-a-century pandemic. But, as always, no one should be surprised by how Canadians responded.
    We adapted. We helped one another. And we stayed true to our values.
    Values like compassion, courage, and determination.
metadata: {'source': 'parlinfo44'}

score: 0.6953513068860417
page_content: We must be clear-eyed: the world is a more dangerous and uncertain place than at any point since the Second World War. Canada is facing challenges that are unprecedented in our lifetimes.
metadata: {'source': 'parlinfo45'}

score: 0.6784103995557622
page_content: Their direction is clear: not only do they want Parliamentarians to work together to put this pandemic behind us, they also want bold, concrete solutions to meet the other challenges we face.
    Growing an economy that works for everyone.
    Fighting climate change.
metadata: {'source': 'parlinfo44'}

Conda Config

I updated the conda config to use a strict channel priority. The configuration is now as follows.

(agnt-3.12) PS R:\learn\ds_agent> conda config --set channel_priority strict
(agnt-3.12) PS R:\learn\ds_agent> conda config --show-sources
==> C:\Users\bark\.condarc <==
channel_priority: strict
channels:
  - conda-forge
  - defaults

Sqlite Database

I was planning on building the agentic workflow using Chroma and perhaps web search for the chatbot. But, I noticed the file chroma.sqlite3 in the folder with the other Chroma files.

(agnt-3.12) PS R:\learn\ds_agent> Get-ChildItem .\rek -Recurse | Where-Object { $_.CreationTime -ge '06/26/2025'}

    Directory: R:\learn\ds_agent\rek

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----        2025-06-26     10:53                54439643-c567-41e7-a415-f70209ddb612
-a----        2025-06-26     10:53         364544 chroma.sqlite3

    Directory: R:\learn\ds_agent\rek\54439643-c567-41e7-a415-f70209ddb612

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----        2025-06-29     16:11        6284000 data_level0.bin
-a----        2025-06-29     16:11            100 header.bin
-a----        2025-06-29     16:11           4000 length.bin
-a----        2025-06-29     16:11              0 link_lists.bin

So I decided to have a look at its contents. The agentic workflow will likely have to wait for the next post.

New Module

Rather than add non-chatbot code to the current module, I created a new module, rek_chroma_db.py, in the rek directory. I will incrementally build it to look at the contents of chroma.sqlite3, likely pruning my code and output while writing the remainder of this post.

Names of Tables in Database

Let’s start simple and get the tables in the Sqlite database.

# rek_chroma_db.py: module to investigate the chroma.sqlite3 database file
#  ver 0.1, 2025.06.30, rek
import pandas as pd
import sqlite3

con = sqlite3.connect("chroma.sqlite3")
cur = con.cursor()

# get name of all tables in database
res = cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
# each row returned is a tuple; with a single selected column it has
# just 1 element, so take element 0 of each
db_tbls = [tp[0] for tp in res.fetchall()]
print(db_tbls)

# always ensure cursor and connection are closed
cur.close()
con.close()

And, in the terminal:

(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py
['migrations', 'embeddings_queue', 'embeddings_queue_config', 'collection_metadata', 'segment_metadata', 'tenants', 'databases', 'collections', 'maintenance_log', 'segments', 'embeddings', 'embedding_metadata', 'max_seq_id', 'embedding_fulltext_search', 'embedding_fulltext_search_data', 'embedding_fulltext_search_idx', 'embedding_fulltext_search_content', 'embedding_fulltext_search_docsize', 'embedding_fulltext_search_config']
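To see why the list comprehension takes element 0 of each row, here is a self-contained sketch against a throwaway in-memory database rather than the real chroma.sqlite3. The two tables are stand-ins created just for the demo.

```python
import sqlite3

# in-memory database, so nothing on disk is touched
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE tenants (id TEXT PRIMARY KEY)")
cur.execute("CREATE TABLE databases (id TEXT PRIMARY KEY, name TEXT NOT NULL)")

rows = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
# one selected column, so each row is a 1-element tuple
names = [row[0] for row in rows]
print(rows)
print(names)

cur.close()
con.close()
```

The first print shows the raw 1-element tuples and the second the flattened list of table names.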

Columns in Each Table

Let’s get the columns for each table. But first let’s just check out one of them.

... ...
prn_tbl_nms = False
... ...
if prn_tbl_nms:
  print(db_tbls)

# get column info for the first table found above (migrations)
res = cur.execute(f"PRAGMA table_info({db_tbls[0]})")
cols = res.fetchall()
print(f"\n{cols}")
... ...

And, the output is:

[(0, 'dir', 'TEXT', 1, None, 1), (1, 'version', 'INTEGER', 1, None, 2), (2, 'filename', 'TEXT', 1, None, 0), (3, 'sql', 'TEXT', 1, None, 0), (4, 'hash', 'TEXT', 1, None, 0)]

Only mildly informative.

So I added print(f"\n{cur.description}") after the line to print the cols variable. And:

(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py

[(0, 'dir', 'TEXT', 1, None, 1), (1, 'version', 'INTEGER', 1, None, 2), (2, 'filename', 'TEXT', 1, None, 0), (3, 'sql', 'TEXT', 1, None, 0), (4, 'hash', 'TEXT', 1, None, 0)]

(('cid', None, None, None, None, None, None), ('name', None, None, None, None, None, None), ('type', None, None, None, None, None, None), ('notnull', None, None, None, None, None, None), ('dflt_value', None, None, None, None, None, None), ('pk', None, None, None, None, None, None))

And I guess I could use that, but I saw a comment suggesting Pandas could tidy things up for me. I commented out the above code and added the following.

... ...
import pandas as pd
... ...
cols = pd.read_sql_query(f"PRAGMA table_info({db_tbls[0]})", con)
print(f"\n{cols.head()}")
... ...

And what I got was definitely a much tidier output for the migrations table.

(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py

   cid      name     type  notnull dflt_value  pk
0    0       dir     TEXT        1       None   1
1    1   version  INTEGER        1       None   2
2    2  filename     TEXT        1       None   0
3    3       sql     TEXT        1       None   0
4    4      hash     TEXT        1       None   0
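For comparison, the cur.description route could look something like the following sketch, which pairs the header names with each row using only the standard library. The demo table is made up, and this is just one way to do it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# hypothetical table, loosely modelled on the first two migrations columns
cur.execute("CREATE TABLE demo (dir TEXT NOT NULL, version INTEGER NOT NULL)")

res = cur.execute("PRAGMA table_info(demo)")
# the first item of each description entry is the column header
hdrs = [d[0] for d in cur.description]
# one dict per column of the demo table, keyed by header name
cols = [dict(zip(hdrs, row)) for row in res.fetchall()]
for col in cols:
    print(col)

cur.close()
con.close()
```

Each printed dict then reads like one row of the Pandas output, e.g. the name, type and notnull flag for a column.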

Let’s do all the tables. I know, overkill! I didn’t like that extra index column, so I set cid as the index column in the following code.

... ...
for tbl in db_tbls:
  cols = pd.read_sql_query(f"PRAGMA table_info({tbl})", con, index_col="cid")
  print(f"\n{tbl}:")
  print(f"{cols.head()}")
... ...

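And the lengthy output is as follows. Keep in mind that head() only returns the first five rows, so any table with more than five columns is cut short here (embedding_metadata, for instance, also has a bool_value column).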

(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py

migrations:
         name     type  notnull dflt_value  pk
cid
0         dir     TEXT        1       None   1
1     version  INTEGER        1       None   2
2    filename     TEXT        1       None   0
3         sql     TEXT        1       None   0
4        hash     TEXT        1       None   0

embeddings_queue:
           name       type  notnull         dflt_value  pk
cid
0        seq_id    INTEGER        0               None   1
1    created_at  TIMESTAMP        1  CURRENT_TIMESTAMP   0
2     operation    INTEGER        1               None   0
3         topic       TEXT        1               None   0
4            id       TEXT        1               None   0

embeddings_queue_config:
                name     type  notnull dflt_value  pk
cid
0                 id  INTEGER        0       None   1
1    config_json_str     TEXT        0       None   0

collection_metadata:
              name     type  notnull dflt_value  pk
cid
0    collection_id     TEXT        0       None   1
1              key     TEXT        1       None   2
2        str_value     TEXT        0       None   0
3        int_value  INTEGER        0       None   0
4      float_value     REAL        0       None   0

segment_metadata:
            name     type  notnull dflt_value  pk
cid
0     segment_id     TEXT        0       None   1
1            key     TEXT        1       None   2
2      str_value     TEXT        0       None   0
3      int_value  INTEGER        0       None   0
4    float_value     REAL        0       None   0

tenants:
    name  type  notnull dflt_value  pk
cid
0     id  TEXT        0       None   1

databases:
          name  type  notnull dflt_value  pk
cid
0           id  TEXT        0       None   1
1         name  TEXT        1       None   0
2    tenant_id  TEXT        1       None   0

collections:
                name     type  notnull dflt_value  pk
cid
0                 id     TEXT        0       None   1
1               name     TEXT        1       None   0
2          dimension  INTEGER        0       None   0
3        database_id     TEXT        1       None   0
4    config_json_str     TEXT        0       None   0

maintenance_log:
          name  type  notnull dflt_value  pk
cid
0           id   INT        0       None   1
1    timestamp   INT        1       None   0
2    operation  TEXT        1       None   0

segments:
           name  type  notnull dflt_value  pk
cid
0            id  TEXT        0       None   1
1          type  TEXT        1       None   0
2         scope  TEXT        1       None   0
3    collection  TEXT        1       None   0

embeddings:
             name       type  notnull         dflt_value  pk
cid
0              id    INTEGER        0               None   1
1      segment_id       TEXT        1               None   0
2    embedding_id       TEXT        1               None   0
3          seq_id       BLOB        1               None   0
4      created_at  TIMESTAMP        1  CURRENT_TIMESTAMP   0

embedding_metadata:
             name     type  notnull dflt_value  pk
cid
0              id  INTEGER        0       None   1
1             key     TEXT        1       None   2
2    string_value     TEXT        0       None   0
3       int_value  INTEGER        0       None   0
4     float_value     REAL        0       None   0

max_seq_id:
           name  type  notnull dflt_value  pk
cid
0    segment_id  TEXT        0       None   1
1        seq_id  BLOB        1       None   0

embedding_fulltext_search:
             name type  notnull dflt_value  pk
cid
0    string_value             0       None   0

embedding_fulltext_search_data:
      name     type  notnull dflt_value  pk
cid
0       id  INTEGER        0       None   1
1    block     BLOB        0       None   0

embedding_fulltext_search_idx:
      name type  notnull dflt_value  pk
cid
0    segid             1       None   1
1     term             1       None   2
2     pgno             0       None   0

embedding_fulltext_search_content:
    name     type  notnull dflt_value  pk
cid
0     id  INTEGER        0       None   1
1     c0                 0       None   0

embedding_fulltext_search_docsize:
    name     type  notnull dflt_value  pk
cid
0     id  INTEGER        0       None   1
1     sz     BLOB        0       None   0

embedding_fulltext_search_config:
    name type  notnull dflt_value  pk
cid
0      k             1       None   1
1      v             0       None   0

Contents of the collections Table

Okay, let’s see what’s stored in the collections table. The first time I didn’t transpose the dataframe, so Pandas truncated the output in the terminal due to the lengthy values for some of the columns. And, even transposed, it still truncated the config_json_str column, so I added some code to print that out separately.

... ...
prn_all_cols = False
... ...
if prn_all_cols:
  for tbl in db_tbls:
... ...
tbl = "collections"
res = pd.read_sql_query(f"SELECT * FROM {tbl}", con)
print(f"\n{tbl}:")
print(f"{res.transpose()}")
print(f"\n{res.at[0, 'config_json_str']}")

(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py

collections:
                                                                 0
id                            abf4c884-9968-4a32-8961-3cf7049eaae5
name                                                     sftthrone
dimension                                                     1536
database_id                   00000000-0000-0000-0000-000000000000
config_json_str  {"hnsw_configuration": {"space": "l2", "ef_con...

{"hnsw_configuration": {"space": "l2", "ef_construction": 100, "ef_search": 100, "num_threads": 16, "M": 16, "resize_factor": 1.2, "batch_size": 100, "sync_threshold": 1000, "_type": "HNSWConfigurationInternal"}, "_type": "CollectionConfigurationInternal"}

And, sftthrone is the one and only collection I created above.
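That config_json_str value is plain JSON, so the standard json module can make it easier to read. A minimal sketch, with the string copied from the output above instead of being read from the database:

```python
import json

# value of config_json_str as printed above
config_json_str = (
    '{"hnsw_configuration": {"space": "l2", "ef_construction": 100, '
    '"ef_search": 100, "num_threads": 16, "M": 16, "resize_factor": 1.2, '
    '"batch_size": 100, "sync_threshold": 1000, '
    '"_type": "HNSWConfigurationInternal"}, '
    '"_type": "CollectionConfigurationInternal"}'
)

cfg = json.loads(config_json_str)
# pretty-print the whole configuration
print(json.dumps(cfg, indent=2))

# individual settings are then ordinary dict lookups
hnsw = cfg["hnsw_configuration"]
print(hnsw["space"], hnsw["M"])
```

In the real module the same thing should work on res.at[0, 'config_json_str'] directly.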

embedding_metadata Table

And a quick look at one more table.

tbl = "embedding_metadata"
res = pd.read_sql_query(f"SELECT * FROM {tbl}", con)
print(f"\n{tbl}:")
print(f"{res.head(6)}")
print(f"... ...\n{res.tail(6)}")

(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py

embedding_metadata:
   id              key                                       string_value int_value float_value bool_value
0   1           source                                         parlinfo45      None        None       None
1   1  chroma:document                                         ## Opening      None        None       None
2   2           source                                         parlinfo45      None        None       None
3   2  chroma:document  When my dear late mother, Queen Elizabeth II, ...      None        None       None
4   3           source                                         parlinfo45      None        None       None
5   3  chroma:document  were under threat. Canada was emerging as a gr...      None        None       None
... ...
    id              key                                       string_value int_value float_value bool_value
28  15           source                                         parlinfo44      None        None       None
29  15  chroma:document  Their direction is clear: not only do they wan...      None        None       None
30  16           source                                         parlinfo44      None        None       None
31  16  chroma:document  Growing an economy that works for everyone.\n ...      None        None       None
32  17           source                                         parlinfo44      None        None       None
33  17  chroma:document  Yes, the decade got off to an incredibly diffi...      None        None       None
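The table stores one row per (id, key) pair, so each chunk's metadata and text are spread over two rows. A minimal sketch of regrouping them per id, using an in-memory stand-in for the table with only three of its columns and a few abbreviated sample rows (the text of the second chunk is shortened here, not the real stored value):

```python
import sqlite3
from collections import defaultdict

con = sqlite3.connect(":memory:")
# cut-down stand-in for the embedding_metadata table
con.execute(
    "CREATE TABLE embedding_metadata ("
    "id INTEGER, key TEXT NOT NULL, string_value TEXT, PRIMARY KEY (id, key))"
)
con.executemany(
    "INSERT INTO embedding_metadata VALUES (?, ?, ?)",
    [
        (1, "source", "parlinfo45"),
        (1, "chroma:document", "## Opening"),
        (15, "source", "parlinfo44"),
        (15, "chroma:document", "Their direction is clear: ..."),  # abbreviated
    ],
)

# regroup the key/value rows into one dict per embedding id
by_id = defaultdict(dict)
query = "SELECT id, key, string_value FROM embedding_metadata ORDER BY id"
for emb_id, key, value in con.execute(query):
    by_id[emb_id][key] = value
con.close()

for emb_id, fields in by_id.items():
    print(emb_id, fields)
```

Each id then maps to a small dict holding both the source label and the chunk text.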

I will likely have a quick look at all the tables. But that’s enough for this post.

Done I Think

Well, code for different purposes. Lots of terminal output (perhaps mostly useless). But, a learning experience nonetheless.

I am definitely leaving coding the agentic workflow until next time. Not that it is likely to be a lengthy post.

Until then, side tracking may not get you to the current goal, but it will almost always teach you something.

Resources