Okay, I don’t know how this actually fits in with agentic chatbots, but I want to look at using a vector database to store document chunks and embeddings. After that I may also take a look at graph databases, though that idea is presently seriously iffy.
We previously chunked a document then stored the embeddings in an in-memory vector store. I thought it would be interesting to take a look at doing this with a vector database. After looking around a bit, I am going to try to use ChromaDB.
Embeddings, vector search, document storage, full-text search, metadata filtering, and multi-modal. All in one place. Retrieval that just works. As it should be.
Chroma
Install Chroma
Fortunately, it is available on conda-forge. Unfortunately, a large number of files were installed and/or updated.
(agnt-3.12) PS R:\learn\ds_agent> conda install -c conda-forge chromadb
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
... ...
Proceed ([y]/n)? y
Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Test Installation
The heading says ‘Test’, but this first bit is really just playing around with Chroma on a small sample of documents. Well, two very short documents. The content is taken from the Canadian Speech from the Throne for the 44th and 45th Parliaments. In both cases it is the section titled ‘Opening’.
This did take a bit of fooling around. The simple collection creation shown in the Chroma docs was not what I wanted: it did not chunk the docs, nor generate embeddings for the chunks.
So back to basics.
# chat_bot_5.py:
# version: 0.1.0: 2025.06.23, rek, look at using vector database, ChromaDB
from dotenv import load_dotenv
import chromadb
import chromadb
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.vectorstores.chroma import Chroma
from langchain_community.vectorstores import Chroma
load_dotenv()
llm = ChatOpenAI(model="gpt-4o-mini")
embddg = OpenAIEmbeddings()
tst_1 = """## Opening
When my dear late mother, Queen Elizabeth II, opened a new Canadian Parliament in 1957, the Second World War remained a fresh, painful memory. The Cold War was intensifying. Freedom and democracy were under threat. Canada was emerging as a growing economic power and a force for peace in the world. In the decades since, history has been punctuated by epoch-making events: the Vietnam War, the fall of the Berlin Wall, and the start of the War on Terror. Today, Canada faces another critical moment. Democracy, pluralism, the rule of law, self-determination, and freedom are values which Canadians hold dear, and ones which the Government is determined to protect.
The system of open global trade that, while not perfect, has helped to deliver prosperity for Canadians for decades, is changing. Canada’s relationships with partners are also changing.
We must be clear-eyed: the world is a more dangerous and uncertain place than at any point since the Second World War. Canada is facing challenges that are unprecedented in our lifetimes.
Many Canadians are feeling anxious and worried about the drastically changing world around them. Fundamental change is always unsettling. Yet this moment is also an incredible opportunity. An opportunity for renewal. An opportunity to think big and to act bigger. An opportunity for Canada to embark on the largest transformation of its economy since the Second World War. A confident Canada, which has welcomed new Canadians, including from some of the most tragic global conflict zones, can seize this opportunity by recognising that all Canadians can give themselves far more than any foreign power on any continent can ever take away. And that by staying true to Canadian values, Canada can build new alliances and a new economy that serves all Canadians.
"""
tst_2 = """## Opening
As we speak, British Columbians are facing immeasurable challenges as their homes, their communities, and their wellbeing are impacted by terrible flooding.
But in a time of crisis, we know how Canadians respond. We step up and we are there for each other.
And the Government will continue to be there for the people of British Columbia.
In 2020, Canadians did not know they would face the crisis of a once-in-a-century pandemic. But, as always, no one should be surprised by how Canadians responded.
We adapted. We helped one another. And we stayed true to our values.
Values like compassion, courage, and determination.
Values like democracy.
And in this difficult time, Canadians made a democratic choice.
Their direction is clear: not only do they want Parliamentarians to work together to put this pandemic behind us, they also want bold, concrete solutions to meet the other challenges we face.
Growing an economy that works for everyone.
Fighting climate change.
Moving forward on the path of reconciliation.
Making sure our communities are safe, healthy, and inclusive.
Yes, the decade got off to an incredibly difficult start, but this is the time to rebuild.
This is the moment for Parliamentarians to work together to get big things done, and shape a better future for our kids.
"""
# split docs into chunks for embedding
txt_spltr = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=100)
# create_documents both wraps the texts in LangChain Document objects and
# splits them into chunks, so no separate split_documents call is needed
chunks = txt_spltr.create_documents(
    texts=[tst_1, tst_2],
    metadatas=[{"source": "parlinfo45"}, {"source": "parlinfo44"}],
)
# create and persist the Chroma collection for our 2 documents
sftt_coll = Chroma.from_documents(
documents=chunks,
embedding=embddg,
persist_directory="rek",
collection_name="sftthrone"
)
# Retrieving the context from the DB using similarity search
query_text = "What was said about the pandemic?"
results = sftt_coll.similarity_search_with_relevance_scores(query_text, k=3)
for tdoc, tscr in results:
    print(f"\nscore: {tscr}\npage_content: {tdoc.page_content}\nmetadata: {tdoc.metadata}")
And in the terminal, the above code produced the following.
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_5.py
score: 0.7307933852148415
page_content: In 2020, Canadians did not know they would face the crisis of a once-in-a-century pandemic. But, as always, no one should be surprised by how Canadians responded.
We adapted. We helped one another. And we stayed true to our values.
Values like compassion, courage, and determination.
metadata: {'source': 'parlinfo44'}
score: 0.6953513068860417
page_content: We must be clear-eyed: the world is a more dangerous and uncertain place than at any point since the Second World War. Canada is facing challenges that are unprecedented in our lifetimes.
metadata: {'source': 'parlinfo45'}
score: 0.6784103995557622
page_content: Their direction is clear: not only do they want Parliamentarians to work together to put this pandemic behind us, they also want bold, concrete solutions to meet the other challenges we face.
Growing an economy that works for everyone.
Fighting climate change.
metadata: {'source': 'parlinfo44'}
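As an aside, the effect of the chunk_size and chunk_overlap parameters can be illustrated with a toy fixed-window splitter. This is a simplified stand-in I wrote for illustration only; the real RecursiveCharacterTextSplitter also prefers to break on paragraph and sentence boundaries.

```python
def toy_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-window splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share an overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = toy_chunk(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# each chunk repeats the last 4 characters of the previous one:
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap is what lets a sentence that straddles a chunk boundary still show up whole in at least one chunk, which is why the retrieved passages above read coherently.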
Test Loading Collection from Persisted Datastore
Okay, let’s modify the code to load the collection from the persisted data. A new boolean to control code flow. And a new if/else block.
... ...
load_dotenv()
do_newDB = False
llm = ChatOpenAI(model="gpt-4o-mini")
embddg = OpenAIEmbeddings()
if do_newDB:
    tst_1 = """## Opening
When my dear late mother, Queen Elizabeth II, opened a new Canadian Parliament in 1957, the Second World War remained a fresh, painful memory. The Cold War was intensifying. Freedom and democracy were under threat.
... ...
"""
    sftt_coll = Chroma.from_documents(
        documents=chunks,
        embedding=embddg,
        persist_directory="rek",
        collection_name="sftthrone"
    )
else:
    client = chromadb.PersistentClient(path='rek')
    print(f"collections: {client.list_collections()}")
    sftt_coll = Chroma(
        embedding_function=embddg,
        persist_directory="rek",
        collection_name="sftthrone"
    )
# Retrieving the context from the DB using similarity search
query_text = "What was said about the pandemic?"
results = sftt_coll.similarity_search_with_relevance_scores(query_text, k=3)
for tdoc, tscr in results:
    print(f"\nscore: {tscr}\npage_content: {tdoc.page_content}\nmetadata: {tdoc.metadata}")
When running the above I got the following warning in the terminal (along with the expected output).
R:\learn\ds_agent\chat_bot_5.py:100: LangChainDeprecationWarning: The class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 1.0. An updated version of the class exists in the langchain-chroma package and should be used instead. To use it run `pip install -U langchain-chroma` and import as `from langchain_chroma import Chroma`.
sftt_coll = Chroma(
So I figured I would get that ’new’ package and go from there before showing you the output. However, during the first attempt, the libmamba solver failed, so I tried the classic solver.
Turns out the issue was likely not with the solver. My conda config had the following:
channel_priority: flexible
channels:
- conda-forge
- defaults
So it was apparently trying the defaults channel before the conda-forge channel, even though I specified the conda-forge channel in the command. After waiting long enough, I got the following.
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: E:\appDev\Miniconda3\envs\agnt-3.12
added / updated specs:
- langchain-chroma
The following packages will be downloaded:
package | build
---------------------------|-----------------
langchain-chroma-0.2.3 | pyhd8ed1ab_0 17 KB conda-forge
------------------------------------------------------------
The following NEW packages will be INSTALLED:
langchain-chroma conda-forge/noarch::langchain-chroma-0.2.3-pyhd8ed1ab_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
And after updating the import (from langchain_chroma import Chroma, per the warning), I got the same output as the first “test”, without the warning message (see following).
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_5.py
collections: ['sftthrone']
score: 0.7307933852148415
page_content: In 2020, Canadians did not know they would face the crisis of a once-in-a-century pandemic. But, as always, no one should be surprised by how Canadians responded.
We adapted. We helped one another. And we stayed true to our values.
Values like compassion, courage, and determination.
metadata: {'source': 'parlinfo44'}
score: 0.6953513068860417
page_content: We must be clear-eyed: the world is a more dangerous and uncertain place than at any point since the Second World War. Canada is facing challenges that are unprecedented in our lifetimes.
metadata: {'source': 'parlinfo45'}
score: 0.6784103995557622
page_content: Their direction is clear: not only do they want Parliamentarians to work together to put this pandemic behind us, they also want bold, concrete solutions to meet the other challenges we face.
Growing an economy that works for everyone.
Fighting climate change.
metadata: {'source': 'parlinfo44'}
Conda Config
I updated the conda config to use a strict channel priority. The configuration is now as follows.
(agnt-3.12) PS R:\learn\ds_agent> conda config --set channel_priority strict
(agnt-3.12) PS R:\learn\ds_agent> conda config --show-sources
==> C:\Users\bark\.condarc <==
channel_priority: strict
channels:
- conda-forge
- defaults
SQLite Database
I was planning on building the agentic workflow using Chroma and perhaps web search for the chatbot. But, I noticed the file chroma.sqlite3
in the folder with the other Chroma files.
(agnt-3.12) PS R:\learn\ds_agent> Get-ChildItem .\rek -Recurse | Where-Object { $_.CreationTime -ge '06/26/2025'}
Directory: R:\learn\ds_agent\rek
Mode LastWriteTime Length Name
---- ------------- ------ ----
d----- 2025-06-26 10:53 54439643-c567-41e7-a415-f70209ddb612
-a---- 2025-06-26 10:53 364544 chroma.sqlite3
Directory: R:\learn\ds_agent\rek\54439643-c567-41e7-a415-f70209ddb612
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2025-06-29 16:11 6284000 data_level0.bin
-a---- 2025-06-29 16:11 100 header.bin
-a---- 2025-06-29 16:11 4000 length.bin
-a---- 2025-06-29 16:11 0 link_lists.bin
So I decided to have a look at its contents. The agentic workflow will likely have to wait for the next post.
New Module
Rather than add non-chatbot code to the current module, I created a new module, rek_chroma_db.py, in the rek directory. I will incrementally build it up to look at the contents of that file, likely pruning my code and output while writing the remainder of this post.
Names of Tables in Database
Let’s start simple and get the names of the tables in the SQLite database.
# rek_chroma_db.py: module to investigate the chroma.sqlite3 database file
# ver 0.1, 2025.06.30, rek
import sqlite3

con = sqlite3.connect("chroma.sqlite3")
cur = con.cursor()
# get the name of every table in the database
res = cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
# each row returned is a one-element tuple, so unpack the name
db_tbls = [tp[0] for tp in res.fetchall()]
print(db_tbls)
# always ensure cursor and connection are closed
cur.close()
con.close()
And, in the terminal:
(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py
['migrations', 'embeddings_queue', 'embeddings_queue_config', 'collection_metadata', 'segment_metadata', 'tenants', 'databases', 'collections', 'maintenance_log', 'segments', 'embeddings', 'embedding_metadata', 'max_seq_id', 'embedding_fulltext_search', 'embedding_fulltext_search_data', 'embedding_fulltext_search_idx', 'embedding_fulltext_search_content', 'embedding_fulltext_search_docsize', 'embedding_fulltext_search_config']
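Before digging into individual tables, a row count per table gives a quick feel for where the data actually lives. A small sketch of how that might look (demonstrated against an in-memory database here; pointing the connection at chroma.sqlite3 should work the same way):

```python
import sqlite3

def table_row_counts(con: sqlite3.Connection) -> dict[str, int]:
    """Return {table_name: row_count} for every table in the database."""
    cur = con.cursor()
    # fetch all table names first, since the cursor is reused below
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    # table names can't be bound as SQL parameters, so interpolate the
    # names we just read back from sqlite_master itself
    return {t: cur.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in tables}

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE collections (id TEXT, name TEXT);
    CREATE TABLE segments (id TEXT);
    INSERT INTO collections VALUES ('abf4', 'sftthrone');
""")
print(table_row_counts(con))  # {'collections': 1, 'segments': 0}
con.close()
```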
Columns in Each Table
Let’s get the columns for each table. But first let’s just check out one of them.
... ...
prn_tbl_nms = False
... ...
if prn_tbl_nms:
    print(db_tbls)

# get column names for all tables found above
res = cur.execute(f"PRAGMA table_info({db_tbls[0]})")
cols = res.fetchall()
print(f"\n{cols}")
... ...
And, the output is:
[(0, 'dir', 'TEXT', 1, None, 1), (1, 'version', 'INTEGER', 1, None, 2), (2, 'filename', 'TEXT', 1, None, 0), (3, 'sql', 'TEXT', 1, None, 0), (4, 'hash', 'TEXT', 1, None, 0)]
Only mildly informative.
So I added print(f"\n{cur.description}") after the line printing the cols variable. And:
(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py
[(0, 'dir', 'TEXT', 1, None, 1), (1, 'version', 'INTEGER', 1, None, 2), (2, 'filename', 'TEXT', 1, None, 0), (3, 'sql', 'TEXT', 1, None, 0), (4, 'hash', 'TEXT', 1, None, 0)]
(('cid', None, None, None, None, None, None), ('name', None, None, None, None, None, None), ('type', None, None, None, None, None, None), ('notnull', None, None, None, None, None, None), ('dflt_value', None, None, None, None, None, None), ('pk', None, None, None, None, None, None))
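For the record, the useful part of cur.description is just the first element of each 7-tuple; the sqlite3 module leaves the other six slots as None (they exist for DB-API 2.0 compatibility). So the column names can be pulled out with a short comprehension, for example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE demo (cid INTEGER, name TEXT, type TEXT)")
# description is populated after any SELECT, even one returning no rows
cur.execute("SELECT * FROM demo")
col_names = [d[0] for d in cur.description]
print(col_names)  # ['cid', 'name', 'type']
cur.close()
con.close()
```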
And I guess I could use that, but I saw a comment suggesting pandas could tidy things up for me. I commented out the above code and added the following.
... ...
import pandas as pd
... ...
cols = pd.read_sql_query(f"PRAGMA table_info({db_tbls[0]})", con)
print(f"\n{cols.head()}")
... ...
And what I got was definitely much tidier output for the migrations table.
(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py
cid name type notnull dflt_value pk
0 0 dir TEXT 1 None 1
1 1 version INTEGER 1 None 2
2 2 filename TEXT 1 None 0
3 3 sql TEXT 1 None 0
4 4 hash TEXT 1 None 0
Let’s do all the tables (overkill, I know!). I didn’t like that extra index column, so I set cid as the index column in the following code.
... ...
for tbl in db_tbls:
    cols = pd.read_sql_query(f"PRAGMA table_info({tbl})", con, index_col="cid")
    print(f"\n{tbl}:")
    print(f"{cols.head()}")
... ...
And the lengthy output is as follows.
(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py
migrations:
name type notnull dflt_value pk
cid
0 dir TEXT 1 None 1
1 version INTEGER 1 None 2
2 filename TEXT 1 None 0
3 sql TEXT 1 None 0
4 hash TEXT 1 None 0
embeddings_queue:
name type notnull dflt_value pk
cid
0 seq_id INTEGER 0 None 1
1 created_at TIMESTAMP 1 CURRENT_TIMESTAMP 0
2 operation INTEGER 1 None 0
3 topic TEXT 1 None 0
4 id TEXT 1 None 0
embeddings_queue_config:
name type notnull dflt_value pk
cid
0 id INTEGER 0 None 1
1 config_json_str TEXT 0 None 0
collection_metadata:
name type notnull dflt_value pk
cid
0 collection_id TEXT 0 None 1
1 key TEXT 1 None 2
2 str_value TEXT 0 None 0
3 int_value INTEGER 0 None 0
4 float_value REAL 0 None 0
segment_metadata:
name type notnull dflt_value pk
cid
0 segment_id TEXT 0 None 1
1 key TEXT 1 None 2
2 str_value TEXT 0 None 0
3 int_value INTEGER 0 None 0
4 float_value REAL 0 None 0
tenants:
name type notnull dflt_value pk
cid
0 id TEXT 0 None 1
databases:
name type notnull dflt_value pk
cid
0 id TEXT 0 None 1
1 name TEXT 1 None 0
2 tenant_id TEXT 1 None 0
collections:
name type notnull dflt_value pk
cid
0 id TEXT 0 None 1
1 name TEXT 1 None 0
2 dimension INTEGER 0 None 0
3 database_id TEXT 1 None 0
4 config_json_str TEXT 0 None 0
maintenance_log:
name type notnull dflt_value pk
cid
0 id INT 0 None 1
1 timestamp INT 1 None 0
2 operation TEXT 1 None 0
segments:
name type notnull dflt_value pk
cid
0 id TEXT 0 None 1
1 type TEXT 1 None 0
2 scope TEXT 1 None 0
3 collection TEXT 1 None 0
embeddings:
name type notnull dflt_value pk
cid
0 id INTEGER 0 None 1
1 segment_id TEXT 1 None 0
2 embedding_id TEXT 1 None 0
3 seq_id BLOB 1 None 0
4 created_at TIMESTAMP 1 CURRENT_TIMESTAMP 0
embedding_metadata:
name type notnull dflt_value pk
cid
0 id INTEGER 0 None 1
1 key TEXT 1 None 2
2 string_value TEXT 0 None 0
3 int_value INTEGER 0 None 0
4 float_value REAL 0 None 0
max_seq_id:
name type notnull dflt_value pk
cid
0 segment_id TEXT 0 None 1
1 seq_id BLOB 1 None 0
embedding_fulltext_search:
name type notnull dflt_value pk
cid
0 string_value 0 None 0
embedding_fulltext_search_data:
name type notnull dflt_value pk
cid
0 id INTEGER 0 None 1
1 block BLOB 0 None 0
embedding_fulltext_search_idx:
name type notnull dflt_value pk
cid
0 segid 1 None 1
1 term 1 None 2
2 pgno 0 None 0
embedding_fulltext_search_content:
name type notnull dflt_value pk
cid
0 id INTEGER 0 None 1
1 c0 0 None 0
embedding_fulltext_search_docsize:
name type notnull dflt_value pk
cid
0 id INTEGER 0 None 1
1 sz BLOB 0 None 0
embedding_fulltext_search_config:
name type notnull dflt_value pk
cid
0 k 1 None 1
1 v 0 None 0
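Another way to see the same schema information is the raw CREATE statements that SQLite keeps in the sql column of sqlite_master. A quick sketch (demonstrated on an in-memory database with a mocked-up migrations table, since the real chroma.sqlite3 is not available here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE migrations (dir TEXT NOT NULL, version INTEGER NOT NULL)")
# sqlite_master stores the original DDL verbatim in its sql column
for name, sql in con.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(f"{name}:\n{sql}\n")
con.close()
```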
Contents of the collections Table
Okay, let’s see what’s stored in the collections table. The first time through I didn’t transpose the dataframe, so pandas truncated the output in the terminal due to the lengthy values in some of the columns. And it still truncated the config_json_str column, so I added some code to print that out separately.
... ...
prn_all_cols = False
... ...
if prn_all_cols:
    for tbl in db_tbls:
        ... ...

tbl = "collections"
res = pd.read_sql_query(f"SELECT * FROM {tbl}", con)
print(f"\n{tbl}:")
print(f"{res.transpose()}")
print(f"\n{res.at[0, 'config_json_str']}")
(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py
collections:
0
id abf4c884-9968-4a32-8961-3cf7049eaae5
name sftthrone
dimension 1536
database_id 00000000-0000-0000-0000-000000000000
config_json_str {"hnsw_configuration": {"space": "l2", "ef_con...
{"hnsw_configuration": {"space": "l2", "ef_construction": 100, "ef_search": 100, "num_threads": 16, "M": 16, "resize_factor": 1.2, "batch_size": 100, "sync_threshold": 1000, "_type": "HNSWConfigurationInternal"}, "_type": "CollectionConfigurationInternal"}
And, sftthrone is the one and only collection I created above.
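Since config_json_str is just JSON, it can be parsed to pull the HNSW settings out programmatically. A quick sketch using the string shown above:

```python
import json

# the config_json_str value printed above, as a Python string
config_json_str = (
    '{"hnsw_configuration": {"space": "l2", "ef_construction": 100, '
    '"ef_search": 100, "num_threads": 16, "M": 16, "resize_factor": 1.2, '
    '"batch_size": 100, "sync_threshold": 1000, '
    '"_type": "HNSWConfigurationInternal"}, '
    '"_type": "CollectionConfigurationInternal"}'
)
cfg = json.loads(config_json_str)
hnsw = cfg["hnsw_configuration"]
print(hnsw["space"], hnsw["M"], hnsw["ef_search"])  # l2 16 100
```

So the collection is using an HNSW index with squared-L2 distance, which is Chroma’s default distance metric.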
embedding_metadata Table
And a quick look at one more table.
tbl = "embedding_metadata"
res = pd.read_sql_query(f"SELECT * FROM {tbl}", con)
print(f"\n{tbl}:")
print(f"{res.head(6)}")
print(f"... ...\n{res.tail(6)}")
(agnt-3.12) PS R:\learn\ds_agent\rek> python rek_chroma_db.py
embedding_metadata:
id key string_value int_value float_value bool_value
0 1 source parlinfo45 None None None
1 1 chroma:document ## Opening None None None
2 2 source parlinfo45 None None None
3 2 chroma:document When my dear late mother, Queen Elizabeth II, ... None None None
4 3 source parlinfo45 None None None
5 3 chroma:document were under threat. Canada was emerging as a gr... None None None
... ...
id key string_value int_value float_value bool_value
28 15 source parlinfo44 None None None
29 15 chroma:document Their direction is clear: not only do they wan... None None None
30 16 source parlinfo44 None None None
31 16 chroma:document Growing an economy that works for everyone.\n ... None None None
32 17 source parlinfo44 None None None
33 17 chroma:document Yes, the decade got off to an incredibly diffi... None None None
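Note that this key/value layout spreads each embedding’s metadata over several rows (one for source, one for chroma:document, and so on). They can be regrouped into one dict per embedding id; a sketch against a mocked-up table with the same column layout, since the real chroma.sqlite3 is not to hand here:

```python
import sqlite3
from collections import defaultdict

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE embedding_metadata (id INTEGER, key TEXT, string_value TEXT);
    INSERT INTO embedding_metadata VALUES
        (1, 'source', 'parlinfo45'),
        (1, 'chroma:document', '## Opening'),
        (2, 'source', 'parlinfo45'),
        (2, 'chroma:document', 'When my dear late mother...');
""")
# fold the key/value rows into one dict per embedding id
by_id: dict[int, dict[str, str]] = defaultdict(dict)
for eid, key, val in con.execute(
        "SELECT id, key, string_value FROM embedding_metadata"):
    by_id[eid][key] = val
print(dict(by_id))
con.close()
```

This makes it easy to see, per chunk, both its source document and its stored text side by side.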
I will likely have a quick look at all the tables. But that’s enough for this post.
Done I Think
Well, code for different purposes. Lots of terminal output (perhaps mostly useless). But a learning experience nonetheless.
I am definitely leaving coding the agentic workflow until next time. Not that it will likely be a lengthy post.
Until then, side tracking may not get you to the current goal, but it will almost always teach you something.
Resources
- Chroma: Getting Started
- Chroma Cookbook: Collections
- LangChain | API Reference | Chroma
- ! Deprecated — langchain_community.vectorstores.chroma.Chroma
- LangChain Python API Reference | langchain-chroma | vectorstores | Chroma
- hwchase17 | chroma-langchain
- What is Chroma DB?
- Chroma: One of the best vector databases to use with LangChain for storing embeddings
- Implementing RAG in LangChain with Chroma: A Step-by-Step Guide
- Parliament of Canada: Speeches from the Throne
- conda config
- sqlite3 — DB-API 2.0 interface for SQLite databases
- sqlite3 - PRAGMA Statements
- pandas.read_sql_query
- How to Find Files Modified After a Specific Date Using PowerShell?