Okay, let’s look at using private or proprietary data stored in documents of one sort or another. Not too sure how I am going to go about this as I don’t really have any such data. I will likely just use a web page that was published after the LLM’s training was completed. Perhaps something related to Mark Carney’s cabinet appointments on May 13, 2025. Don’t expect that to be in the LLM’s training data.
And, yes we could just continue to use a web search; but, that’s not the point of this current project. We want to look at having an agent use another set of tools and/or toolkits to access local, and likely private, documents.
CSV File
I have decided to use a CSV file or two as the RAG data source. They are related to the 2025 Canadian Federal Election. I downloaded them from Members of Parliament . Specifically, the Election Candidate and Constituencies datasets. I may also continue to use a web page as well. If so, likely the Wikipedia page on the election. Though not currently sure about the latter.
Not that I have found a lot of examples of how to code this form of RAG. But, that’s what learning is all about.
New Library or Two
To use the CSV files as agent sources, I plan to use create_csv_agent
. Which means I need to install the langchain-experimental
library/package. And, since a Dataframe is used under the hood by create_csv_agent
, that will also require that Pandas is installed. For some reason, miniconda did not find Pandas to be a dependency of langchain-experimental
.
agnt-3.12) PS R:\learn\ds_agent> conda install -c conda-forge langchain-experimental pandas
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done
... ...
The following NEW packages will be INSTALLED:
langchain-experim~ conda-forge/noarch::langchain-experimental-0.3.4-pyhd8ed1ab_1
pandas conda-forge/win-64::pandas-2.2.3-py312h72972c8_3
python-dateutil conda-forge/noarch::python-dateutil-2.9.0.post0-pyhff2d567_1
python-tzdata conda-forge/noarch::python-tzdata-2025.2-pyhd8ed1ab_0
pytz conda-forge/noarch::pytz-2025.2-pyhd8ed1ab_0
six conda-forge/noarch::six-1.17.0-pyhd8ed1ab_0
... ...
Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Some Test Code
Okay, a wee test, let’s see if I can get create_csv_agent
to work. Add a bit of code to previous chatbot module.
... ...
from langchain_experimental.agents.agent_toolkits import create_csv_agent
... ...
model = ChatMistralAI(model="mistral-large-latest")
csv_exec = create_csv_agent(
model,
"rek/constituencies.csv",
verbose=True,
)
print(csv_exec)
And, that did not work; another traceback in the terminal. The key part of which was:
ValueError: This agent relies on access to a python repl tool which can execute arbitrary code. This can be dangerous and requires a specially sandboxed environment to be safely used. Please read the security notice in the doc-string of this function. You must opt-in to use this functionality by setting allow_dangerous_code=True.For general security guidelines, please see: https://python.langchain.com/docs/security/
So add that parameter to the call to create_csv_agent
.
csv_exec = create_csv_agent(
model,
"rek/constituencies.csv",
allow_dangerous_code=True,
verbose=True,
)
Then another problem!
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
Traceback (most recent call last):
File "E:\appDev\Miniconda3\envs\agnt-3.12\Lib\site-packages\pandas\compat\_optional.py", line 135, in import_optional_dependency
module = importlib.import_module(name)
... ...
ImportError: Missing optional dependency 'tabulate'. Use pip or conda to install tabulate.
So, I installed tabulate
.
(agnt-3.12) PS R:\learn\ds_agent> conda install conda-forge::tabulate
Collecting package metadata (current_repodata.json): done
Solving environment: done
... ...
The following NEW packages will be INSTALLED:
tabulate conda-forge/noarch::tabulate-0.9.0-pyhd8ed1ab_2
... ...
Executing transaction: done
And finally, got the following output in the terminal.
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
verbose=True agent=RunnableAgent(runnable=RunnableAssign(mapper={
agent_scratchpad: RunnableLambda(lambda x: format_log_to_str(x['intermediate_steps']))
})
| PromptTemplate(input_variables=['agent_scratchpad', 'input'], input_types={}, partial_variables={'df_head': "| | Constituency | Province / Territory | Honorific Title | First Name | Last Name | Political Affiliation |\n|---:|:---------------------------------|:-----------------------|:------------------|:-------------|:------------|:------------------------|\n| 0 | Abbotsford—South Langley | British Columbia | nan | Sukhman | Gill | Conservative |\n| 1 | Abitibi—Baie-James—Nunavik—Eeyou | Quebec | Hon. | Mandy | Gull-Masty | Liberal |\n| 2 | Abitibi—Témiscamingue | Quebec | nan | Sébastien | Lemire | Bloc Québécois |\n| 3 | Acadie—Annapolis | Nova Scotia | nan | Chris | d'Entremont | Conservative |\n| 4 | Acadie—Bathurst | New Brunswick | nan | Serge | Cormier | Liberal |", 'tools': 'python_repl_ast - A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated - make sure it does not look abbreviated before using it in your answer.', 'tool_names': 'python_repl_ast'}, template='\nYou are working with a pandas dataframe in Python. The name of the dataframe is `df`.\nYou should use the tools below to answer the question posed of you:\n\n{tools}\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [{tool_names}]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\n\nThis is the result of `print(df.head())`:\n{df_head}\n\nBegin!\nQuestion: {input}\n{agent_scratchpad}')
| RunnableBinding(bound=ChatMistralAI(client=<httpx.Client object at 0x000002737D2AF800>, async_client=<httpx.AsyncClient object at 0x000002737D2AF890>, mistral_api_key=SecretStr('**********'), endpoint='https://api.mistral.ai/v1', model='mistral-large-latest', model_kwargs={}), kwargs={'stop': ['\nObservation']}, config={}, config_factories=[])
| ReActSingleInputOutputParser(), input_keys_arg=['input'], return_keys_arg=['output'], stream_runnable=True) tools=[PythonAstREPLTool(globals={}, locals={'df': Constituency Province / Territory ... Last Name Political Affiliation
0 Abbotsford—South Langley British Columbia ... Gill Conservative
1 Abitibi—Baie-James—Nunavik—Eeyou Quebec ... Gull-Masty Liberal
2 Abitibi—Témiscamingue Quebec ... Lemire Bloc Québécois
3 Acadie—Annapolis Nova Scotia ... d'Entremont Conservative
4 Acadie—Bathurst New Brunswick ... Cormier Liberal
.. ... ... ... ... ...
338 York Centre Ontario ... Baber Conservative
339 York South—Weston—Etobicoke Ontario ... Hussen Liberal
340 York—Durham Ontario ... Mantle Conservative
341 Yorkton—Melville Saskatchewan ... Wagantall Conservative
342 Yukon Yukon ... Hanley Liberal
[343 rows x 6 columns]})]
And there are indeed 343 data rows in that CSV file.
Let’s get to using that agent with our chatbot model.
Use Agent Executor
Okay time to query the agent executor to see what we get. I started with a variety of errors being reported, especially when using Mistral AI. I eventually modified the call to create_csv_agent
. And added try/except
loop when invoking the agent executor.
csv_exec = create_csv_agent(
model,
"rek/constituencies.csv",
verbose=True,
allow_dangerous_code=True,
agent_type="openai-tools",
)
... ...
try:
resp = csv_exec.invoke({"input": "How many Liberals were elected in British Columbia?"})
print(resp)
except Exception as e:
print(f"Error: {e}")
Using Mistral AI
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("How many Liberals were elected in British Columbia?")
> Entering new AgentExecutor chain...
To determine the number of Liberals elected in British Columbia, you can filter the dataframe `df` based on the "Province / Territory and "Political Affiliation" columns.
Here's how you can do it:
# Filter the dataframe for Liberals in British Columbia
liberals_bc = df[(df['Province / Territory'] == 'British Columbia') & (df['Political Affiliation'] == 'Liberal')]
# Count the number of rows in the filtered dataframe
num_liberals_bc = liberals_bc.shape[0]
# Print the result
print(f'Number of Liberals elected in British Columbia: {num_liberals_bc}')
This code will give you the number of Liberals elected in British Columbia.
> Finished chain.
{'input': 'How many Liberals were elected in British Columbia?', 'output': 'To determine the number of Liberals elected in British Columbia, you can filter the dataframe `df` based on the "Province / Territory" and "Political Affiliation" columns. Here\'s how you can do it:\n\n```python\n# Filter the dataframe for Liberals in British Columbia\nliberals_bc = df[(df[\'Province / Territory\'] == \'British Columbia\') & (df[\'Political Affiliation\'] == \'Liberal\')]\n\n# Count the number of rows in the filtered dataframe\nnum_liberals_bc = liberals_bc.shape[0]\n\n# Print the result\nprint(f\'Number of Liberals elected in British Columbia: {num_liberals_bc}\')\n```\n\nThis code will give you the number of Liberals elected in British Columbia.'}
Not exactly the answer I was looking for. Since we instantiated the agent in verbose mode, it prints out an history of its process, before outputting the final result. I decided against taking the final output and making it prettier.
With OpenAI
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("How many Liberals were elected in British Columbia?")
> Entering new AgentExecutor chain...
Invoking: `python_repl_ast` with `{'query': "df[(df['Province / Territory'] == 'British Columbia') & (df['Political Affiliation'] == 'Liberal')].shape[0]"}`
20In British Columbia, 20 Liberals were elected.
> Finished chain.
{'input': 'How many Liberals were elected in British Columbia?', 'output': 'In British Columbia, 20 Liberals were elected.'}
Definitely a better answer.
Mistral AI Again
I decided to change the input prompt slightly. And!!
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("Tell me how many Liberals were elected in British Columbia.")
> Entering new AgentExecutor chain...
Invoking: `python_repl_ast` with `{'query': 'df[(df["Political Affiliation"] == "Liberal") & (df["Province / Territory"] == "British Columbia")].shape[0]'}`
20There were 20 Liberals elected in British Columbia.
> Finished chain.
{'input': 'Tell me how many Liberals were elected in British Columbia.', 'output': 'There were 20 Liberals elected in British Columbia.'}
I wanted to see if I could get it to draw a bar chart. But, don’t think that could be displayed in the terminal window. Would likely need to be using a Jupyter notebook for that to work. Though I never confirmed that. So, instead let’s see if we can get it to give us a table as output.
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("Provide a table showing how many Liberals were elected by 'Province / Territory'.s")
> Entering new AgentExecutor chain...
Invoking: `python_repl_ast` with `{'query': "df[df['Political Affiliation'] == 'Liberal'].groupby('Province / Territory').size()"}`
Province / Territory
Alberta 2
British Columbia 20
Manitoba 6
New Brunswick 6
Newfoundland and Labrador 5
Northwest Territories 1
Nova Scotia 10
Ontario 70
Prince Edward Island 4
Quebec 44
Saskatchewan 1
Yukon 1
dtype: int64
| Province / Territory | Number of Liberals Elected |
|:----------------------------|-----------------------------:|
| Alberta | 2 |
| British Columbia | 20 |
| Manitoba | 6 |
| New Brunswick | 6 |
| Newfoundland and Labrador | 5 |
| Northwest Territories | 1 |
| Nova Scotia | 10 |
| Ontario | 70 |
| Prince Edward Island | 4 |
| Quebec | 44 |
| Saskatchewan | 1 |
| Yukon | 1 |
> Finished chain.
{'input': "Provide a table showing how many Liberals were elected by 'Province / Territory'.", 'output': '| Province / Territory | Number of Liberals Elected |\n|:----------------------------|-----------------------------:|\n| Alberta | 2 |\n| British Columbia | 20 |\n| Manitoba | 6 |\n| New Brunswick | 6 |\n| Newfoundland and Labrador | 5 |\n| Northwest Territories | 1 |\n| Nova Scotia | 10 |\n| Ontario | 70 |\n| Prince Edward Island | 4 |\n| Quebec | 44 |\n| Saskatchewan | 1 |\n| Yukon
All that stuff above the row ‘> Finished chain’ is the content of the agent’s scratchpad as it works at generating the final response. The first list, above ‘dtype: int64’ is in fact a printout of the Pandas dataframe that was generated by the query displayed above it. The second list is the table it generated, which is also returned in the final/finished output.
Let’s try that without verbose mode, and just print the value of the ‘output’ key.
csv_exec = create_csv_agent(
model,
"rek/constituencies.csv",
allow_dangerous_code=True,
agent_type="openai-tools",
)
try:
resp = csv_exec.invoke({"input": "Provide a table showing how many Liberals were elected by 'Province / Territory'."})
print(resp['output'])
except Exception as e:
print(f"Error: {e}")
And…
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
| Province / Territory | Liberal |
|:----------------------------|----------:|
| Alberta | 2 |
| British Columbia | 20 |
| Manitoba | 6 |
| New Brunswick | 6 |
| Newfoundland and Labrador | 5 |
| Northwest Territories | 1 |
| Nova Scotia | 10 |
| Ontario | 70 |
| Prince Edward Island | 4 |
| Quebec | 44 |
| Saskatchewan | 1 |
| Yukon | 1 |
Not Really Working Well
I don’t think this is the best way to go about accessing the CSV files. But really can’t find a way to do it differently without just using Pandas dataframes. Also, if I had used document/text files on my PC, I would have had to go through a number of additional steps.
Done
With local documents we would need to chunk and embed the files’ contents; adding that data/information to a vector store. And, as in the previous post we would need to create tools and bind them to the LLM model. Then create a LangGraph workflow with conditional nodes to make use of the tools. Finally, I would need to embed the input string and use that to search for appropriate chunks in the vector store before supplying everything to the chatbot.
So, I think I will call this one quits. And in the next post look at using the stored document approach. Not sure what that document might be at this point; but, that’s for another day.
My apologies for the poor quality post. But, sadly, life is full of trials and tribulations.
Resources
- LangChain Python API Reference | how-to guides | How to do question answering over CSVs
- LangChain | API Reference | langchain-experimental
- LangChain | API Reference | langchain-experimental | agents | create_csv_agent
- conda-forge / packages / langchain-experimental
We use [Linguist](https://github.com/github-linguist/linguist) to perform language detection and to select third-party grammars for syntax highlighting. You can find out which keywords are valid in the languages YAML file.
- Github | Creating and highlighting code blocks
- Building a CSV Assistant with LangChain
- Build a Chatbot on Your CSV Data With LangChain and OpenAI (paywall ?)
- How to query CSV and Excel files using LangChain
- Chat with your CSV: Visualize Your Data with Langchain and Streamlit
Addendum
In case you are interested, my environment.yml
file now looks like this.
name: agnt-3.12
channels:
- conda-forge
dependencies:
- beautifulsoup4>=4.13.4
- langchain>=0.3.24
- langchain-community>=0.3.23
- langchain-core>=0.3.56
- langchain-experimental>=0.3.4
- langchain-mistralai>=0.2.10
- langchain-openai>=0.3.15
- langchain-text-splitters>=0.3.8
- langgraph>=0.4.1
- pandas>=2.2.3
- python>=3.12.10
- python-dotenv>=1.1.0
- scikit-learn>=1.6.1
- tabulate>=0.9.0
- pip>=25.1.1
- pip:
- tavily-python>=0.7.2