Okay, let’s look at using private or proprietary data stored in documents of one sort or another. Not too sure how I am going to go about this as I don’t really have any such data. I will likely just use a web page that was published after the LLM’s training was completed. Perhaps something related to Mark Carney’s cabinet appointments on May 13, 2025. Don’t expect that to be in the LLM’s training data.

And, yes we could just continue to use a web search; but, that’s not the point of this current project. We want to look at having an agent use another set of tools and/or toolkits to access local, and likely private, documents.

CSV File

I have decided to use a CSV file or two as the RAG data source. They are related to the 2025 Canadian Federal Election. I downloaded them from Members of Parliament . Specifically, the Election Candidate and Constituencies datasets. I may also continue to use a web page as well. If so, likely the Wikipedia page on the election. Though not currently sure about the latter.

Not that I have found a lot of examples of how to code this form of RAG. But, that’s what learning is all about.

New Library or Two

To use the CSV files as agent sources, I plan to use create_csv_agent. Which means I need to install the langchain-experimental library/package. And, since a Dataframe is used under the hood by create_csv_agent, that will also require that Pandas is installed. For some reason, miniconda did not find Pandas to be a dependency of langchain-experimental.

agnt-3.12) PS R:\learn\ds_agent> conda install -c conda-forge langchain-experimental pandas
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done
... ...
The following NEW packages will be INSTALLED:

  langchain-experim~ conda-forge/noarch::langchain-experimental-0.3.4-pyhd8ed1ab_1
  pandas             conda-forge/win-64::pandas-2.2.3-py312h72972c8_3
  python-dateutil    conda-forge/noarch::python-dateutil-2.9.0.post0-pyhff2d567_1
  python-tzdata      conda-forge/noarch::python-tzdata-2025.2-pyhd8ed1ab_0
  pytz               conda-forge/noarch::pytz-2025.2-pyhd8ed1ab_0
  six                conda-forge/noarch::six-1.17.0-pyhd8ed1ab_0
... ...
Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Some Test Code

Okay, a wee test, let’s see if I can get create_csv_agent to work. Add a bit of code to previous chatbot module.

... ...
from langchain_experimental.agents.agent_toolkits import create_csv_agent
... ...
model = ChatMistralAI(model="mistral-large-latest")
csv_exec = create_csv_agent(
  model,
  "rek/constituencies.csv",
  verbose=True,
)

print(csv_exec)

And, that did not work; another traceback in the terminal. The key part of which was:

ValueError: This agent relies on access to a python repl tool which can execute arbitrary code. This can be dangerous and requires a specially sandboxed environment to be safely used. Please read the security notice in the doc-string of this function. You must opt-in to use this functionality by setting allow_dangerous_code=True.For general security guidelines, please see: https://python.langchain.com/docs/security/

So add that parameter to the call to create_csv_agent.

csv_exec = create_csv_agent(
  model,
  "rek/constituencies.csv",
  allow_dangerous_code=True,
  verbose=True,
)

Then another problem!

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
Traceback (most recent call last):
  File "E:\appDev\Miniconda3\envs\agnt-3.12\Lib\site-packages\pandas\compat\_optional.py", line 135, in import_optional_dependency
    module = importlib.import_module(name)
... ...
ImportError: Missing optional dependency 'tabulate'.  Use pip or conda to install tabulate.

So, I installed tabulate.

(agnt-3.12) PS R:\learn\ds_agent> conda install conda-forge::tabulate
Collecting package metadata (current_repodata.json): done
Solving environment: done
... ...
The following NEW packages will be INSTALLED:

  tabulate           conda-forge/noarch::tabulate-0.9.0-pyhd8ed1ab_2
... ...
Executing transaction: done

And finally, got the following output in the terminal.

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
verbose=True agent=RunnableAgent(runnable=RunnableAssign(mapper={
  agent_scratchpad: RunnableLambda(lambda x: format_log_to_str(x['intermediate_steps']))
})
| PromptTemplate(input_variables=['agent_scratchpad', 'input'], input_types={}, partial_variables={'df_head': "|    | Constituency                     | Province / Territory   | Honorific Title   | First Name   | Last Name   | Political Affiliation   |\n|---:|:---------------------------------|:-----------------------|:------------------|:-------------|:------------|:------------------------|\n|  0 | Abbotsford—South Langley         | British Columbia       | nan               | Sukhman      | Gill        | Conservative            |\n|  1 | Abitibi—Baie-James—Nunavik—Eeyou | Quebec                 | Hon.              | Mandy        | Gull-Masty  | Liberal                 |\n|  2 | Abitibi—Témiscamingue            | Quebec                 | nan               | Sébastien    | Lemire      | Bloc Québécois          |\n|  3 | Acadie—Annapolis                 | Nova Scotia            | nan               | Chris        | d'Entremont | Conservative            |\n|  4 | Acadie—Bathurst                  | New Brunswick          | nan               | Serge        | Cormier     | Liberal                 |", 'tools': 'python_repl_ast - A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated - make sure it does not look abbreviated before using it in your answer.', 'tool_names': 'python_repl_ast'}, template='\nYou are working with a pandas dataframe in Python. The name of the dataframe is `df`.\nYou should use the tools below to answer the question posed of you:\n\n{tools}\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [{tool_names}]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\n\nThis is the result of `print(df.head())`:\n{df_head}\n\nBegin!\nQuestion: {input}\n{agent_scratchpad}')
| RunnableBinding(bound=ChatMistralAI(client=<httpx.Client object at 0x000002737D2AF800>, async_client=<httpx.AsyncClient object at 0x000002737D2AF890>, mistral_api_key=SecretStr('**********'), endpoint='https://api.mistral.ai/v1', model='mistral-large-latest', model_kwargs={}), kwargs={'stop': ['\nObservation']}, config={}, config_factories=[])
| ReActSingleInputOutputParser(), input_keys_arg=['input'], return_keys_arg=['output'], stream_runnable=True) tools=[PythonAstREPLTool(globals={}, locals={'df':                          Constituency Province / Territory  ...    Last Name Political Affiliation
0            Abbotsford—South Langley     British Columbia  ...         Gill          Conservative
1    Abitibi—Baie-James—Nunavik—Eeyou               Quebec  ...   Gull-Masty               Liberal
2               Abitibi—Témiscamingue               Quebec  ...       Lemire        Bloc Québécois
3                    Acadie—Annapolis          Nova Scotia  ...  d'Entremont          Conservative
4                     Acadie—Bathurst        New Brunswick  ...      Cormier               Liberal
..                                ...                  ...  ...          ...                   ...
338                       York Centre              Ontario  ...        Baber          Conservative
339       York South—Weston—Etobicoke              Ontario  ...       Hussen               Liberal
340                       York—Durham              Ontario  ...       Mantle          Conservative
341                  Yorkton—Melville         Saskatchewan  ...    Wagantall          Conservative
342                             Yukon                Yukon  ...       Hanley               Liberal

[343 rows x 6 columns]})]

And there are indeed 343 data rows in that CSV file.

Let’s get to using that agent with our chatbot model.

Use Agent Executor

Okay time to query the agent executor to see what we get. I started with a variety of errors being reported, especially when using Mistral AI. I eventually modified the call to create_csv_agent. And added try/except loop when invoking the agent executor.

csv_exec = create_csv_agent(
  model,
  "rek/constituencies.csv",
  verbose=True,
  allow_dangerous_code=True,
  agent_type="openai-tools",
)
... ...
try:
  resp = csv_exec.invoke({"input": "How many Liberals were elected in British Columbia?"})
  print(resp)
except Exception as e:
  print(f"Error: {e}")

Using Mistral AI

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("How many Liberals were elected in British Columbia?")

> Entering new AgentExecutor chain...
To determine the number of Liberals elected in British Columbia, you can filter the dataframe `df` based on the "Province / Territory and "Political Affiliation" columns.

Here's how you can do it:

# Filter the dataframe for Liberals in British Columbia
liberals_bc = df[(df['Province / Territory'] == 'British Columbia') & (df['Political Affiliation'] == 'Liberal')]

# Count the number of rows in the filtered dataframe
num_liberals_bc = liberals_bc.shape[0]

# Print the result
print(f'Number of Liberals elected in British Columbia: {num_liberals_bc}')

This code will give you the number of Liberals elected in British Columbia.

> Finished chain.
{'input': 'How many Liberals were elected in British Columbia?', 'output': 'To determine the number of Liberals elected in British Columbia, you can filter the dataframe `df` based on the "Province / Territory" and "Political Affiliation" columns. Here\'s how you can do it:\n\n```python\n# Filter the dataframe for Liberals in British Columbia\nliberals_bc = df[(df[\'Province / Territory\'] == \'British Columbia\') & (df[\'Political Affiliation\'] == \'Liberal\')]\n\n# Count the number of rows in the filtered dataframe\nnum_liberals_bc = liberals_bc.shape[0]\n\n# Print the result\nprint(f\'Number of Liberals elected in British Columbia: {num_liberals_bc}\')\n```\n\nThis code will give you the number of Liberals elected in British Columbia.'}

Not exactly the answer I was looking for. Since we instantiated the agent in verbose mode, it prints out an history of its process, before outputting the final result. I decided against taking the final output and making it prettier.

With OpenAI

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("How many Liberals were elected in British Columbia?")

> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[(df['Province / Territory'] == 'British Columbia') & (df['Political Affiliation'] == 'Liberal')].shape[0]"}`

20In British Columbia, 20 Liberals were elected.

> Finished chain.
{'input': 'How many Liberals were elected in British Columbia?', 'output': 'In British Columbia, 20 Liberals were elected.'}

Definitely a better answer.

Mistral AI Again

I decided to change the input prompt slightly. And!!

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("Tell me how many Liberals were elected in British Columbia.")

> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': 'df[(df["Political Affiliation"] == "Liberal") & (df["Province / Territory"] == "British Columbia")].shape[0]'}`

20There were 20 Liberals elected in British Columbia.

> Finished chain.
{'input': 'Tell me how many Liberals were elected in British Columbia.', 'output': 'There were 20 Liberals elected in British Columbia.'}

I wanted to see if I could get it to draw a bar chart. But, don’t think that could be displayed in the terminal window. Would likely need to be using a Jupyter notebook for that to work. Though I never confirmed that. So, instead let’s see if we can get it to give us a table as output.

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
resp = csv_exec("Provide a table showing how many Liberals were elected by 'Province / Territory'.s")

> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[df['Political Affiliation'] == 'Liberal'].groupby('Province / Territory').size()"}`

Province / Territory
Alberta                       2
British Columbia             20
Manitoba                      6
New Brunswick                 6
Newfoundland and Labrador     5
Northwest Territories         1
Nova Scotia                  10
Ontario                      70
Prince Edward Island          4
Quebec                       44
Saskatchewan                  1
Yukon                         1
dtype: int64
| Province / Territory     |   Number of Liberals Elected |
|:----------------------------|-----------------------------:|
| Alberta                     |                            2 |
| British Columbia           |                           20 |
| Manitoba                    |                            6 |
| New Brunswick               |                            6 |
| Newfoundland and Labrador   |                            5 |
| Northwest Territories       |                            1 |
| Nova Scotia                 |                           10 |
| Ontario                     |                           70 |
| Prince Edward Island        |                            4 |
| Quebec                      |                           44 |
| Saskatchewan                |                            1 |
| Yukon                       |                            1 |

> Finished chain.
{'input': "Provide a table showing how many Liberals were elected by 'Province / Territory'.", 'output': '| Province / Territory     |   Number of Liberals Elected |\n|:----------------------------|-----------------------------:|\n| Alberta                     |                            2 |\n| British Columbia           |                           20 |\n| Manitoba                    |                            6 |\n| New Brunswick               |                            6 |\n| Newfoundland and Labrador   |                            5 |\n| Northwest Territories       |                            1 |\n| Nova Scotia                 |                           10 |\n| Ontario                     |                           70 |\n| Prince Edward Island        |                            4 |\n| Quebec                      |                           44 |\n| Saskatchewan                |                            1 |\n| Yukon                    

All that stuff above the row ‘> Finished chain’ is the content of the agent’s scratchpad as it works at generating the final response. The first list, above ‘dtype: int64’ is in fact a printout of the Pandas dataframe that was generated by the query displayed above it. The second list is the table it generated, which is also returned in the final/finished output.

Let’s try that without verbose mode, and just print the value of the ‘output’ key.

csv_exec = create_csv_agent(
  model,
  "rek/constituencies.csv",
  allow_dangerous_code=True,
  agent_type="openai-tools",
)

try:
  resp = csv_exec.invoke({"input": "Provide a table showing how many Liberals were elected by 'Province / Territory'."})
  print(resp['output'])
except Exception as e:
  print(f"Error: {e}")

And…

(agnt-3.12) PS R:\learn\ds_agent> python chat_bot_2.py
| Province / Territory      |   Liberal |
|:----------------------------|----------:|
| Alberta                     |         2 |
| British Columbia           |        20 |
| Manitoba                    |         6 |
| New Brunswick              |         6 |
| Newfoundland and Labrador   |         5 |
| Northwest Territories       |         1 |
| Nova Scotia                 |        10 |
| Ontario                     |        70 |
| Prince Edward Island        |         4 |
| Quebec                      |        44 |
| Saskatchewan                |         1 |
| Yukon                       |         1 |

Not Really Working Well

I don’t think this is the best way to go about accessing the CSV files. But really can’t find a way to do it differently without just using Pandas dataframes. Also, if I had used document/text files on my PC, I would have had to go through a number of additional steps.

Done

With local documents we would need to chunk and embed the files’ contents; adding that data/information to a vector store. And, as in the previous post we would need to create tools and bind them to the LLM model. Then create a LangGraph workflow with conditional nodes to make use of the tools. Finally, I would need to embed the input string and use that to search for appropriate chunks in the vector store before supplying everything to the chatbot.

So, I think I will call this one quits. And in the next post look at using the stored document approach. Not sure what that document might be at this point; but, that’s for another day.

My apologies for the poor quality post. But, sadly, life is full of trials and tribulations.

Resources

We use [Linguist](https://github.com/github-linguist/linguist) to perform language detection and to select third-party grammars for syntax highlighting. You can find out which keywords are valid in the languages YAML file.

Addendum

In case you are interested, my environment.yml file now looks like this.

name: agnt-3.12
channels:
  - conda-forge
dependencies:
  - beautifulsoup4>=4.13.4
  - langchain>=0.3.24
  - langchain-community>=0.3.23
  - langchain-core>=0.3.56
  - langchain-experimental>=0.3.4
  - langchain-mistralai>=0.2.10
  - langchain-openai>=0.3.15
  - langchain-text-splitters>=0.3.8
  - langgraph>=0.4.1
  - pandas>=2.2.3
  - python>=3.12.10
  - python-dotenv>=1.1.0
  - scikit-learn>=1.6.1
  - tabulate>=0.9.0
  - pip>=25.1.1
  - pip:
    - tavily-python>=0.7.2