The performance and reliability of language models have become pivotal factors in the rapidly evolving AI and NLP landscapes. Testing these models thoroughly to ensure they meet user expectations is a critical step in their development.
This is where LangSmith comes into play. LangSmith is a dynamic testing framework that offers a powerful solution to assess the capabilities of language models and AI applications.
In this tutorial, we’ll explore the process of effectively utilizing LangSmith to test and evaluate language models, providing valuable insights into their performance, strengths, and limitations. With LangSmith as your ally, you’ll be equipped to refine and enhance your AI applications, optimizing them for real-world usage scenarios.
To follow along with this tutorial, you’ll need a LangSmith account, a recent version of Python, and API keys for the services used in the examples, namely OpenAI and SerpAPI.
Let’s get started!
LangSmith is an innovative and dynamic testing framework for evaluating language models and AI applications. As a platform, it enables you to build production-grade LLM applications.
In the realm of language model testing, LangSmith emerges as a robust and versatile testing framework. It provides tools that analyze and extract valuable insights from model responses, thereby assisting developers in refining their models for enhanced real-world interactions.
The LangSmith platform builds on top of LangChain. Simply put, LangSmith is for building production, whereas LangChain is for creating prototypes.
LangChain tracing tools are invaluable for investigating and debugging an agent’s execution steps. To gain a comprehensive understanding of chains or agents’ workflows, LangChain offers a tracing tool that enables us to visualize the sequence of calls effectively.
The LangChain tracing tool lets users observe exactly how an agent arrived at its conclusions, fostering a deeper understanding of the model’s inner workings and building confidence in the accuracy of its responses. For instance, when different LLMs are swapped in, you need a way to review the agent’s actions and confirm that each building block is working correctly.
One possible solution for implementing LangChain tracing is the Tracing Server, which offers two options: self-hosted locally via the LangChain server command, or cloud-hosted via a Vercel app.
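As a quick sketch of what tracing looks like in code, recent LangChain versions expose a `tracing_v2_enabled` context manager that scopes tracing to a block of calls. The project name below is hypothetical, and the LangSmith environment variables configured later in this tutorial must already be set for the traces to appear:

```python
from langchain.callbacks import tracing_v2_enabled
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

# Everything invoked inside this block is traced to the named LangSmith project
with tracing_v2_enabled(project_name="tracing-demo"):
    llm.predict("What is 10 raised to the power of 20?")
```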
As we further explore the functionalities of LangSmith, its true potential for testing and enhancing language models and AI applications will become increasingly evident.
Here are some key features of LangSmith as a testing tool: tracing of chain and agent runs, dataset creation from existing runs, built-in and custom evaluators for grading model outputs, and convenient export of test results for further analysis.
It’s time for the tutorial portion of this article. Let’s investigate how to use LangSmith to test and evaluate language models.
To get started setting up the development environment, you’ll need to create a LangSmith account and verify it to join the waitlist.
The next step involves initializing the environment, generating API keys, and ensuring their secure storage.
To create a LangChain component, install the most recent version of LangChain in your environment. We’ll use Python in this tutorial, but LangChain also offers a JavaScript/TypeScript package if that’s your preferred language:
```shell
pip install -U langchain
```
Next, configure your runtime environment by following these three steps:
1. Replace the `your-api-key` placeholder with your generated API key
2. Set `LANGCHAIN_ENDPOINT` to `https://api.smith.langchain.com` and enable `LANGCHAIN_TRACING_V2` by setting it to `true`
3. Set `LANGCHAIN_PROJECT` to define the particular project you will be working on. This ensures that LangChain logs your runs to the intended project; without this setting, the runs will be logged to a default project

By following the above steps, you can automatically create the project, regardless of whether it already exists. You can accomplish this with the shell commands provided below:
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project>  # if not specified, defaults to "default"
```
Next, create a LangSmith client to interact with the API:
```python
from langsmith import Client

client = Client()
```
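The client picks up `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` from the environment variables set above. As an optional sanity check, assuming the `list_projects` helper available in recent versions of the langsmith SDK, you can confirm the connection like this:

```python
from langsmith import Client

client = Client()  # reads LANGCHAIN_ENDPOINT and LANGCHAIN_API_KEY from the environment

# List the projects visible to this API key to confirm the client is authenticated
for project in client.list_projects():
    print(project.name)
```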
If you’re using Python, run the following code to import the required modules:

```python
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools
```
The above code uses the LangChain library to import the modules and classes needed for this project: `ChatOpenAI`, `AgentType`, `initialize_agent`, and `load_tools`.
Next, let’s create an instance of the `ChatOpenAI` class and store it in an `llm` variable. The `temperature` parameter is set to `0`, which makes the generated responses more deterministic, as shown in the code snippet below:

```python
llm = ChatOpenAI(temperature=0)
```
Now, let’s call the `load_tools` function with a list of tool names, such as `serpapi` and `llm-math`, and pass in the `llm` instance as a parameter. This function loads the specific tools required for the task, as shown in the code snippet below:

```python
tools = load_tools(["serpapi", "llm-math"], llm=llm)
```
As a next step, initialize an agent by calling the `initialize_agent` function with several parameters, such as `tools`, `llm`, and `agent`:

```python
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION,  # a ReAct-style chat agent; any valid AgentType works here
    verbose=False,
)
```

The `verbose` parameter is set to `False`, indicating that the agent will not print detailed output as it runs.
Below is the full code used to set up a chat-based AI environment with LangChain. It initializes a chat model, loads the tools, and creates an agent that can use them to generate responses:

```python
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

llm = ChatOpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=False,
)
```
The code below defines a list of example inputs and uses the `asyncio` library to run the agent on each input asynchronously, gathering the results for further processing. It also handles any exceptions that occur during agent execution:

```python
import asyncio

# Define a list of input examples
inputs = [
    "How many people went on vacation to the Maldives in 2023?",
    "Who is the fastest man in the world?",
    "What is Obama's first daughter's age?",
    "What is the distance from Africa to America?",
    "What was the number of games Lionel Messi played for Inter Miami?",
    "What was the total number of males in America?",
    "How many artists performed at the O2 Arena in London in 2023?",
    "What is 10 raised to the power of 20?",
    "Who is the best international news station?",
    "What is the value of 10000 divided by 2?",
]

results = []

async def run_agent(agent, input_example):
    try:
        # arun is the asynchronous counterpart of run
        return await agent.arun(input_example)
    except Exception as e:
        # Return the exception so one failed input doesn't stop the rest
        return e

# Run the agent on every input concurrently and gather the results
async def main():
    tasks = [run_agent(agent, input_example) for input_example in inputs]
    results.extend(await asyncio.gather(*tasks))

if __name__ == "__main__":
    asyncio.run(main())
```
It’s also important to call the `wait_for_all_tracers` function from the `langchain.callbacks.tracers.langchain` module, as shown in the code snippet below:

```python
from langchain.callbacks.tracers.langchain import wait_for_all_tracers

wait_for_all_tracers()
```

Calling `wait_for_all_tracers` ensures that all logs and traces are submitted in full before the program exits. This is particularly useful for maintaining data consistency and completeness in scenarios involving background threads or serverless deployments.
Congratulations on reaching this milestone! Assuming you’ve completed setting up your environment, your agent traces should now appear in the “Projects” section of the app. Now, let’s take the next exciting step and dive into the world of AI application evaluation!
LangSmith allows you to evaluate and test your LLM applications using LangSmith datasets. To demonstrate this, we‘ll evaluate another agent by creating a LangSmith dataset and configuring evaluators to grade the agent’s output.

These datasets can be categorized as `kv`, `llm`, or `chat`. A dataset is a collection of examples with input and output pairs that can be used to evaluate or test an agent or model. To gain a deeper understanding of evaluating a LangSmith dataset, let’s create the dataset, initialize new agents, and customize and configure the evaluation output.
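Before we build our dataset from existing runs, here is a minimal sketch of creating a dataset from scratch and specifying its type explicitly. The `DataType` enum and the `data_type` parameter reflect recent versions of the langsmith SDK and should be treated as assumptions to verify against the documentation; the dataset name and example values are hypothetical:

```python
from langsmith import Client
from langsmith.schemas import DataType  # kv, llm, and chat dataset types

client = Client()

# Create an empty key-value (kv) dataset; DataType.llm and DataType.chat work the same way
dataset = client.create_dataset(
    "scratch-calculator-examples",  # hypothetical dataset name
    description="Question/answer pairs for a math agent",
    data_type=DataType.kv,
)

# Add a single example consisting of an input and its expected output
client.create_example(
    inputs={"input": "What is 10 raised to the power of 20?"},
    outputs={"output": "1e+20"},
    dataset_id=dataset.id,
)
```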
To create our dataset, we’ll collect examples from the existing runs we created earlier. To do this, click on the Add to Dataset button at the top right corner to make a selection from the existing runs.
The objective is to use these datasets to measure the performance of the agent:
```python
import os
import uuid

# A short unique suffix keeps dataset names from colliding between runs
unique_id = uuid.uuid4().hex[:8]

dataset_name = f"calculator-example-dataset-{unique_id}"
dataset = client.create_dataset(
    dataset_name, description="A calculator dataset"
)

runs = client.list_runs(
    project_name=os.environ["LANGCHAIN_PROJECT"],
    execution_order=1,
    error=False,
)

for run in runs:
    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)
```
Next, we’ll evaluate our LLM by initializing a new agent to benchmark. To do this, we’ll pass in a constructor (commonly called a `chain_factory` or agent factory) that initializes a fresh agent for each call.
Our evaluation will focus on an agent utilizing OpenAI’s function calling endpoints, as shown in the code snippet below:

```python
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)

def agent_factory():
    return initialize_agent(
        tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=False
    )
```
Automated metrics and AI-guided feedback can be real game changers when it comes to assessing how well your component is doing.
LangSmith ships with several pre-implemented run evaluators that handle common grading tasks for you. Here’s some sample code that demonstrates how to configure pre-implemented evaluators that take a dataset example’s inputs and reference outputs and grade the agent’s responses:
```python
from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    evaluators=[
        EvaluatorType.QA,
        EvaluatorType.EMBEDDING_DISTANCE,
        RunEvalConfig.LabeledCriteria("helpfulness"),
        RunEvalConfig.Criteria(
            {
                "fifth-grader-score": "Do you have to be smarter than a fifth grader to answer this question?"
            }
        ),
    ],
    custom_evaluators=[],
)
```
N.B., for an in-depth exploration of how to choose the best evaluator for your specific needs and how to craft custom evaluators, check out the LangSmith documentation.
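As an illustration of what a custom evaluator might look like, here is a minimal sketch that scores whether the agent returned a non-empty answer. The `RunEvaluator` base class and `EvaluationResult` type come from the langsmith SDK; the evaluator name and the `"output"` key are assumptions that may need adjusting for your runs:

```python
from langsmith.evaluation import EvaluationResult, RunEvaluator

class NonEmptyAnswerEvaluator(RunEvaluator):
    """Scores 1 if the run produced a non-empty output, and 0 otherwise."""

    def evaluate_run(self, run, example=None) -> EvaluationResult:
        # Agent runs in this tutorial store their answer under the "output" key (assumed)
        output = (run.outputs or {}).get("output", "")
        return EvaluationResult(key="non_empty_answer", score=int(bool(str(output).strip())))
```

An instance of this class could then be passed to `RunEvalConfig` through the `custom_evaluators` list shown above.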
Now that you’ve successfully crafted and fine-tuned your custom evaluator to perfectly align with your specific requirements, it’s time to put your model to the test!
You can use the `arun_on_dataset` function (or the synchronous `run_on_dataset`, if you prefer) to make it happen. This will:

- Fetch the examples from the dataset you created
- Put your `llm` or chain to work on each of those examples
- Apply the configured evaluators to grade each output

The below code demonstrates how to execute the agent and evaluators:
```python
from langchain.smith import (
    arun_on_dataset,
    run_on_dataset,
)

chain_results = await arun_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=agent_factory,
    evaluation=evaluation_config,
    verbose=True,
    tags=["testing-notebook"],
)
```
Guess what? You’ll get to witness this exciting action and its results directly within the LangSmith app, where you can access them in the test results tracing UI.
LangSmith offers convenient data export options within the web application, allowing you to export data in popular formats like CSV or JSONL. Additionally, you can utilize the client to retrieve runs for in-depth analysis, store them in your preferred database, or easily share them with colleagues and collaborators.
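For example, here is a minimal sketch that pulls the top-level runs for the current project with the client and writes them to a JSONL file for offline analysis; the field names on the run objects reflect the langsmith SDK, and the output filename is arbitrary:

```python
import json
import os

from langsmith import Client

client = Client()

# Fetch the top-level runs from the current project
runs = client.list_runs(
    project_name=os.environ["LANGCHAIN_PROJECT"],
    execution_order=1,
)

# Dump each run as one JSON object per line
with open("runs_export.jsonl", "w") as f:
    for run in runs:
        record = {
            "id": str(run.id),
            "name": run.name,
            "inputs": run.inputs,
            "outputs": run.outputs,
            "error": run.error,
        }
        f.write(json.dumps(record, default=str) + "\n")
```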
AI applications will play a pivotal role in automating business operations and elevating project execution efficiency. As they continue to gain traction, their adoption is expected to surge.
Platforms like LangSmith and its counterparts have the potential to introduce a multitude of enhancements to the way AI applications are built, tested, and monitored.
In this article, we highlighted essential aspects of using LangSmith for LLM and AI application evaluation. We introduced LangSmith, then did a deep dive into setting up the development environment and successfully evaluating and testing an AI application.
Throughout this article, we emphasized LangSmith’s pivotal role in effective testing, underlining its significance in ensuring reliable AI models. We used Python as our primary programming language to achieve this. You can explore LangSmith for more comprehensive model evaluation and other user-friendly features.
I hope you’ve found this article useful. Please leave a comment if you have any questions or concerns. Happy coding!