Debug Step for Polars Pipelines

Small function to make debugging long polars pipelines easier

Published

July 30, 2025

In my work I often write long pipelines to process polars dataframes. Sometimes there’s a bug and I’m not sure where it’s happening, and I don’t have an easy way to find it.

My usual fix was to break the pipeline and put in an IPython embed() call. Using embed is really nice because it gives you an interactive shell in the middle of your program, which you can use to see what your dataframe looks like and play around with it. You can filter to check whether certain rows or conditions exist and look out for your edge cases. The downside is that I had to break up my pipeline every time I wanted to use the embed() function. Here’s an example:

import polars as pl
from IPython import embed

# First part of process I want to debug
d = (
    pl.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-01/weekly_gas_prices.csv')
    .with_columns(
        pl.col("date").str.to_date(format='%Y-%m-%d').alias('dt')
    )
    .sort(by='dt', descending=False)
    .with_columns(
        (pl.col("price").shift(1).over(['fuel','grade','formulation'])).alias("prior_week_price")
    )
)

print(">>EMBED After shift")
embed()

# Continue the rest 
d = (
    d
    .with_columns(
        (pl.col("price") - pl.col("prior_week_price")).alias("delta_from_prior_week")
        ,
        (pl.col("price") / pl.col("prior_week_price") - 1).alias("delta_percent_from_prior_week")
    )
)

print(">>EMBED after complete process")
embed()

I stumbled upon this post by Vincent Warmerdam implementing a show method for polars, which inspired me to switch the print statement out for embed().

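from typing import Optional

import polars as pl
from IPython import embed

def debug_step(d: pl.DataFrame, noteStr: Optional[str] = None) -> pl.DataFrame:
    """Open an IPython shell mid-pipeline, then pass the dataframe along unchanged."""

    # Keep a reference to the incoming dataframe (not a copy), so the original
    # is what gets returned even if I accidentally reassign d in the shell
    _d_original = d
    if noteStr is not None:
        print(noteStr)

    # Interactive shell; the dataframe is available here as d
    embed()

    return _d_original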

Using this, I can now run my process as normal without breaking it up. The polars .pipe() method passes the dataframe to the debug_step function and carries on with whatever it returns.

d = (
    pl.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-01/weekly_gas_prices.csv')
    .with_columns(
        pl.col("date").str.to_date(format='%Y-%m-%d').alias('dt')
    )
    .sort(by='dt', descending=False)

    .with_columns(
        (pl.col("price").shift(1).over(['fuel','grade','formulation'])).alias("prior_week_price")
    )
    .pipe( debug_step , noteStr=">>EMBED after shift")
    .with_columns(
        (pl.col("price") - pl.col("prior_week_price")).alias("delta_from_prior_week")
        ,
        (pl.col("price") / pl.col("prior_week_price") - 1).alias("delta_percent_from_prior_week")
    )
    .pipe( debug_step , noteStr=">>EMBED column creation")
)

This has been a nice trick: I can plug the debug_step function into my normal pipeline, run my script, and do some interactive debugging in the terminal. I find it’s also easy to turn a debug step off or on by commenting out its .pipe() line.