Debug Step for Polars Pipelines
In my work I often write long pipelines to process polars dataframes. Sometimes there’s a bug and I’m not sure where it’s happening, or I don’t have an easy way to find it.
Often, I would break up the pipeline and put in an IPython embed() statement. Using embed is really nice because it gives you an interactive shell in the middle of your program, which you can use to see what your dataframe looks like and play around with it. You can filter to check whether certain rows or conditions exist, and look out for your edge cases. The downside is that I have to break up my pipeline if I want to use the embed() function. Here’s an example:
import polars as pl
from IPython import embed

# First part of process I want to debug
d = (
    pl.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-01/weekly_gas_prices.csv')
    .with_columns(
        pl.col("date").str.to_date(format='%Y-%m-%d').alias('dt')
    )
    .sort(by='dt', descending=False)
    .with_columns(
        (pl.col("price").shift(1).over(['fuel','grade','formulation'])).alias("prior_week_price")
    )
)
print(">>EMBED After shift")
embed()

# Continue the rest
d = (
    d
    .with_columns(
        (pl.col("price") - pl.col("prior_week_price")).alias("delta_from_prior_week"),
        (pl.col("price") / pl.col("prior_week_price") - 1).alias("delta_percent_from_prior_week")
    )
)
print(">>EMBED after complete process")
embed()
I stumbled upon this post by Vincent Warmerdam implementing a show method for polars, and it inspired me to switch the print statement out for embed().
import polars as pl
from IPython import embed

def debug_step(d: pl.DataFrame, noteStr: str = None) -> pl.DataFrame:
    # keeps a second reference to the original dataframe in case I accidentally overwrite the d variable in the shell
    _d_original = d
    if noteStr is not None:
        print(noteStr)
    embed()
    return _d_original
Using this I can now run my process as normal without breaking it up, by using the polars .pipe() method to pass the defined debug_step function.
d = (
    pl.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-01/weekly_gas_prices.csv')
    .with_columns(
        pl.col("date").str.to_date(format='%Y-%m-%d').alias('dt')
    )
    .sort(by='dt', descending=False)
    .with_columns(
        (pl.col("price").shift(1).over(['fuel','grade','formulation'])).alias("prior_week_price")
    )
    .pipe(debug_step, noteStr=">>EMBED after shift")
    .with_columns(
        (pl.col("price") - pl.col("prior_week_price")).alias("delta_from_prior_week"),
        (pl.col("price") / pl.col("prior_week_price") - 1).alias("delta_percent_from_prior_week")
    )
    .pipe(debug_step, noteStr=">>EMBED column creation")
)
This has been a nice trick where I can plug the debug_step function into my normal pipeline, run my script, and do some interactive debugging in the terminal. I find it’s also easy to turn on or off by commenting out the debug step.
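
If commenting lines in and out gets tedious, one possible variation is to give the step a switch. The sketch below is only an illustration, not part of the original function: the DEBUG_STEPS environment variable and the enabled parameter are names I made up, and the function is typed generically so it hands back whatever it receives. IPython is imported lazily, so it is only needed when a step actually fires.

```python
import os
from typing import Optional, TypeVar

T = TypeVar("T")


def debug_step(d: T, noteStr: Optional[str] = None, enabled: Optional[bool] = None) -> T:
    """Optionally open an IPython shell, then hand the input back unchanged."""
    # Hypothetical switch: DEBUG_STEPS=1 in the environment turns every step on,
    # unless the caller passes enabled=True or enabled=False explicitly.
    if enabled is None:
        enabled = os.environ.get("DEBUG_STEPS", "0") == "1"
    if not enabled:
        # No-op: the pipeline continues as if the step were not there.
        return d

    if noteStr is not None:
        print(noteStr)
    # Imported lazily so IPython is only required when a step actually runs.
    from IPython import embed
    embed()
    return d


# Usage inside a pipeline stays the same as before:
#     .pipe(debug_step, noteStr=">>EMBED after shift")
# and a single step can be forced on with:
#     .pipe(debug_step, noteStr=">>EMBED after shift", enabled=True)
```

With this version the .pipe(debug_step, ...) lines can stay in the script; running it with DEBUG_STEPS=1 set in the environment would open the shells, and running it without would skip them.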