Six Years as a scikit-learn Maintainer - Feature Retrospective

May 15, 2025
python scikit-learn

It's hard to believe that it's been over six years since I joined the scikit-learn team as a maintainer. As of today, I have authored 1,374 commits and reviewed 3,179 pull requests. Beyond these numbers, I am grateful for all the thoughtful discussions I have had with the community to push scikit-learn forward. Looking back through my commits, I would like to showcase some of my favorite features that I worked on:

1. Everything Trees 🌲🌲🌲

  • Native categorical support in Histogram-based Gradient Boosting Trees. (gh-26411, gh-18394)
  • Native missing value support in Random Forest & Trees. (gh-23595, gh-26391)
  • Cost complexity pruning in Trees. (gh-12887)

2. DataFrame interoperability 🖼️

  • Pandas and Polars DataFrame output with the set_output API. (gh-27315)
  • get_feature_names_out: Mapping input feature names to output feature names. (gh-18444)

3. Preprocessing 🕰️

  • TargetEncoder: Use the target to encode categorical data. (gh-25334)
  • Group infrequent categories in OrdinalEncoder and OneHotEncoder. (gh-25677)
  • KNN-based missing value imputation. (gh-12852)

4. Visualizations 📊

  • HTML Representation to visualize estimators in Jupyter notebooks. (gh-14180)
  • Plotting API for evaluating or inspecting estimators. (gh-14357)
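
The HTML representation is what Jupyter renders when an estimator is the last expression in a cell. Outside a notebook, the same markup can be produced as a string; the sketch below assumes `estimator_html_repr` is importable from `sklearn.utils`, and the pipeline is an arbitrary example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# An arbitrary two-step pipeline used only for illustration.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Build the interactive diagram as an HTML string, e.g. to save to a file.
html = estimator_html_repr(pipe)
print(len(html))
```

The plotting API follows a similar pattern of one display object per plot, for example `RocCurveDisplay.from_estimator`, and needs matplotlib installed.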

5. Experimental GPU support 🏎️

  • Integrate Array API to run natively with PyTorch or CuPy arrays on a GPU. (gh-22554)

I hope you found some of these features useful or discovered some of them here 😁.
