Version: 25.1

snorkelflow.operators.pandas_featurizer

class snorkelflow.operators.pandas_featurizer(*, input_schema, name=None, resources=None, resources_fn=None, output_schema=None, is_join_featurizer=False)

Bases: object

Decorator for adding columns to a dataframe.

It can be used to create a Featurizer, namely an operator that only adds columns (features) to a dataframe and does not add or delete any rows.

A pandas_featurizer must include an output schema, which should contain all the new columns added by the wrapped function. If it has an input schema, it should include all the columns needed by the function.
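
The contract described above can be checked locally with plain pandas before a function is wrapped. The sketch below is only an illustration: `check_featurizer_contract` is a hypothetical helper, not part of snorkelflow, and it assumes the schemas are dictionaries keyed by column name, as in the signature above.

```python
import pandas as pd


def check_featurizer_contract(f, df, input_schema, output_schema):
    """Hypothetical helper (not a snorkelflow API): check f against its schemas."""
    # Every column the function needs must be present in the input.
    missing = set(input_schema) - set(df.columns)
    if missing:
        raise ValueError(f"input is missing columns: {sorted(missing)}")

    out = f(df.copy())

    # A featurizer must not add or delete rows.
    if len(out) != len(df):
        raise ValueError("featurizer changed the number of rows")

    # The new columns must be exactly those named in the output schema.
    added = set(out.columns) - set(df.columns)
    if added != set(output_schema):
        raise ValueError(
            f"expected new columns {sorted(output_schema)}, got {sorted(added)}"
        )
    return out


def add_one(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(mycol2=df.mycol + 1)


df = pd.DataFrame({"mycol": [1, 2, 3]})
result = check_featurizer_contract(add_one, df, {"mycol": int}, {"mycol2": int})
```

A function that filters rows or adds an undeclared column would raise a `ValueError` here, which is the kind of mistake the schema requirements are meant to rule out.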

While @pandas_featurizer allows you to define custom operators using simple Pandas syntax, they are automatically executed and parallelized using Dask under the hood.
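
Because execution is parallelized, it is safest to write the wrapped function so that it gives the same result whether it sees the whole dataframe or only a slice of it. The sketch below imitates that with plain pandas by splitting a dataframe into two chunks. The chunking is an assumption made for illustration; it is not how snorkelflow or Dask is actually invoked.

```python
import pandas as pd


def add_one(df: pd.DataFrame) -> pd.DataFrame:
    # Row-wise logic: each output value depends only on its own row.
    return df.assign(mycol2=df.mycol + 1)


df = pd.DataFrame({"mycol": [1, 2, 3, 4, 5, 6]})

# Apply the function to the full dataframe at once.
whole = add_one(df)

# Apply it to two slices separately, then stitch the pieces back together.
chunks = [df.iloc[:3], df.iloc[3:]]
pieced = pd.concat([add_one(chunk) for chunk in chunks])

# For row-wise logic both routes give identical results.
same = whole.equals(pieced)
```

A function that relies on a dataframe-wide statistic, such as subtracting `df.mycol.mean()`, would not pass this comparison, because each chunk would compute its own mean.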

Examples

In the following example, a function that adds one to an integer column to produce another column is defined and wrapped with a @pandas_featurizer.

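import pandas as pd
from snorkelflow.operators import pandas_featurizer

@pandas_featurizer(name="Add 1", input_schema={"mycol": int}, output_schema={"mycol2": int})
def add_one(df: pd.DataFrame) -> pd.DataFrame:
    # Add a new column without touching existing rows or columns.
    return df.assign(mycol2=df.mycol + 1)

# `sf` is not defined in this snippet; it is assumed to be an existing object that exposes add_operator.
sf.add_operator(add_one)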

Parameters

Name	Type	Default	Info
name	`Optional[str]`	`None`	Name of the Featurizer.
f			Function that accepts a `pd.DataFrame` as input and outputs a `pd.DataFrame`.
resources	`Optional[Dict[str, Any]]`	`None`	Resources passed in to `f` via `kwargs`.
resources_fn	`Optional[Callable[[], Dict[str, Any]]]`	`None`	A function that generates a dictionary of values to pass to `f` via `kwargs`, for values that are too expensive to serialize as resources.
input_schema	`Dict[str, Any]`		Dictionary mapping from column to dtype. This must include all the columns required by `f`.
output_schema	`Optional[Dict[str, Any]]`	`None`	Dictionary mapping from column to dtype. `f` must add exactly the columns specified here to the dataframe.
is_join_featurizer	`bool`	`False`	If True, the join of the documents and candidates dataframes is given to `f` as input. Should be False for classification tasks.
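
To make the `resources` and `resources_fn` rows more concrete, the sketch below emulates the keyword-argument hand-off in plain Python. It assumes each dictionary key arrives in `f` as a keyword argument of the same name. `call_with_resources`, `offset`, and `lookup` are hypothetical names used only for illustration, and the real decorator's call mechanics may differ.

```python
import pandas as pd


def add_offset(df: pd.DataFrame, offset: int, lookup: dict) -> pd.DataFrame:
    # `offset` and `lookup` stand in for values supplied through resources.
    return df.assign(mycol2=df.mycol + offset, label=df.mycol.map(lookup))


def build_lookup():
    # Stand-in for resources_fn: builds values at call time instead of serializing them.
    return {"lookup": {1: "one", 2: "two"}}


def call_with_resources(f, df, resources=None, resources_fn=None):
    """Hypothetical emulation (not snorkelflow code) of passing resources via kwargs."""
    kwargs = dict(resources or {})
    if resources_fn is not None:
        kwargs.update(resources_fn())
    return f(df, **kwargs)


df = pd.DataFrame({"mycol": [1, 2]})
out = call_with_resources(
    add_offset, df, resources={"offset": 10}, resources_fn=build_lookup
)
```

Small, cheaply serialized values fit `resources`, while objects that are costly to serialize (a large lookup table or a loaded model, for instance) are better produced by `resources_fn`.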

__init__(*, input_schema, name=None, resources=None, resources_fn=None, output_schema=None, is_join_featurizer=False)

Methods

__init__(*, input_schema[, name, resources, ...])
