snorkelflow.operators.pandas_operator
- class snorkelflow.operators.pandas_operator(*, input_schema, name=None, resources=None, resources_fn=None, output_schema=None)
Bases:
objectDecorator that wraps a function mapping a
pandas.DataFrameto anotherpandas.DataFrame.While
@pandas_operatorallows you to define custom operators using simple Pandas syntax, they are automatically executed and parallelized using Dask under the hood.Examples
In the following example, a function that adds a random number to each row is defined. An
output_schemais explicitly added to indicate the column schema of the resulting DataFrame.This is used for validation and may provide you with helpful error messages.
from snorkelflow.operators import pandas_operator
@pandas_operator(name="Add Random Number", input_schema={}, output_schema={"rand_num": float})
def add_rand_num(df: pd.DataFrame) -> pd.DataFrame:
import random
df["rand_num"] = [random.random() for _ in range(len(df))]
return df
sf.add_operator(add_rand_num)Parameters
Parameters
Name Type Default Info name Optional[str]NoneName of the Operator. resources Optional[Dict[str, Any]]NoneResources passed in to fviakwargsresources_fn Optional[Callable[[], Dict[str, Any]]]NoneA function for generating a dictionary of values passed to fviakwargs, that are too expensive to serialize as resources.input_schema Dict[str, Any]Dictionary mapping from column to dtype, used to validate the dtypes of the input dataframe. output_schema Optional[Dict[str, Any]]NoneDictionary mapping from column to dtype, used to validate the dtypes of the output dataframe.
If not
None, thenfmust not delete any dataframe columns, and all new columns must be specified along with types inoutput_schema.- __init__(*, input_schema, name=None, resources=None, resources_fn=None, output_schema=None)
\_\_init\_\_
__init__
Methods
__init__(*, input_schema[, name, resources, ...])