snorkelflow.operators.pandas_operator
- class snorkelflow.operators.pandas_operator(*, input_schema, name=None, resources=None, resources_fn=None, output_schema=None)
Bases: object
Decorator that wraps a function mapping a pandas.DataFrame to another pandas.DataFrame.
Although @pandas_operator lets you define custom operators using simple Pandas syntax, the operators are automatically executed and parallelized using Dask under the hood.
Examples
In the following example, a function that adds a random number to each row is defined. An output_schema is explicitly added to indicate the column schema of the resulting DataFrame. This is used for validation and may provide you with helpful error messages.
import pandas as pd

from snorkelflow.operators import pandas_operator

@pandas_operator(name="Add Random Number", input_schema={}, output_schema={"rand_num": float})
def add_rand_num(df: pd.DataFrame) -> pd.DataFrame:
    import random

    # Add one random float in [0, 1) per row.
    df["rand_num"] = [random.random() for _ in range(len(df))]
    return df

# `sf` is not defined in this example; it must already exist in your session.
sf.add_operator(add_rand_num)

- Parameters:
  - name (Optional[str], default: None) – Name of the Operator.
  - resources (Optional[Dict[str, Any]], default: None) – Resources passed in to f (the decorated function) via kwargs.
  - resources_fn (Optional[Callable[[], Dict[str, Any]]], default: None) – A function for generating a dictionary of values, passed to f via kwargs, that are too expensive to serialize as resources.
  - input_schema (Dict[str, Any]) – Dictionary mapping from column to dtype, used to validate the dtypes of the input dataframe.
  - output_schema (Optional[Dict[str, Any]], default: None) – Dictionary mapping from column to dtype, used to validate the dtypes of the output dataframe. If not None, then f must not delete any dataframe columns, and all new columns must be specified along with types in output_schema.
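The output_schema rule above can be sketched in plain pandas. This is only an illustration of the documented constraint, not the library's implementation: check_output_schema is a hypothetical helper that does not exist in snorkelflow, and the exact way the library compares dtypes is an assumption here.

```python
from typing import Any, Dict

import pandas as pd


def check_output_schema(
    df_in: pd.DataFrame, df_out: pd.DataFrame, output_schema: Dict[str, Any]
) -> None:
    """Hypothetical helper (not part of snorkelflow) mirroring the documented rule."""
    # Rule 1: f must not delete any column that was present in the input.
    deleted = [c for c in df_in.columns if c not in df_out.columns]
    if deleted:
        raise ValueError(f"columns deleted by f: {deleted}")

    # Rule 2: every column added by f must be declared in output_schema.
    undeclared = [
        c for c in df_out.columns
        if c not in df_in.columns and c not in output_schema
    ]
    if undeclared:
        raise ValueError(f"new columns missing from output_schema: {undeclared}")

    # Each declared column must exist and carry the declared dtype.
    # Assumption: dtypes are compared after normalizing with pandas_dtype.
    for col, dtype in output_schema.items():
        if col not in df_out.columns:
            raise ValueError(f"declared column not found in output: {col}")
        expected = pd.api.types.pandas_dtype(dtype)
        if df_out[col].dtype != expected:
            raise ValueError(
                f"column {col!r} has dtype {df_out[col].dtype}, expected {expected}"
            )


# Usage: a frame that keeps its input column and adds a declared float column.
df_in = pd.DataFrame({"text": ["a", "b"]})
df_out = df_in.assign(rand_num=[0.1, 0.2])
check_output_schema(df_in, df_out, {"rand_num": float})  # raises nothing
```

Under these assumptions, dropping "text" from df_out or adding a column that is absent from output_schema would raise a ValueError, which is the kind of early error message the schema is meant to provide.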
- __init__(*, input_schema, name=None, resources=None, resources_fn=None, output_schema=None)
Methods
__init__(*, input_schema[, name, resources, ...])