Version: 25.4

snorkelflow.client.analyses.get_ngram_counts

snorkelflow.client.analyses.get_ngram_counts(node, df, field, label=None, stop_words='english', ngram_range=(1, 3))

Return a dictionary mapping each n-gram to its count in the dataset.

If label is specified, only data points whose ground truth equals label are counted.

Examples

For example, to see the n-grams in the dev split that have a significantly higher relative count for EMPLOYMENT than for all classes combined:

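# Assumes `sf` refers to the snorkelflow.client module and `node` holds the UID of an existing node.
import matplotlib.pyplot as plt
# Jupyter/IPython magic; omit outside a notebook.
%matplotlib inline

# Load the dev split as a DataFrame.
df = sf.get_dataset(node, split="dev")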

counts_dict = sf.get_ngram_counts(node, df, field="text")
class_counts_dict = sf.get_ngram_counts(
    node,
    df,
    field="text",
    label="employment"
)
sf.plot_distinctive_ngram_histogram(
    node,
    counts_dict,
    class_counts_dict,
    num_ngrams=10
)
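
To make "significantly higher relative count" concrete, the sketch below compares two count dictionaries of the kind returned by get_ngram_counts. It is only one plausible way to rank distinctive n-grams, not necessarily what plot_distinctive_ngram_histogram does internally; the helper names relative_counts and distinctive_ngrams and the sample counts are hypothetical.

```python
def relative_counts(counts):
    """Convert raw counts to fractions of the total count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ngram: count / total for ngram, count in counts.items()}


def distinctive_ngrams(all_counts, class_counts, num_ngrams=10):
    """Rank n-grams by how much their relative count in one class
    exceeds their relative count over all classes.

    Hypothetical helper for illustration; not part of the SDK.
    """
    all_rel = relative_counts(all_counts)
    class_rel = relative_counts(class_counts)
    diffs = {
        ngram: class_rel[ngram] - all_rel.get(ngram, 0.0)
        for ngram in class_rel
    }
    ranked = sorted(diffs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:num_ngrams]


# Made-up counts standing in for the two get_ngram_counts results.
all_counts = {"salary": 4, "contract": 4, "loan": 8}
class_counts = {"salary": 4, "contract": 2}
top = distinctive_ngrams(all_counts, class_counts, num_ngrams=2)
```

Here "salary" ranks first because it makes up a much larger share of the class counts than of the overall counts.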

Parameters

Name	Type	Default	Info
node	`int`		UID of the node.
df	`DataFrame`		The data frame with a text field over which n-gram counts are calculated.
field	`str`		Text field in df over which n-gram counts are calculated.
label	`Optional[str]`	`None`	Label string used as a filter: only data points whose ground truth equals this value are counted.
stop_words	`Optional[str]`	`'english'`	Words to ignore when building n-gram counts. Defaults to `"english"`. See the `sklearn.feature_extraction.text.CountVectorizer` documentation.
ngram_range	`Optional[Tuple]`	`(1, 3)`	The lower and upper boundary of the range of n-grams to consider. See the `sklearn.feature_extraction.text.CountVectorizer` documentation.

Raises

ValueError – If label is not in the set of valid labels for the task, or if field is not in the set of columns in df.

Returns

A dictionary where n-gram names are keys and counts are values.

Return type

dict
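
To illustrate how stop_words and ngram_range shape the result, here is a simplified pure-Python stand-in for word n-gram counting. It is a sketch, not the SDK's implementation: it assumes lowercase tokens of two or more word characters and stop-word removal before n-grams are formed, as scikit-learn's CountVectorizer does by default. The function name count_ngrams and the tiny STOP_WORDS set are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only; scikit-learn's
# built-in "english" list is much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "to", "and", "in"}


def count_ngrams(texts, stop_words=STOP_WORDS, ngram_range=(1, 3)):
    """Count word n-grams across a list of strings (simplified sketch)."""
    min_n, max_n = ngram_range
    counts = Counter()
    for text in texts:
        # Lowercase, then keep tokens of two or more word characters.
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        # Drop stop words before forming n-grams.
        tokens = [token for token in tokens if token not in stop_words]
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return dict(counts)


example_counts = count_ngrams(
    ["The employment contract is signed", "Employment contract terms"],
    ngram_range=(1, 2),
)
```

With ngram_range=(1, 2), the result contains single words and two-word phrases such as "employment contract", and no stop words appear as keys.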
