Version: 25.4

snorkelflow.client.analyses.get_ngram_counts

snorkelflow.client.analyses.get_ngram_counts(node, df, field, label=None, stop_words='english', ngram_range=(1, 3))

Return a dictionary mapping each n-gram to its count in the dataset.

If label is specified, only data points whose ground truth equals that label are counted.

Examples

For example, to see all the n-grams that have a significantly higher relative count for EMPLOYMENT than for all classes in the dev split:

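import matplotlib.pyplot as plt
import snorkelflow.client as sf
%matplotlib inline

# `node` is the UID of the node (an int), defined elsewhere.
df = sf.get_dataset(node, split="dev")

# N-gram counts over all data points in the dev split.
counts_dict = sf.get_ngram_counts(node, df, field="text")
# N-gram counts restricted to data points whose ground truth is "employment".
class_counts_dict = sf.get_ngram_counts(
    node,
    df,
    field="text",
    label="employment",
)
# Plot the n-grams that are most distinctive for the class.
sf.plot_distinctive_ngram_histogram(
    node,
    counts_dict,
    class_counts_dict,
    num_ngrams=10,
)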
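To make "higher relative count" concrete, the sketch below compares two count dictionaries of the shape returned by get_ngram_counts. The scoring used by plot_distinctive_ngram_histogram is not documented on this page, so the difference of relative frequencies used here is an assumption, and distinctive_ngrams is a hypothetical helper, not part of the SDK. The counts are made up for illustration.

```python
def distinctive_ngrams(counts, class_counts, num_ngrams=10):
    """Rank n-grams by how much more frequent they are within one class.

    Assumption: the score is (relative count in class) - (relative count overall).
    """
    total = sum(counts.values())
    class_total = sum(class_counts.values())
    if total == 0 or class_total == 0:
        return []
    scores = {}
    for ngram, class_count in class_counts.items():
        overall_rel = counts.get(ngram, 0) / total
        class_rel = class_count / class_total
        scores[ngram] = class_rel - overall_rel
    # Highest score first, truncated to the requested number of n-grams.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:num_ngrams]


# Made-up counts, for illustration only.
counts_all = {"salary": 10, "lease": 10, "party": 20}
counts_employment = {"salary": 8, "party": 8}
top = distinctive_ngrams(counts_all, counts_employment, num_ngrams=1)
print(top)
```

Here "party" is common in the class but equally common overall, so only "salary" stands out.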

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| node | int | | UID of the node. |
| df | DataFrame | | The data frame with a text field over which we calculate n-gram counts. |
| field | str | | Text field in df over which we calculate n-gram counts. |
| label | Optional[str] | None | If given, only data points whose ground truth equals this label string are used to calculate n-gram counts. |
| stop_words | Optional[str] | 'english' | Words to ignore when building n-gram counts. Defaults to "english". See the scikit-learn CountVectorizer documentation. |
| ngram_range | Optional[Tuple] | (1, 3) | The lower and upper boundary of the range of n-grams to consider. See the scikit-learn CountVectorizer documentation. |
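As a rough illustration of how stop_words and ngram_range shape the result, here is a minimal pure-Python sketch of word n-gram counting. It is not the SDK's implementation: count_ngrams and the tiny STOP_WORDS set are hypothetical, and whitespace tokenization is a simplification of what CountVectorizer does.

```python
from collections import Counter

# Tiny illustrative stop-word set; a real "english" list is much larger.
STOP_WORDS = {"the", "a", "of", "is", "for", "and"}


def count_ngrams(texts, ngram_range=(1, 3), stop_words=STOP_WORDS):
    """Count word n-grams across texts after dropping stop words."""
    low, high = ngram_range
    counts = Counter()
    for text in texts:
        # Lowercase, split on whitespace, and remove stop words first,
        # so n-grams are formed over the remaining tokens.
        tokens = [t for t in text.lower().split() if t not in stop_words]
        for n in range(low, high + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return dict(counts)


example = count_ngrams(
    ["the employment contract", "employment contract terms"],
    ngram_range=(1, 2),
)
print(example)
```

With ngram_range=(1, 2), only unigrams and bigrams are counted, and "the" never appears as a key because it is filtered out before n-grams are built.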

Raises

ValueError – If label is not in the set of valid labels for the task, or if field is not in the set of columns in df.

Returns

A dictionary where n-gram names are keys and counts are values

Return type

dict
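Because the return value is a plain dict of n-gram to count, standard Python tools apply to it directly. The sketch below, using made-up counts in place of a real call, picks the most frequent n-grams with collections.Counter.

```python
from collections import Counter

# Hypothetical return value, made up for illustration.
ngram_counts = {"employment agreement": 12, "employee": 30, "termination": 7}

# most_common(k) returns the k highest-count (n-gram, count) pairs.
top_two = Counter(ngram_counts).most_common(2)
print(top_two)
```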