snorkelflow.client.analyses.get_ngram_counts
- snorkelflow.client.analyses.get_ngram_counts(node, df, field, label=None, stop_words='english', ngram_range=(1, 3))
Return a dictionary mapping each n-gram to its count in the dataset.
If label is specified, only data points whose ground truth is that label are counted.
Examples
For example, to see the n-grams that have a significantly higher relative count for EMPLOYMENT than for all classes in the dev split:
import matplotlib.pyplot as plt
import snorkelflow.client as sf
%matplotlib inline
df = sf.get_dataset(node, split="dev")
counts_dict = sf.get_ngram_counts(node, df, field="text")
class_counts_dict = sf.get_ngram_counts(
    node,
    df,
    field="text",
    label="employment"
)
sf.plot_distinctive_ngram_histogram(
    node,
    counts_dict,
    class_counts_dict,
    num_ngrams=10
)
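To make the comparison in the example above concrete, here is a minimal sketch of one way to rank n-grams by how much more frequent they are within a class than overall. The helper `distinctive_ngrams` and the toy count dictionaries are hypothetical, and the ratio-of-relative-counts score is an assumption; the source does not state how `plot_distinctive_ngram_histogram` ranks n-grams.

```python
def distinctive_ngrams(counts, class_counts, num_ngrams=10):
    """Rank n-grams by (relative count in class) / (relative count overall).

    Hypothetical helper for illustration only; not part of snorkelflow.
    """
    total = sum(counts.values())
    class_total = sum(class_counts.values())
    scores = {}
    for ngram, class_count in class_counts.items():
        overall_count = counts.get(ngram, 0)
        if overall_count == 0:
            # Skip n-grams missing from the overall counts to avoid dividing by zero.
            continue
        class_rel = class_count / class_total
        overall_rel = overall_count / total
        scores[ngram] = class_rel / overall_rel
    # Highest ratio first: these n-grams are over-represented in the class.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:num_ngrams]


# Toy dictionaries shaped like the return value of get_ngram_counts.
counts = {"salary": 4, "agreement": 10, "lease": 6}
class_counts = {"salary": 4, "agreement": 4}
top = distinctive_ngrams(counts, class_counts, num_ngrams=2)
print(top)
```

In this toy data, "salary" ranks first because every occurrence falls inside the class, while "agreement" is no more common in the class than overall.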
Parameters
| Name | Type | Default | Info |
| --- | --- | --- | --- |
| node | int | | UID of the node. |
| df | DataFrame | | The data frame with a text field over which we calculate n-gram counts. |
| field | str | | Text field in df over which we calculate n-gram counts. |
| label | Optional[str] | None | Label string used to filter data points to those whose ground truth is this value when calculating n-gram counts. |
| stop_words | Optional[str] | 'english' | Words to ignore when building n-gram counts. Defaults to "english". See sklearn.CountVectorizer documentation. |
| ngram_range | Optional[Tuple] | (1, 3) | The lower and upper boundary of the range of n-grams to consider. See sklearn.CountVectorizer documentation. |
Raises
ValueError – If label is not in the set of valid labels for the task, or if field is not in the set of columns in df.
Returns
A dictionary where n-gram names are keys and counts are values.
Return type
dict
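The stop_words and ngram_range parameters are documented as following sklearn's CountVectorizer. As a rough illustration of what they control, the sketch below counts word n-grams in pure Python. The function `count_ngrams` and its small stop list are hypothetical stand-ins: sklearn's built-in "english" list is much longer, and this is not the library's implementation.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not sklearn's actual list).
STOP_WORDS = {"the", "a", "an", "of", "for", "is", "in", "and", "to"}


def count_ngrams(texts, stop_words=STOP_WORDS, ngram_range=(1, 3)):
    """Count word n-grams across texts, roughly in the style of CountVectorizer."""
    min_n, max_n = ngram_range
    counts = Counter()
    for text in texts:
        # Lowercase and keep tokens of two or more word characters.
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        # Stop words are dropped before n-grams are formed.
        tokens = [t for t in tokens if t not in stop_words]
        # ngram_range is inclusive on both ends.
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return dict(counts)


texts = ["The employment agreement", "Terms of the employment agreement"]
ngram_counts = count_ngrams(texts, ngram_range=(1, 2))
print(ngram_counts)
```

Because stop words are removed before n-grams are built, the second text yields the bigram "terms employment" even though those words are not adjacent in the original sentence.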