Version: 0.91

snorkelflow.rich_docs

Utilities to manipulate Rich Document structures for labeling functions and operators in Snorkel Flow.

Labeling functions that use rich documents

When rich document information is available, you can write labeling functions over the RichDoc object. The RichDocWrapper class adds helpers and derived properties for a given RichDoc object. Import both classes inside the labeling function and use them to operate over the serialized RichDoc object, as in the following example:

from snorkelflow.lfs.lfs import labeling_function

@labeling_function(name="rich_doc_lf")
def rich_doc_lf(x):
    # Import inside the function body so the labeling function is self-contained
    from snorkelflow.rich_docs import RichDocCols, RichDoc
    from rich_doc_wrapper import RichDocWrapper

    # Deserialize the RichDoc object
    rd = x[RichDocCols.PAGE_DOCS].apply(lambda page_docs: RichDoc.from_page_docs(page_docs))
    # Wrap the RichDoc object in the RichDocWrapper class
    rdw = RichDocWrapper(rd)

    # Get the left bounding box coordinate for the given span
    span_left = rdw.get_span_ngram(x["char_start"], x["char_end"]).left
    # Label INVALID if span bounding box begins too far to the left
    if span_left < 1100:
        return "INVALID"
    return "UNKNOWN"

sf.add_code_lf(node, rich_doc_lf, label="INVALID")
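The decision rule in the labeling function above can be tried out without a Snorkel Flow installation. The following is a minimal sketch, not part of the snorkelflow API: SpanBox, label_by_left_edge, and the min_left parameter are hypothetical names introduced here for illustration, and only the 1100 threshold and the "INVALID" / "UNKNOWN" return values come from the example above.

```python
from dataclasses import dataclass

# Value the labeling function returns when it does not vote INVALID.
UNKNOWN = "UNKNOWN"


@dataclass
class SpanBox:
    """Hypothetical stand-in for a span's bounding box (not a snorkelflow class)."""

    left: float
    top: float
    right: float
    bottom: float


def label_by_left_edge(box: SpanBox, min_left: float = 1100) -> str:
    """Return "INVALID" when the box begins left of min_left, else "UNKNOWN"."""
    if box.left < min_left:
        return "INVALID"
    return UNKNOWN


# A span that starts near the left margin and one that starts further right.
near_margin = SpanBox(left=250, top=40, right=420, bottom=60)
far_right = SpanBox(left=1300, top=40, right=1500, bottom=60)

print(label_by_left_edge(near_margin))
print(label_by_left_edge(far_right))
```

Keeping the threshold as a parameter makes it easy to check boundary cases (a box whose left edge is exactly 1100 is not labeled INVALID, because the comparison is strict) before moving the rule into a labeling function.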

Classes

DocumentLayout(structures)

Serializable wrapper for PDF structures detected from DocumentLayoutFeaturizer.

HVLines(dfs_horz, dfs_vert)

Serializable wrapper for lists of horizontal and vertical lines in an image.

RichDoc(pages, areas, pars, lines, words[, ...])

An object representing a document with rich formatting preserved.

RichDocCols()

Base class that specifies Rich Doc columns.

RichDocList(rich_docs)

Serializable wrapper for a list of RichDoc objects.

Serializable()

Interface for types that have a custom serialization / deserialization function.
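To illustrate the general pattern of a type with custom serialization and deserialization, here is a minimal sketch. It does not reproduce the actual Serializable interface: the class names SerializableSketch and WordBoxes, the method names serialize and deserialize, and the JSON encoding are all assumptions made for this example.

```python
import json
from abc import ABC, abstractmethod
from typing import List


class SerializableSketch(ABC):
    """Hypothetical interface: subclasses define their own string round-trip."""

    @abstractmethod
    def serialize(self) -> str:
        """Encode this object as a string."""

    @classmethod
    @abstractmethod
    def deserialize(cls, payload: str) -> "SerializableSketch":
        """Rebuild an object from a string produced by serialize()."""


class WordBoxes(SerializableSketch):
    """Hypothetical container of words and their left coordinates."""

    def __init__(self, words: List[str], lefts: List[int]) -> None:
        self.words = list(words)
        self.lefts = list(lefts)

    def serialize(self) -> str:
        # JSON is an assumption here; any reversible encoding would work.
        return json.dumps({"words": self.words, "lefts": self.lefts})

    @classmethod
    def deserialize(cls, payload: str) -> "WordBoxes":
        data = json.loads(payload)
        return cls(data["words"], data["lefts"])


original = WordBoxes(["Invoice", "Total"], [120, 1340])
restored = WordBoxes.deserialize(original.serialize())
print(restored.words, restored.lefts)
```

The point of such an interface is that a wrapper object can be stored in a column as a string and rebuilt on demand, which is the same shape as the deserialize-then-wrap step in the labeling function example.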

TextClusters(word_to_cluster, df_clusters)

Serializable wrapper for horizontal clusters of words.

Exceptions

MissingRichDocException

An Exception raised when a RichDoc was expected and could not be found.