snorkelflow.rich_docs
Utilities to manipulate Rich Document structures for labeling functions and operators in Snorkel Flow.
Labeling functions that use rich documents
When rich document information is available, you can write labeling functions over the RichDoc object. The RichDocWrapper class provides additional helpers and derived properties for a given RichDoc object. You can import both classes and use them to operate on the serialized RichDoc object.
from snorkelflow.lfs.lfs import labeling_function

@labeling_function(name="rich_doc_lf")
def rich_doc_lf(x):
    from snorkelflow.rich_docs import RichDocCols, RichDoc
    from rich_doc_wrapper import RichDocWrapper

    # Deserialize the RichDoc object
    rd = x[RichDocCols.PAGE_DOCS].apply(lambda x: RichDoc.from_page_docs(x))

    # Wrap RichDoc object and text in RichDocWrapper class
    rdw = RichDocWrapper(rd)

    # Get the left bounding box coordinate for the given span
    span_left = rdw.get_span_ngram(x["char_start"], x["char_end"]).left

    # Label INVALID if span bounding box begins too far to the left
    if span_left < 1100:
        return "INVALID"
    return "UNKNOWN"

# Register the labeling function. `sf` (the SDK client) and `node` (the
# target node ID) are assumed to be defined earlier in the session.
sf.add_code_lf(node, rich_doc_lf, label="INVALID")
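The threshold check in the labeling function above can be tried without a Snorkel Flow installation. The following is a minimal sketch, not the library's implementation: `Box` and `label_by_left_edge` are hypothetical stand-ins that are not part of `snorkelflow.rich_docs`. Only the `.left` attribute, the 1100 cutoff, and the two label strings come from the example above.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the object returned by get_span_ngram();
# only the `left` attribute is taken from the example above.
@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float


LEFT_CUTOFF = 1100  # same threshold as in rich_doc_lf


def label_by_left_edge(box: Box, cutoff: float = LEFT_CUTOFF) -> str:
    """Return "INVALID" when the box's left edge is below the cutoff."""
    if box.left < cutoff:
        return "INVALID"
    return "UNKNOWN"


if __name__ == "__main__":
    # A span whose left edge is below the cutoff, then one at or beyond it
    print(label_by_left_edge(Box(left=250, top=100, right=400, bottom=120)))
    print(label_by_left_edge(Box(left=1300, top=100, right=1450, bottom=120)))
```

Keeping the cutoff as a parameter makes it straightforward to try other values against sample coordinates before committing one to a labeling function.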
Classes
| Serializable wrapper for PDF structures detected from DocumentLayoutFeaturizer. |
| Serializable wrapper for lists of horizontal and vertical lines in an image. |
| An object representing a document with rich formatting preserved. |
| Base class that specifies Rich Doc columns. |
| Serializable wrapper for a list of RichDoc objects. |
| Interface for types that have a custom serialization / deserialization function. |
| Serializable wrapper for horizontal clusters of words. |
Exceptions
| An exception raised when a RichDoc was expected and could not be found. |