Version: 0.96

snorkelflow.rich_docs.RichDocCols

class snorkelflow.rich_docs.RichDocCols

Bases: object

Base class that specifies Rich Doc columns.

__init__()

Methods

__init__()

Attributes

`CHECKBOXES`	A `Layout` object that captures checkboxes in a doc; requires a CheckboxFeaturizer in the DAG
`CONTEXT_PAGES`	JSON dump containing the positions of words, lines, etc.
`DOC_COL`	The RichDoc object, which can be used to extract properties of the document
`HV_LINES`	A `HVLines` object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG
`IS_BOTTOM_CHECKBOX_CHECKED`	A boolean field that denotes whether a bottom checkbox is checked or not
`IS_CHECKED`	A boolean field that denotes whether a checkbox is checked or not
`IS_LEFT_CHECKBOX_CHECKED`	A boolean field that denotes whether a left checkbox is checked or not
`IS_RIGHT_CHECKBOX_CHECKED`	A boolean field that denotes whether a right checkbox is checked or not
`IS_TABLE_SPAN`
`IS_TOP_CHECKBOX_CHECKED`	A boolean field that denotes whether a top checkbox is checked or not
`JSON_COL`
`PAGE_CHAR_STARTS`	A list of character offsets that denote where in the rich_doc_text each page starts
`PAGE_DOCS`	A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order
`PDF_URL_COL`	The URL/file location of the PDF
`PKL_COL`	Serialized version of the RichDoc object
`SPAN_END_CHAR_OFFSET`	The offset from the char_end of the last word in a span
`SPAN_END_WORD_ID`	The id of the word at the end of the span, relative to all other words in the document
`SPAN_NGRAM`	An n-gram of the words in the span; See `Ngram` for more
`SPAN_PAGE_ID`	The page # that the span belongs to
`SPAN_START_CHAR_OFFSET`	The offset from the char_start of the first word in a span
`SPAN_START_WORD_ID`	The id of the word at the start of the span, relative to all other words in the document
`TABLES`
`TABLE_COLUMN_ID`
`TEXT_CLUSTERS`	A `TextClusters` object that captures horizontal clusters of words; requires a TextClusterer in the DAG
`TEXT_CLUSTER_ID`	A global cluster id associated with each text cluster
`TEXT_COL`	Extracted plain text from the PDF
`TEXT_REGION_ID`	When LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span

CHECKBOXES = 'checkboxes': A Layout object that captures checkboxes in a doc; requires a CheckboxFeaturizer in the DAG

CONTEXT_PAGES = 'context_pages': JSON dump containing the positions of words, lines, etc.

DOC_COL = 'rich_doc': The RichDoc object, which can be used to extract properties of the document

HV_LINES = 'hv_lines': A HVLines object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG

IS_BOTTOM_CHECKBOX_CHECKED = 'is_bottom_checkbox_checked': A boolean field that denotes whether a bottom checkbox is checked or not

IS_CHECKED = 'is_checked': A boolean field that denotes whether a checkbox is checked or not

IS_LEFT_CHECKBOX_CHECKED = 'is_left_checkbox_checked': A boolean field that denotes whether a left checkbox is checked or not

IS_RIGHT_CHECKBOX_CHECKED = 'is_right_checkbox_checked': A boolean field that denotes whether a right checkbox is checked or not

IS_TABLE_SPAN = 'is_table_span'

IS_TOP_CHECKBOX_CHECKED = 'is_top_checkbox_checked': A boolean field that denotes whether a top checkbox is checked or not

JSON_COL = 'rich_doc_json'

PAGE_CHAR_STARTS = 'page_char_starts': A list of character offsets that denote where in the rich_doc_text each page starts

PAGE_DOCS = 'page_docs': A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order

PDF_URL_COL = 'rich_doc_pdf_url': The URL/file location of the PDF

PKL_COL = 'rich_doc_pkl': Serialized version of the RichDoc object

SPAN_END_CHAR_OFFSET = 'rich_doc_span_end_char_offset': The offset from the char_end of the last word in a span

SPAN_END_WORD_ID = 'rich_doc_span_end_word_id': The id of the word at the end of the span, relative to all other words in the document

SPAN_NGRAM = 'rich_doc_span_ngram': An n-gram of the words in the span; See Ngram for more

SPAN_PAGE_ID = 'rich_doc_span_page_id': The page # that the span belongs to

SPAN_START_CHAR_OFFSET = 'rich_doc_span_start_char_offset': The offset from the char_start of the first word in a span

SPAN_START_WORD_ID = 'rich_doc_span_start_word_id': The id of the word at the start of the span, relative to all other words in the document

TABLES = 'tables'

TABLE_COLUMN_ID = 'table_column_id'

TEXT_CLUSTERS = 'text_clusters': A TextClusters object that captures horizontal clusters of words; requires a TextClusterer in the DAG

TEXT_CLUSTER_ID = 'text_cluster_id': A global cluster id associated with each text cluster

TEXT_COL = 'rich_doc_text': Extracted plain text from the PDF

TEXT_REGION_ID = 'text_region_id': When LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span
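
The attributes above are plain string constants, so they can be used as column keys wherever a row of Rich Doc data is accessed. The following is a minimal sketch, not taken from the Snorkel Flow SDK, of using PAGE_CHAR_STARTS together with TEXT_COL to find which page a character offset falls on. The RichDocCols class here is a hypothetical stand-in that copies two of the constant values listed above so the example runs without snorkelflow installed. The row dictionary, the page_index_for_offset helper, the 0-based page index, and the assumption that the offsets are sorted and begin at 0 are all illustrative. Whether SPAN_PAGE_ID uses the same numbering is not stated here and is not assumed.

```python
from bisect import bisect_right


# Hypothetical stand-in that mirrors two constants from the listing above.
# With the library installed, import RichDocCols from snorkelflow.rich_docs.
class RichDocCols:
    TEXT_COL = "rich_doc_text"
    PAGE_CHAR_STARTS = "page_char_starts"


def page_index_for_offset(row, char_offset):
    """Return the 0-based index of the page that contains char_offset.

    Assumes row[PAGE_CHAR_STARTS] is a sorted list of offsets into
    row[TEXT_COL], one per page, with the first page starting at 0.
    """
    starts = row[RichDocCols.PAGE_CHAR_STARTS]
    if not 0 <= char_offset < len(row[RichDocCols.TEXT_COL]):
        raise ValueError("char_offset is outside the document text")
    # bisect_right counts the page starts that are <= char_offset;
    # subtracting 1 turns that count into an index.
    return bisect_right(starts, char_offset) - 1


# Made-up row: three pages of 10, 15, and 5 characters.
row = {
    RichDocCols.TEXT_COL: "a" * 10 + "b" * 15 + "c" * 5,
    RichDocCols.PAGE_CHAR_STARTS: [0, 10, 25],
}

print(page_index_for_offset(row, 0))   # 0
print(page_index_for_offset(row, 12))  # 1
print(page_index_for_offset(row, 25))  # 2
```

Referring to columns through the constants rather than literal strings keeps the lookup working if a column's underlying name changes between versions.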
