snorkelflow.rich_docs.RichDocCols
- class snorkelflow.rich_docs.RichDocCols
Bases: object
Base class that specifies Rich Doc columns.
- __init__()
Methods
__init__()

Attributes
CHECKBOXES: A Layout object that captures checkboxes in a doc; requires a CheckboxFeaturizer in the DAG
CONTEXT_PAGES: JSON dump containing the positions of words, lines, etc.
DOC_COL: The RichDoc object, which can be used to extract properties of the document
HV_LINES: A HVLines object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG
IS_BOTTOM_CHECKBOX_CHECKED: A boolean field that denotes whether a bottom checkbox is checked or not
IS_CHECKED: A boolean field that denotes whether a checkbox is checked or not
IS_LEFT_CHECKBOX_CHECKED: A boolean field that denotes whether a left checkbox is checked or not
IS_RIGHT_CHECKBOX_CHECKED: A boolean field that denotes whether a right checkbox is checked or not
IS_TABLE_SPAN
IS_TOP_CHECKBOX_CHECKED: A boolean field that denotes whether a top checkbox is checked or not
JSON_COL
PAGE_CHAR_STARTS: A list of character offsets that denote where in the rich_doc_text each page starts
PAGE_DOCS: A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order
PDF_URL_COL: The URL/file location of the PDF
PKL_COL: Serialized version of the RichDoc object
SPAN_END_CHAR_OFFSET: The offset from the char_end of the last word in a span
SPAN_END_WORD_ID: The id of the word at the end of the span, relative to all other words in the document
SPAN_NGRAM: An n-gram of the words in the span; see Ngram for more
SPAN_PAGE_ID: The page # that the span belongs to
SPAN_START_CHAR_OFFSET: The offset from the char_start of the first word in a span
SPAN_START_WORD_ID: The id of the word at the start of the span, relative to all other words in the document
TABLES
TABLE_COLUMN_ID
TEXT_CLUSTERS: A TextClusters object that captures horizontal clusters of words; requires a TextClusterer in the DAG
TEXT_CLUSTER_ID: A global cluster id associated with each text cluster
TEXT_COL: Extracted plain text from the PDF
TEXT_REGION_ID: When LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span

- CHECKBOXES = 'checkboxes'
A Layout object that captures checkboxes in a doc; requires a CheckboxFeaturizer in the DAG
- CONTEXT_PAGES = 'context_pages'
JSON dump containing the positions of words, lines, etc.
- DOC_COL = 'rich_doc'
The RichDoc object, which can be used to extract properties of the document
- HV_LINES = 'hv_lines'
A HVLines object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG
- IS_BOTTOM_CHECKBOX_CHECKED = 'is_bottom_checkbox_checked'
A boolean field that denotes whether a bottom checkbox is checked or not
- IS_CHECKED = 'is_checked'
A boolean field that denotes whether a checkbox is checked or not
- IS_LEFT_CHECKBOX_CHECKED = 'is_left_checkbox_checked'
A boolean field that denotes whether a left checkbox is checked or not
- IS_RIGHT_CHECKBOX_CHECKED = 'is_right_checkbox_checked'
A boolean field that denotes whether a right checkbox is checked or not
- IS_TABLE_SPAN = 'is_table_span'
- IS_TOP_CHECKBOX_CHECKED = 'is_top_checkbox_checked'
A boolean field that denotes whether a top checkbox is checked or not
- JSON_COL = 'rich_doc_json'
- PAGE_CHAR_STARTS = 'page_char_starts'
A list of character offsets that denote where in the rich_doc_text each page starts
- PAGE_DOCS = 'page_docs'
A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order
- PDF_URL_COL = 'rich_doc_pdf_url'
The URL/file location of the PDF
- PKL_COL = 'rich_doc_pkl'
Serialized version of the RichDoc object
- SPAN_END_CHAR_OFFSET = 'rich_doc_span_end_char_offset'
The offset from the char_end of the last word in a span
- SPAN_END_WORD_ID = 'rich_doc_span_end_word_id'
The id of the word at the end of the span, relative to all other words in the document
- SPAN_NGRAM = 'rich_doc_span_ngram'
An n-gram of the words in the span; see Ngram for more
- SPAN_PAGE_ID = 'rich_doc_span_page_id'
The page # that the span belongs to
- SPAN_START_CHAR_OFFSET = 'rich_doc_span_start_char_offset'
The offset from the char_start of the first word in a span
- SPAN_START_WORD_ID = 'rich_doc_span_start_word_id'
The id of the word at the start of the span, relative to all other words in the document
- TABLES = 'tables'
- TABLE_COLUMN_ID = 'table_column_id'
- TEXT_CLUSTERS = 'text_clusters'
A TextClusters object that captures horizontal clusters of words; requires a TextClusterer in the DAG
- TEXT_CLUSTER_ID = 'text_cluster_id'
A global cluster id associated with each text cluster
- TEXT_COL = 'rich_doc_text'
Extracted plain text from the PDF
- TEXT_REGION_ID = 'text_region_id'
When LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span
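
As an illustration of how the PAGE_CHAR_STARTS and TEXT_COL columns relate, the sketch below maps a character offset in rich_doc_text to a page and slices the text into per-page strings. It does not import snorkelflow: the two constants are copied from the string values listed above, and the row is a hypothetical plain dict standing in for one record. The helper names page_index_for_offset and split_text_by_page are made up for this example, and it assumes the offsets are 0-based and sorted in ascending order, which the reference does not state explicitly.

```python
from bisect import bisect_right

# Column names copied from the attribute values listed above.
TEXT_COL = "rich_doc_text"
PAGE_CHAR_STARTS = "page_char_starts"


def page_index_for_offset(page_char_starts, char_offset):
    """Return the 0-based position of the page that contains char_offset.

    Assumes page_char_starts is sorted ascending (an assumption, not
    something the reference guarantees).
    """
    if not page_char_starts or char_offset < page_char_starts[0]:
        raise ValueError("char_offset precedes the first page start")
    # bisect_right counts the page starts that are <= char_offset.
    return bisect_right(page_char_starts, char_offset) - 1


def split_text_by_page(text, page_char_starts):
    """Slice the full document text into one string per page."""
    bounds = list(page_char_starts) + [len(text)]
    return [text[start:end] for start, end in zip(bounds, bounds[1:])]


# Hypothetical record: a plain dict, not a real Snorkel Flow row.
row = {
    TEXT_COL: "Page one text. Page two text. Page three.",
    PAGE_CHAR_STARTS: [0, 15, 30],
}

pages = split_text_by_page(row[TEXT_COL], row[PAGE_CHAR_STARTS])
second_page_index = page_index_for_offset(row[PAGE_CHAR_STARTS], 15)
```

Referring to columns through the constants rather than repeating string literals keeps a lookup such as row[TEXT_COL] in step with the names this class defines.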