snorkelflow.rich_docs.RichDocCols
- class snorkelflow.rich_docs.RichDocCols
Bases: object
Base class that specifies Rich Doc columns.
- __init__()
Methods
__init__()
Attributes
- CHECKBOXES = 'checkboxes'
A Layout object that captures checkboxes in a doc; requires a CheckboxFeaturizer in the DAG
- CONTEXT_PAGES = 'context_pages'
JSON dump containing the positions of words, lines, etc.
- DOC_COL = 'rich_doc'
The RichDoc object, which can be used to extract properties of the document
- HV_LINES = 'hv_lines'
A HVLines object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG
- IS_BOTTOM_CHECKBOX_CHECKED = 'is_bottom_checkbox_checked'
A boolean field that denotes whether a bottom checkbox is checked or not
- IS_CHECKED = 'is_checked'
A boolean field that denotes whether a checkbox is checked or not
- IS_LEFT_CHECKBOX_CHECKED = 'is_left_checkbox_checked'
A boolean field that denotes whether a left checkbox is checked or not
- IS_RIGHT_CHECKBOX_CHECKED = 'is_right_checkbox_checked'
A boolean field that denotes whether a right checkbox is checked or not
- IS_TABLE_SPAN = 'is_table_span'
- IS_TOP_CHECKBOX_CHECKED = 'is_top_checkbox_checked'
A boolean field that denotes whether a top checkbox is checked or not
- JSON_COL = 'rich_doc_json'
- PAGE_CHAR_STARTS = 'page_char_starts'
A list of character offsets that denote where in the rich_doc_text each page starts
- PAGE_DOCS = 'page_docs'
A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order
- PDF_URL_COL = 'rich_doc_pdf_url'
The URL/file location of the PDF
- PKL_COL = 'rich_doc_pkl'
Serialized version of the RichDoc object
- SPAN_END_CHAR_OFFSET = 'rich_doc_span_end_char_offset'
The offset from the char_end of the last word in a span
- SPAN_END_WORD_ID = 'rich_doc_span_end_word_id'
The id of the word at the end of the span, relative to all other words in the document
- SPAN_NGRAM = 'rich_doc_span_ngram'
An n-gram of the words in the span; see Ngram for more
- SPAN_PAGE_ID = 'rich_doc_span_page_id'
The page number that the span belongs to
- SPAN_START_CHAR_OFFSET = 'rich_doc_span_start_char_offset'
The offset from the char_start of the first word in a span
- SPAN_START_WORD_ID = 'rich_doc_span_start_word_id'
The id of the word at the start of the span, relative to all other words in the document
- TABLES = 'tables'
- TABLE_COLUMN_ID = 'table_column_id'
- TEXT_CLUSTERS = 'text_clusters'
A TextClusters object that captures horizontal clusters of words; requires a TextClusterer in the DAG
- TEXT_CLUSTER_ID = 'text_cluster_id'
A global cluster id associated with each text cluster
- TEXT_COL = 'rich_doc_text'
Extracted plain text from the PDF
- TEXT_REGION_ID = 'text_region_id'
When a LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span
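The attribute values above are plain column-name strings, so they can be used as keys when reading a row of data. The following is a minimal sketch of that idea, not the library's own implementation. It uses a hypothetical stand-in class `Cols` that copies three of the documented values (so it runs without snorkelflow installed) and a hypothetical helper `page_index_for_offset`. It assumes that `page_char_starts` is sorted in ascending order, begins at 0, and that pages are counted from 0; none of these details are confirmed by the reference above.

```python
from bisect import bisect_right


class Cols:
    """Hypothetical stand-in mirroring a few RichDocCols values listed above."""

    TEXT_COL = "rich_doc_text"
    PAGE_CHAR_STARTS = "page_char_starts"
    IS_CHECKED = "is_checked"


def page_index_for_offset(page_char_starts, char_offset):
    """Return the 0-based index of the page that contains char_offset.

    Assumption: page_char_starts is sorted ascending and starts at 0.
    """
    if char_offset < 0:
        raise ValueError("char_offset must be non-negative")
    # bisect_right counts how many page starts are <= char_offset.
    return bisect_right(page_char_starts, char_offset) - 1


# Hypothetical row: two pages, with the second page starting at character 12.
row = {
    Cols.TEXT_COL: "First page. Second page.",
    Cols.PAGE_CHAR_STARTS: [0, 12],
    Cols.IS_CHECKED: False,
}

# Locate a character offset in the text, then map it to its page.
offset = row[Cols.TEXT_COL].index("Second")
page = page_index_for_offset(row[Cols.PAGE_CHAR_STARTS], offset)
```

In real code the constants would presumably be read from `RichDocCols` itself (for example `RichDocCols.PAGE_CHAR_STARTS`) rather than from a copy. A binary search keeps the lookup cheap for long documents, but a linear scan would work equally well for a handful of pages.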