snorkelflow.rich_docs.RichDocCols
- class snorkelflow.rich_docs.RichDocCols
- Bases: object
- Base class that specifies Rich Doc columns.
 - __init__()
 - CONTEXT_PAGES = 'context_pages'
- JSON dump containing the positions of words, lines, etc 
 - DOC_COL = 'rich_doc'
- The RichDoc object, which can be used to extract properties of the document 
 - HV_LINES = 'hv_lines'
- An HVLines object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG
 - JSON_COL = 'rich_doc_json'
 - LAYOUT_STRUCTURE_BBOXS = 'layout_structure_bboxs'
- When DocumentLayoutFeaturizer is added, the user can group layout structures into regions 
 - PAGE_CHAR_STARTS = 'page_char_starts'
- A list of character offsets that denote where in the rich_doc_text each page starts 
 - PAGE_DOCS = 'page_docs'
- A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order 
 - PDF_URL_COL = 'rich_doc_pdf_url'
- The URL/file location of the PDF 
 - PKL_COL = 'rich_doc_pkl'
- Serialized version of the RichDoc object 
 - SPAN_END_CHAR_OFFSET = 'rich_doc_span_end_char_offset'
- The offset from the char_end of the last word in a span 
 - SPAN_END_WORD_ID = 'rich_doc_span_end_word_id'
- The id of the word at the end of the span, relative to all other words in the document 
 - SPAN_NGRAM = 'rich_doc_span_ngram'
- An n-gram of the words in the span; see Ngram for more
 - SPAN_PAGE_ID = 'rich_doc_span_page_id'
- The page number that the span belongs to
 - SPAN_START_CHAR_OFFSET = 'rich_doc_span_start_char_offset'
- The offset from the char_start of the first word in a span 
 - SPAN_START_WORD_ID = 'rich_doc_span_start_word_id'
- The id of the word at the start of the span, relative to all other words in the document 
 - TEXT_CLUSTERS = 'text_clusters'
- A TextClusters object that captures horizontal clusters of words; requires a TextClusterer in the DAG
 - TEXT_CLUSTER_ID = 'text_cluster_id'
- A global cluster id associated with each text cluster 
 - TEXT_COL = 'rich_doc_text'
- Extracted plain text from the PDF 
 - TEXT_REGION_ID = 'text_region_id'
- When LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span
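
Because every attribute above is a plain string column name, a row can be indexed by constant rather than by literal. The sketch below is illustrative only: `RichDocColsSketch` is a local stand-in (not the library class) whose two values are copied from this page, the `row` dict is made-up data, and `page_index_for_offset` is a hypothetical helper. It shows one way PAGE_CHAR_STARTS might be used to find the page that contains a given character offset in the TEXT_COL text. The returned index is 0-based into the list, which is an assumption and may not match how SPAN_PAGE_ID numbers pages.

```python
from bisect import bisect_right


# Local stand-in so the sketch runs without snorkelflow installed;
# the string values are copied from the attribute list above.
class RichDocColsSketch:
    TEXT_COL = "rich_doc_text"
    PAGE_CHAR_STARTS = "page_char_starts"


def page_index_for_offset(page_char_starts, char_offset):
    """Return the 0-based list index of the page containing char_offset.

    Hypothetical helper: assumes page_char_starts is sorted ascending.
    """
    if not page_char_starts or char_offset < page_char_starts[0]:
        raise ValueError("offset precedes the first page")
    # bisect_right counts how many page starts are <= char_offset.
    return bisect_right(page_char_starts, char_offset) - 1


# Made-up row: two pages of text joined into one string.
row = {
    RichDocColsSketch.TEXT_COL: "first page text second page text",
    RichDocColsSketch.PAGE_CHAR_STARTS: [0, 16],
}

offset = row[RichDocColsSketch.TEXT_COL].index("second")
page = page_index_for_offset(row[RichDocColsSketch.PAGE_CHAR_STARTS], offset)
```

With the real class, the same lookups would presumably read `row[RichDocCols.TEXT_COL]` and `row[RichDocCols.PAGE_CHAR_STARTS]`, assuming the row exposes those columns.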