Version: 0.91

snorkelflow.rich_docs.RichDocCols

class snorkelflow.rich_docs.RichDocCols

Bases: object

Base class that specifies Rich Doc columns.

__init__()

Methods

__init__()

Attributes

CONTEXT_PAGES

JSON dump containing the positions of words, lines, etc

DOC_COL

The RichDoc object, which can be used to extract properties of the document

HV_LINES

An HVLines object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG

JSON_COL

LAYOUT_STRUCTURE_BBOXS

When DocumentLayoutFeaturizer is added, the user can group layout structures into regions

PAGE_CHAR_STARTS

A list of character offsets that denote where in the rich_doc_text each page starts

PAGE_DOCS

A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order

PDF_URL_COL

The URL/file location of the PDF

PKL_COL

Serialized version of the RichDoc object

SPAN_END_CHAR_OFFSET

The offset from the char_end of the last word in a span

SPAN_END_WORD_ID

The id of the word at the end of the span, relative to all other words in the document

SPAN_NGRAM

An n-gram of the words in the span; see Ngram for more

SPAN_PAGE_ID

The page number that the span belongs to

SPAN_START_CHAR_OFFSET

The offset from the char_start of the first word in a span

SPAN_START_WORD_ID

The id of the word at the start of the span, relative to all other words in the document

TEXT_CLUSTERS

A TextClusters object that captures horizontal clusters of words; requires a TextClusterer in the DAG

TEXT_CLUSTER_ID

A global cluster id associated with each text cluster

TEXT_COL

Extracted plain text from the PDF

TEXT_REGION_ID

When LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span

CONTEXT_PAGES = 'context_pages'

JSON dump containing the positions of words, lines, etc

DOC_COL = 'rich_doc'

The RichDoc object, which can be used to extract properties of the document

HV_LINES = 'hv_lines'

An HVLines object that captures the vertical/horizontal lines in a doc; requires a LinesFeaturizer in the DAG

JSON_COL = 'rich_doc_json'

LAYOUT_STRUCTURE_BBOXS = 'layout_structure_bboxs'

When DocumentLayoutFeaturizer is added, the user can group layout structures into regions

PAGE_CHAR_STARTS = 'page_char_starts'

A list of character offsets that denote where in the rich_doc_text each page starts
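
As a rough illustration of how the page offsets might be used, the sketch below maps a character offset in rich_doc_text to a page index with a binary search. It assumes the offsets are sorted in ascending order, that the first page starts at offset 0, and that pages are indexed from 0; none of this is confirmed by the reference above. The helper page_index_for_offset and the sample row are hypothetical, and the column names are copied as string literals so the snippet runs without snorkelflow installed.

```python
from bisect import bisect_right

# Column names copied from the RichDocCols values documented on this page.
TEXT_COL = "rich_doc_text"
PAGE_CHAR_STARTS = "page_char_starts"


def page_index_for_offset(page_char_starts, char_offset):
    """Return the 0-based index of the page that contains char_offset.

    Hypothetical helper: assumes page_char_starts is sorted ascending.
    """
    if not page_char_starts or char_offset < page_char_starts[0]:
        raise ValueError("offset precedes the first page start")
    # bisect_right counts the page starts that are <= char_offset.
    return bisect_right(page_char_starts, char_offset) - 1


# Hypothetical row; real rows come from a Rich Doc preprocessing DAG.
row = {
    TEXT_COL: "page one\npage two\npage three",
    PAGE_CHAR_STARTS: [0, 9, 18],
}

offset = row[TEXT_COL].index("two")
page = page_index_for_offset(row[PAGE_CHAR_STARTS], offset)
```

In an environment with snorkelflow available, the same keys would presumably be referenced as RichDocCols.TEXT_COL and RichDocCols.PAGE_CHAR_STARTS rather than as literals.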

PAGE_DOCS = 'page_docs'

A list of RichDoc objects where each item corresponds to the RichDoc of a single page, in order

PDF_URL_COL = 'rich_doc_pdf_url'

The URL/file location of the PDF

PKL_COL = 'rich_doc_pkl'

Serialized version of the RichDoc object

SPAN_END_CHAR_OFFSET = 'rich_doc_span_end_char_offset'

The offset from the char_end of the last word in a span

SPAN_END_WORD_ID = 'rich_doc_span_end_word_id'

The id of the word at the end of the span, relative to all other words in the document

SPAN_NGRAM = 'rich_doc_span_ngram'

An n-gram of the words in the span; see Ngram for more

SPAN_PAGE_ID = 'rich_doc_span_page_id'

The page number that the span belongs to

SPAN_START_CHAR_OFFSET = 'rich_doc_span_start_char_offset'

The offset from the char_start of the first word in a span

SPAN_START_WORD_ID = 'rich_doc_span_start_word_id'

The id of the word at the start of the span, relative to all other words in the document
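
One possible reading of the span columns is that the word ids locate the first and last word of a span, and the character offsets adjust the boundaries within those words. The sketch below follows that reading under several assumptions that the reference does not confirm: word ids are 0-based list indices, each word carries a char_start and an exclusive char_end, and both offsets are added to the corresponding word boundary. The words list and the function span_char_range are hypothetical stand-ins, not part of the RichDoc API.

```python
# Column names copied from the RichDocCols values documented on this page.
SPAN_START_WORD_ID = "rich_doc_span_start_word_id"
SPAN_END_WORD_ID = "rich_doc_span_end_word_id"
SPAN_START_CHAR_OFFSET = "rich_doc_span_start_char_offset"
SPAN_END_CHAR_OFFSET = "rich_doc_span_end_char_offset"


def span_char_range(words, row):
    """Return (start, end) character positions of a span in the document text.

    Assumption: offsets are added to the first word's char_start and the
    last word's char_end; the real sign convention may differ.
    """
    first = words[row[SPAN_START_WORD_ID]]
    last = words[row[SPAN_END_WORD_ID]]
    start = first["char_start"] + row[SPAN_START_CHAR_OFFSET]
    end = last["char_end"] + row[SPAN_END_CHAR_OFFSET]
    return start, end


text = "Invoice total: $42.00"
# Hypothetical word table with exclusive char_end values.
words = [
    {"char_start": 0, "char_end": 7},    # "Invoice"
    {"char_start": 8, "char_end": 14},   # "total:"
    {"char_start": 15, "char_end": 21},  # "$42.00"
]

# A span covering "42.00", which starts one character into the third word.
row = {
    SPAN_START_WORD_ID: 2,
    SPAN_END_WORD_ID: 2,
    SPAN_START_CHAR_OFFSET: 1,
    SPAN_END_CHAR_OFFSET: 0,
}

start, end = span_char_range(words, row)
span_text = text[start:end]
```

If the actual convention differs (for example, an inclusive char_end), only the arithmetic inside span_char_range would need to change.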

TEXT_CLUSTERS = 'text_clusters'

A TextClusters object that captures horizontal clusters of words; requires a TextClusterer in the DAG

TEXT_CLUSTER_ID = 'text_cluster_id'

A global cluster id associated with each text cluster

TEXT_COL = 'rich_doc_text'

Extracted plain text from the PDF

TEXT_REGION_ID = 'text_region_id'

When LinesFeaturizer is added, the user can group vertical and horizontal lines into regions; this field denotes which region is associated with a span
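
Several of the columns above are described as depending on a particular operator in the DAG. A minimal sketch of a sanity check for that dependency is shown below: given the column names present in a dataset, it reports which operator-dependent columns are absent and which operator the descriptions associate with each. The mapping is taken from the attribute descriptions on this page; the function missing_optional_columns is hypothetical, and the column names are string literals so the snippet runs without snorkelflow installed.

```python
# Operator-dependent columns and the operator each description mentions.
OPTIONAL_COLUMNS = {
    "hv_lines": "LinesFeaturizer",
    "text_region_id": "LinesFeaturizer",
    "text_clusters": "TextClusterer",
    "layout_structure_bboxs": "DocumentLayoutFeaturizer",
}


def missing_optional_columns(available_columns):
    """Map each absent operator-dependent column to its associated operator."""
    available = set(available_columns)
    return {
        column: operator
        for column, operator in OPTIONAL_COLUMNS.items()
        if column not in available
    }


# Hypothetical set of columns produced by a DAG that includes a LinesFeaturizer only.
columns = ["rich_doc_text", "rich_doc", "hv_lines", "text_region_id"]
missing = missing_optional_columns(columns)
```

Here missing would name text_clusters and layout_structure_bboxs, suggesting that a TextClusterer and a DocumentLayoutFeaturizer have not been added.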