Version: 0.95

operators.pdf.table.TableFeaturizer

class operators.pdf.table.TableFeaturizer(field='rich_doc_pdf_url', model='microsoft/table-transformer-structure-recognition', pages_field=None)

A featurizer that detects tables in PDF documents.

Parameters:
  • field (str, default: 'rich_doc_pdf_url') – The name of the column containing the PDF URL paths.

  • model (str, default: 'microsoft/table-transformer-structure-recognition') – The pretrained Table Transformer model to use for table detection.

  • pages_field (Optional[str], default: None) – The name of the column containing the page numbers on which to run the operator. If None, the operator runs on all pages.
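
The parameter defaults and the pages_field behaviour described above can be sketched in plain Python. This is only an illustration, not the operator's implementation: TableFeaturizerConfig and resolve_pages are hypothetical names, the "pages" column is made up, and the 0-based page numbering and the dropping of out-of-range pages are assumptions that the reference does not state.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TableFeaturizerConfig:
    """Hypothetical container mirroring the documented constructor defaults."""
    field: str = "rich_doc_pdf_url"
    model: str = "microsoft/table-transformer-structure-recognition"
    pages_field: Optional[str] = None


def resolve_pages(row: Dict[str, Any], num_pages: int,
                  pages_field: Optional[str] = None) -> List[int]:
    """Hypothetical helper: choose which pages of one document to process.

    Assumes 0-based page numbers; out-of-range entries are dropped.
    """
    if pages_field is None:
        # No page column configured: process every page of the document.
        return list(range(num_pages))
    # Otherwise keep only the pages listed in the row's page column.
    return [p for p in row[pages_field] if 0 <= p < num_pages]


config = TableFeaturizerConfig()
row = {"rich_doc_pdf_url": "doc.pdf", "pages": [0, 2, 9]}
all_pages = resolve_pages(row, num_pages=4, pages_field=config.pages_field)
selected_pages = resolve_pages(row, num_pages=4, pages_field="pages")
```

With pages_field left at None, every page of the four-page document is selected; naming the "pages" column restricts processing to the listed pages that exist in the document.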

Returns:

A Table object containing the table metadata.

Return type:

{RichDocCols.TABLES}