Version: 25.1

operators.pdf.truncate_pdf.TruncatePDF

class operators.pdf.truncate_pdf.TruncatePDF(field, pdf_storage_dir, target_field=None, pages=5, ignore_errors=False)

Truncates a PDF to a fixed number of pages.

Truncates each PDF in a given column of PDF URLs to a fixed number of pages and writes the paths of the truncated PDFs to a new column. This works for both native and scanned PDFs.

Parameters:
  • field (str) – The field (column) containing the PDF URLs to truncate

  • pdf_storage_dir (str) – The directory in which to store the truncated PDFs

  • target_field (str, optional) – The field to write the truncated PDF paths to. Defaults to None.

  • pages (int, optional) – The number of pages to keep in each truncated PDF. Defaults to 5.

  • ignore_errors (bool, optional) – Whether to ignore errors raised while parsing the PDF documents. If True, a PDF that fails truncation is passed through: the original PDF is written to the target column instead of a truncated one. Defaults to False.
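
The sketch below illustrates how the parameters above might interact, in particular the ignore_errors fallback. It is not the TruncatePDF implementation: truncate_column and fake_truncate are hypothetical stand-ins, the row-as-dict data layout and the output file naming are assumptions, and the choice to overwrite the source field when target_field is None is a guess rather than documented behavior.

```python
import posixpath
from typing import Callable, Dict, List, Optional


def truncate_column(
    rows: List[Dict[str, str]],
    field: str,
    pdf_storage_dir: str,
    truncate_one: Callable[[str, str, int], str],
    target_field: Optional[str] = None,
    pages: int = 5,
    ignore_errors: bool = False,
) -> List[Dict[str, str]]:
    """Hypothetical stand-in that mirrors the documented parameters."""
    # Assumption (not documented): with target_field=None, overwrite the source field.
    out_field = target_field if target_field is not None else field
    result = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's rows
        try:
            new_row[out_field] = truncate_one(row[field], pdf_storage_dir, pages)
        except Exception:
            if not ignore_errors:
                raise
            # Documented fallback: keep the original PDF when truncation fails.
            new_row[out_field] = row[field]
        result.append(new_row)
    return result


def fake_truncate(url: str, storage_dir: str, pages: int) -> str:
    """Hypothetical truncation step: returns an output path without touching any PDF."""
    if "corrupt" in url:
        raise ValueError(f"cannot parse {url}")
    name = url.rsplit("/", 1)[-1]
    return posixpath.join(storage_dir, f"first{pages}_{name}")


rows = [
    {"pdf_url": "https://example.com/a.pdf"},
    {"pdf_url": "https://example.com/corrupt.pdf"},
]
out = truncate_column(
    rows,
    field="pdf_url",
    pdf_storage_dir="/tmp/truncated",
    truncate_one=fake_truncate,
    target_field="pdf_truncated",
    pages=3,
    ignore_errors=True,
)
for row in out:
    print(row["pdf_truncated"])
```

With ignore_errors=True the second row keeps its original URL in the target field; with ignore_errors=False the same input would raise on that row instead.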