operators.pdf.truncate_pdf.TruncatePDF
- class operators.pdf.truncate_pdf.TruncatePDF(field, pdf_storage_dir, target_field=None, pages=5, ignore_errors=False)
Truncates a PDF to a given number of pages.
Truncates each PDF in a given column of PDF URLs to a given number of pages and writes the paths of the truncated PDFs to a new column. This works for both native and scanned PDFs.
Parameters
Name (Type, Default): Info
- field (str, required): The field you want to truncate.
- pdf_storage_dir (str, required): The directory to store the truncated PDFs in.
- target_field (str, default None): The field you want to write the truncated PDF paths to.
- pages (int, default 5): The number of pages to truncate to.
- ignore_errors (bool, optional, default False): Whether to ignore errors when parsing the PDF documents. If True, the original PDF documents will be written to the target column even if there are errors during truncation.
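The interaction between pages, target_field, and ignore_errors can be illustrated with a small sketch. This is not the operator's implementation: truncate_column and fake_truncate are hypothetical stand-ins, rows are plain dicts rather than a real dataset, and no PDF is actually parsed. It also assumes that with ignore_errors=False a parsing error propagates to the caller, which the description above implies but does not state.

```python
from typing import Callable, Dict, List


def truncate_column(
    rows: List[Dict[str, str]],
    field: str,
    target_field: str,
    truncate_one: Callable[[str, int], str],
    pages: int = 5,
    ignore_errors: bool = False,
) -> List[Dict[str, str]]:
    """Hypothetical model of the row-wise behavior described above."""
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        try:
            new_row[target_field] = truncate_one(row[field], pages)
        except Exception:
            if not ignore_errors:
                raise  # assumption: errors surface when ignore_errors is False
            # Fall back to the original PDF, per the ignore_errors description.
            new_row[target_field] = row[field]
        out.append(new_row)
    return out


def fake_truncate(path: str, pages: int) -> str:
    # Stand-in for real truncation: fails on "broken" files, otherwise
    # returns the path a truncated copy would be written to.
    if "broken" in path:
        raise ValueError(f"cannot parse {path}")
    return f"/tmp/truncated/{pages}p_{path.rsplit('/', 1)[-1]}"


rows = [{"pdf_url": "docs/a.pdf"}, {"pdf_url": "docs/broken.pdf"}]
result = truncate_column(
    rows, "pdf_url", "truncated_pdf", fake_truncate, pages=5, ignore_errors=True
)
for row in result:
    print(row["pdf_url"], "->", row["truncated_pdf"])
```

In this sketch the first row receives the path of the truncated copy, while the second row, whose PDF cannot be parsed, keeps its original path in the target column because ignore_errors is True.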