operators.pdf.truncate_pdf.TruncatePDF
class operators.pdf.truncate_pdf.TruncatePDF(field, pdf_storage_dir, target_field=None, pages=5, ignore_errors=False)
Truncates a PDF to a given number of pages.
Truncates each PDF in a column of PDF URLs to a given number of pages and writes the paths of the truncated PDFs to a new column. This works for both native and scanned PDFs.
Parameters
Name | Type | Default | Info
field | str | (required) | The field you want to truncate.
pdf_storage_dir | str | (required) | The directory to store the truncated PDFs in.
target_field | str | None | The field you want to write the truncated PDF paths to.
pages | int | 5 | The number of pages to truncate to.
ignore_errors | bool, optional | False | Whether to ignore errors when parsing the PDF documents. If True, the original PDF documents will be written to the target column even if there are errors during truncation.
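The interaction between target_field and ignore_errors may be easier to see in code. The following is a minimal sketch of the documented behavior, not the operator's actual implementation: truncate_column and fake_truncate are hypothetical stand-ins, the rows are modeled as plain dicts, and the "truncation" only builds an output path instead of touching a real PDF.

```python
import posixpath
from typing import Callable, Dict, List


def truncate_column(
    rows: List[Dict[str, str]],
    field: str,
    pdf_storage_dir: str,
    target_field: str,
    truncate_fn: Callable[[str, str, int], str],
    pages: int = 5,
    ignore_errors: bool = False,
) -> List[Dict[str, str]]:
    """Hypothetical stand-in mirroring the documented parameters."""
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        source = row[field]
        try:
            # Write the path of the truncated PDF to the target field.
            new_row[target_field] = truncate_fn(source, pdf_storage_dir, pages)
        except Exception:
            if not ignore_errors:
                raise
            # Documented fallback: keep the original PDF in the target field.
            new_row[target_field] = source
        out.append(new_row)
    return out


def fake_truncate(source: str, storage_dir: str, pages: int) -> str:
    """Placeholder: a real version would fetch the PDF and keep `pages` pages."""
    if not source.endswith(".pdf"):
        raise ValueError(f"not a PDF: {source}")
    name = posixpath.basename(source)
    return posixpath.join(storage_dir, f"first{pages}_{name}")


rows = [
    {"pdf_url": "https://example.com/a.pdf"},
    {"pdf_url": "https://example.com/broken"},
]
result = truncate_column(
    rows,
    field="pdf_url",
    pdf_storage_dir="/tmp/truncated",
    target_field="truncated_pdf",
    truncate_fn=fake_truncate,
    pages=3,
    ignore_errors=True,
)
```

With ignore_errors=True, the second row keeps its original URL in truncated_pdf; with ignore_errors=False, the same input would raise on that row instead.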