operators.pdf.truncate_pdf.TruncatePDF
- class operators.pdf.truncate_pdf.TruncatePDF(field, pdf_storage_dir, target_field=None, pages=5, ignore_errors=False)
Truncates a PDF to a given number of pages.
Truncates each PDF in a column of PDF URLs to a given number of pages and writes the paths of the truncated PDFs to a new column. This works for both native and scanned PDFs.
- Parameters:
field (str) – The field (column) containing the PDF URLs to truncate
pdf_storage_dir (str) – The directory in which to store the truncated PDFs
target_field (str, optional) – The field to write the truncated PDF paths to. Defaults to None.
pages (int, optional) – The number of pages to truncate each PDF to. Defaults to 5.
ignore_errors (bool, optional) – Whether to ignore errors when parsing the PDF documents. If True, a PDF that fails during truncation has its original document written to the target column instead. Defaults to False.
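The fallback described for ignore_errors can be illustrated with a small standalone sketch. This is not the TruncatePDF implementation: truncate_column, fake_truncate, and the column names are hypothetical stand-ins, and the sketch assumes that with ignore_errors=False the error simply propagates.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def truncate_column(
    rows: List[Row],
    field: str,
    target_field: str,
    truncate_one: Callable[[str], str],
    ignore_errors: bool = False,
) -> List[Row]:
    """Hypothetical illustration of the ignore_errors fallback (not library code)."""
    out: List[Row] = []
    for row in rows:
        new_row = dict(row)
        try:
            # Write the path of the truncated PDF to the target column.
            new_row[target_field] = truncate_one(row[field])
        except Exception:
            if not ignore_errors:
                # Assumption: without ignore_errors the failure is raised.
                raise
            # Fall back to the original PDF reference, as the parameter docs describe.
            new_row[target_field] = row[field]
        out.append(new_row)
    return out


def fake_truncate(path: str) -> str:
    # Stand-in for real PDF truncation; fails on one input to trigger the fallback.
    if "broken" in path:
        raise ValueError("could not parse PDF")
    return path.replace(".pdf", "_truncated.pdf")


rows = [{"pdf_url": "a.pdf"}, {"pdf_url": "broken.pdf"}]
result = truncate_column(
    rows, "pdf_url", "truncated_pdf", fake_truncate, ignore_errors=True
)
```

In this sketch the first row receives a new path in the target column, while the row that fails keeps its original value there. Downstream steps therefore always find something in the target column, although it may point at an untruncated document.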