Remove headers & footers, page numbers from PDF

PjCouldBe · November 7, 2017, 8:24am

Good day!
I’m using Aspose.Pdf for Java 17.7.0 and need to remove all headers and footers from input PDFs. I’ve already tried the code snippets from this post: Removing footers. In addition, I have the following mandatory requirements:

No file system operations during PDF processing (e.g. no temp files or other external resources);
After the removal I need to extract plain text, so I should either get a com.aspose.Document object once the headers and footers are removed, or do everything within the com.aspose.pdf.Document object;
Conversion to a word-processing document (.DOCX, .ODT or others) is prohibited.

Taking the above restrictions into account, I use the following code to remove headers and footers from the PDF and extract plain text:

@Nonnull
public String extract(@Nonnull byte[] bytes) throws Exception {
    //open file
    Document pdfDocument;
    String originalText;
    try (InputStream fileInputStream = new ByteArrayInputStream(bytes)) {
        PdfContentEditor pce = new PdfContentEditor();
        pce.bindPdf(fileInputStream);
        pce.deleteStampByIds(new int[] {100, 101});  //delete headers and footers
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            pce.save(bos);
            try (ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray())) {
                pdfDocument = new Document(bis);
            }
        }
       
        // pdfDocument = new Document(fileInputStream);
    }

    com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

    // Accept the absorber for all the pages
    pdfDocument.getPages().accept(textAbsorber);

    // Get the extracted text
    originalText = textAbsorber.getText();

    // cleanup from BOM symbols
    StringUtilities strUtils = new StringUtilities();
    originalText = strUtils.removeAllUTF8BOM(originalText);

    originalText = new PdfTextNormalizer().normalizePdfText(originalText);
    return originalText;
}

I have run this code on some documents (attached) but no headers or footers were removed. Is it possible to correct this code?

Thanks!

P.S. I have attached some example documents below. All of them have headers and footers:
General Terms of Use-1.pdf (264.3 KB)
CQ 5.5 OnPremise (License Terms 2012v1)-1.pdf (354.2 KB)
Adobe Connect Hosted Terms of Service-1.pdf (402.4 KB)

imran.rafique · November 7, 2017, 6:13pm

@PjCouldBe,

You can remove the header and footer by defining the region of the page to redact. The Rect property of the page instance returns the rectangular region of the page, and we can derive the region we need from it. In the following code example, we redact the header of the first page and then extract the whole text of that page. Please also refer to these help topics: Redact certain page region with RedactionAnnotation and Extract Text from PDF

[Java]

String dataDir = "C:\\Pdf\\test454\\";
// Open document
Document document = new Document(dataDir + "General Terms of Use-1.pdf");
// define page index and region
int pageIndex = 1;
Page page = document.getPages().get_Item(pageIndex); 
Rectangle rect = new Rectangle(0, page.getRect().getHeight() * 0.95, page.getRect().getWidth(), page.getRect().getHeight());
	    
// Create RedactionAnnotation instance for specific page region
RedactionAnnotation annot = new RedactionAnnotation(page, rect);
annot.setFillColor(Color.getWhite());
annot.setBorderColor(Color.getYellow());
annot.setColor(Color.getBlue());
	    
annot.setTextAlignment(HorizontalAlignment.Center);
// Add annotation to annotations collection of first page
page.getAnnotations().add(annot);
// Flatten the annotation and redact the page contents (i.e. remove the text
// and images under the redaction annotation)
annot.redact();
	    
// Create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
// Accept the absorber for the selected page only
document.getPages().get_Item(pageIndex).accept(textAbsorber);

// Get the extracted text
String extractedText = textAbsorber.getText();

// Create a writer and open the file
java.io.FileWriter writer = new java.io.FileWriter(new java.io.File(dataDir + "Extracted_text.txt"));
// Write the extracted text to the file
writer.write(extractedText);
// Close the stream
writer.close();
document.save(dataDir + "RedactPage_out.pdf");

These are the output files: RedactPage_out.pdf (261.2 KB) and Extracted_text.zip (1.3 KB)

PjCouldBe · November 8, 2017, 5:50am

Great! Thank you for your reply!

The only thing I do not understand is why the upper rectangle has the coordinates (0, 0.95*H, W, H), where W is the page width and H is the page height. Does the page height start from the bottom of the page?

imran.rafique · November 8, 2017, 3:20pm

@PjCouldBe,

Well, the height is the total vertical length of the page, and we define the upper 5 percent of it (from 0.95*H to H) as the area to redact. The rectangle coordinates start from the bottom-left corner of the page, so y grows upwards. Kindly let us know if you need any further assistance.
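
To make the arithmetic concrete, here is a minimal sketch in plain Java (no Aspose types) that computes the header and footer bands for a page whose origin is the bottom-left corner. The class and method names (PageBands, headerBand, footerBand) are hypothetical, the 612 x 792 point page size is only an assumed example, and the 5 percent fraction is taken from the snippet above. Each result is in the order (llx, lly, urx, ury), the same order as the Rectangle constructor arguments used earlier in this thread.

```java
import java.util.Arrays;

public class PageBands {

    // Top `fraction` of the page: y runs from height * (1 - fraction) up to height,
    // because the origin is the bottom-left corner and y grows upwards.
    static double[] headerBand(double width, double height, double fraction) {
        return new double[] {0, height * (1 - fraction), width, height};
    }

    // Bottom `fraction` of the page: y runs from 0 up to height * fraction.
    static double[] footerBand(double width, double height, double fraction) {
        return new double[] {0, 0, width, height * fraction};
    }

    public static void main(String[] args) {
        // Assumed example page size (in points); a real page size would come from page.getRect().
        double width = 612, height = 792;
        System.out.println("header: " + Arrays.toString(headerBand(width, height, 0.05)));
        System.out.println("footer: " + Arrays.toString(footerBand(width, height, 0.05)));
    }
}
```

The four values of each band could then be passed to a Rectangle for a RedactionAnnotation, one for the header and one for the footer. Whether 5 percent is enough depends on the layout of the particular document.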