Optimize PDF documents and combine them in Java using Aspose.PDF

b.schalitz · January 3, 2020, 1:48pm

Our customers upload all kinds of PDFs, which must later be concatenated with other documents. We have problems with very complex PDFs that contain construction drawings (vector graphics). CPU and RAM usage rises sharply while processing these PDFs, which contain more than 1 million objects.

Therefore we decided to preprocess the documents before further processing:

Scale every page down to A4
Convert single pages with many objects into a PNG image and replace those pages
Web-optimize (linearize) the documents
Resample high-DPI images
…

I managed to implement each task on its own, but I have problems combining them.

public static void main(String[] args) {
	try {
		LicenseReader.getLizenz();
		//42344
		optimizeCSO(new File(path+"Ausbauplan_Bilder.pdf"));
	} catch (Throwable e) {
		e.printStackTrace();
	}
}

private static void optimizeCSO(File file) throws IOException {
	// Open the template document.
	Document pdfDocument = new Document(file.getCanonicalPath());
	pdfDocument.setDisplayDocTitle(true);

	ArrayList<Integer> svgPages = new ArrayList<>();
	for (Page page : pdfDocument.getPages()) {
		int contentCount = (page.getContents() != null ? page.getContents().size() : 0);
		if (contentCount > 50000) svgPages.add(page.getNumber());		
	}

	for (Page page : pdfDocument.getPages()) {
		double originalWidth = page.getPageRect(false).getWidth();
		double originalHeight = page.getPageRect(false).getHeight();
		double targetWidth = PageSize.getA4().getWidth();
		double targetHeight = PageSize.getA4().getHeight();

		// Skip the page entirely if either dimension is 0.
		if (originalWidth == 0 || originalHeight == 0) {
			continue;
		}

		// Only scale as much as necessary. Maintain aspect ratio.
		double widthScale = (originalWidth<originalHeight? targetWidth: targetHeight) / originalWidth;
		double heightScale = (originalWidth<originalHeight? targetHeight: targetWidth) / originalHeight;

		double scale = 1.0d;
		if (widthScale <= heightScale) {
			scale = widthScale;
		} else {
			scale = heightScale;
		}

		// Do not enlarge
		if (scale > 1) continue;

		// Allow minimal deviations
		if (scale > 0.90) continue;

		double destWidth = originalWidth * scale;
		double destHeight = originalHeight * scale;

		com.aspose.pdf.facades.PdfFileEditor pfe = new com.aspose.pdf.facades.PdfFileEditor();
		pfe.resizeContents(pdfDocument, new int[] {page.getNumber()}, PdfFileEditor.ContentsResizeParameters.pageResize(destWidth, destHeight));
	}
	
	//	pdfDocument.save(path +"nosave_"+file.getName());
	//	pdfDocument.close();
	//	pdfDocument = new Document(file.getCanonicalPath());
	
	for (Integer svgPageNr : svgPages) {
		// Fetch the page from the document again
		Page page = pdfDocument.getPages().get_Item(svgPageNr);

		byte[] png = pdfDocument.convertPageToPNGMemoryStream(page);

		Rectangle mediaBox = page.getMediaBox();

		double mediaHeight = mediaBox.getHeight();
		double mediaWidth = mediaBox.getWidth();

		// Remove all objects
		page.clearContents();

		double width = page.getPageInfo().getWidth();
		double height = page.getPageInfo().getHeight();

		int rot = page.getRotate();
		if (rot == Rotation.on90 || rot == Rotation.on270) {
			page.setRotate(0);
			width = page.getPageInfo().getHeight();
			height = page.getPageInfo().getWidth();	

			mediaHeight = mediaBox.getWidth();
			mediaWidth = mediaBox.getHeight();
		}

		// Re-apply the page size (width and height are swapped above for rotated pages).
		// In Aspose.Pdf, 1 inch = 72 points.
		page.setPageSize(width, height);

		double newLLX = mediaBox.getLLX();
		// We must move the page up to compensate for the changed page size
		// (the lower edge of the page is 0,0 and content is usually placed from the top of the page).
		// That's why we move the lower edge up by the difference between the old and new height.
		double newLLY = mediaBox.getLLY() + (mediaBox.getHeight() - mediaHeight);
		page.setMediaBox(new Rectangle(newLLX, newLLY, newLLX + mediaWidth, newLLY + mediaHeight));
		// Sometimes we also need to set CropBox (if it was set in original
		// file)
		page.setCropBox(new Rectangle(newLLX, newLLY, newLLX + mediaWidth, newLLY + mediaHeight));

		// load the PNG bytes into a stream (closed automatically)
		try (InputStream imageStream = new ByteArrayInputStream(png)) {
			// add image to Images collection of Page Resources
			page.getResources().getImages().add(imageStream);
		}
		// using GSave operator: this operator saves current graphics state
		page.getContents().add(new GSave());

		com.aspose.pdf.Rectangle rectangle = page.getCropBox();
		// create Rectangle and Matrix objects
		Matrix matrix = new Matrix(new double[] { rectangle.getURX() - rectangle.getLLX(), 0, 0, rectangle.getURY() - rectangle.getLLY(), rectangle.getLLX(), rectangle.getLLY() });

		// using ConcatenateMatrix (concatenate matrix) operator: defines how image must be placed
		page.getContents().add(new ConcatenateMatrix(matrix));
		com.aspose.pdf.XImage ximage = page.getResources().getImages().get_Item(page.getResources().getImages().size());
		// using Do operator: this operator draws image
		page.getContents().add(new Do(ximage.getName()));
		// using GRestore operator: this operator restores graphics state
		page.getContents().add(new GRestore());
	}

	boolean webopt = true;
	if (!pdfDocument.isLinearized() && webopt) {
		pdfDocument.optimize();
	}

	boolean resopt = true;
	if (resopt) {
		OptimizationOptions opt = new OptimizationOptions();
		// In Java 90 seems to be a good compression / quality setting.
		int jpegQuality = 100;
		if (jpegQuality > 0 && jpegQuality < 100) {
			opt.setCompressImages(true);
			opt.setImageQuality(jpegQuality);
		}

		boolean img_resample = true;
		if (img_resample) {
			opt.setResizeImages(true);
			opt.setMaxResoultion(220);
		}

		opt.setRemoveUnusedObjects(true);
		opt.setAllowReusePageContent(true);

		//Optimize resources in the document according to defined optimization strategy.
		try {
			pdfDocument.optimizeResources(opt);
		} catch (IllegalStateException e) {
			e.printStackTrace();
		}
	}
	pdfDocument.save(path +"nosave_"+file.getName());
	pdfDocument.close();
}
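
The scaling decision in the page loop above can be checked on its own, without Aspose. The following is only a minimal sketch: the class name A4Scale, the method fitToA4, and the rounded A4 size of 595 x 842 points are my assumptions for illustration (the real code takes the size from PageSize.getA4()). It returns 1.0 for every page the loop would skip.

```java
class A4Scale {
    // Assumed A4 size in PDF points (1 pt = 1/72 in), rounded for illustration.
    static final double A4_WIDTH = 595.0;
    static final double A4_HEIGHT = 842.0;

    /**
     * Returns the factor that fits a page into A4 of the same orientation
     * while keeping the aspect ratio, or 1.0 if the page should be left alone.
     */
    static double fitToA4(double width, double height) {
        // Degenerate pages are skipped, as in the loop above.
        if (width <= 0 || height <= 0) {
            return 1.0;
        }
        boolean portrait = width < height;
        double widthScale = (portrait ? A4_WIDTH : A4_HEIGHT) / width;
        double heightScale = (portrait ? A4_HEIGHT : A4_WIDTH) / height;
        // The smaller factor guarantees that both dimensions fit.
        double scale = Math.min(widthScale, heightScale);
        // Never enlarge, and tolerate deviations of up to 10 percent.
        return scale > 0.90 ? 1.0 : scale;
    }

    public static void main(String[] args) {
        // A portrait page twice the size of A4 in both dimensions.
        System.out.println(fitToA4(1190, 1684));
        // A page that is already A4 is left untouched.
        System.out.println(fitToA4(595, 842));
    }
}
```

The target size for resizeContents is then the original width and height multiplied by this factor.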

The question is why I get very different results depending on whether or not I save the PDF document in the middle of the process. How can I omit this save and still get the desired result?

example.zip (9.4 MB)

asad.ali · January 3, 2020, 10:02pm

@b.schalitz

We were able to reproduce the issue in our environment using Aspose.PDF for Java 19.12. We tried the incremental saving approach, but it did not work as expected either. Hence, we have logged an issue as PDFJAVA-39076 in our issue tracking system. We will look into it further and keep you posted on the status of its correction. Please be patient and spare us a little time.

We are sorry for the inconvenience.