Searchable PDF generated by Aspose.OCR is much larger in size

Hi Team,

We have observed that the searchable PDF is much larger than expected. We understand that a searchable PDF contains an additional text layer, but the output is roughly 10x larger than we would expect. This was acceptable while we were processing smaller TIFF, PNG, and PDF files, but once we started processing TIFF or PDF files of 50 to 400+ pages, the searchable PDF grew out of all proportion. This is a showstopper for us.

On further analysis, we found that a searchable PDF created by Adobe Acrobat is much smaller than the one we create with Aspose.PDF/OCR. It looks like Aspose is still writing PDF version 1.4, which we suspect leads to a larger file than the PDF version 1.7 file we create manually with Adobe.
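For reference, the declared version can be read straight from the first bytes of the file, since a PDF begins with a header such as `%PDF-1.4`. The following is a minimal sketch, assuming a seekable stream; `PdfHeader` and `TryReadHeaderVersion` are hypothetical names for illustration and are not part of the Aspose API. Note that a document catalog may also carry a `/Version` entry that overrides the header, which this sketch does not inspect.

```csharp
using System;
using System.IO;
using System.Text;

static class PdfHeader
{
    // Hypothetical helper (not an Aspose API): reads the version declared in
    // the file header, e.g. "1.4" for a file starting with "%PDF-1.4".
    // Returns false when the stream does not start with a PDF header.
    public static bool TryReadHeaderVersion(Stream pdf, out string version)
    {
        version = "";
        long start = pdf.Position;      // remember the caller's position
        pdf.Position = 0;               // assumes the stream is seekable

        byte[] buffer = new byte[8];    // "%PDF-" plus "x.y" is 8 bytes
        int read = pdf.Read(buffer, 0, buffer.Length);
        pdf.Position = start;           // restore the caller's position

        string header = Encoding.ASCII.GetString(buffer, 0, read);
        const string marker = "%PDF-";
        if (!header.StartsWith(marker, StringComparison.Ordinal) ||
            header.Length < marker.Length + 3)
        {
            return false;
        }

        version = header.Substring(marker.Length, 3);
        return true;
    }
}
```

Calling `PdfHeader.TryReadHeaderVersion(stream, out var v)` on both the Aspose output and the Adobe output gives a quick way to confirm the versions shown in the screenshots. The version number by itself may not account for the size difference, so comparing how the page images are encoded in each file is probably worth doing as well.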

analysis.zip (546.1 KB)

I can't send the PDF files themselves because they contain confidential data, but I have masked the data in the screenshots in analysis.zip. It contains 3 screenshots:

  1. SizeDifference.png
  2. AsposeCreatedPDF.png (version and size details of the PDF created by Aspose)
  3. AdobeCreatedFilePDF.png (version and size details of the PDF created manually with Adobe)

Can you please help us fix this size issue and shrink the PDF as much as possible while keeping acceptable readability and color balance?

Below is the code snippet we are using for PDF optimization:

	/// <summary>
	/// Optimize the Searchable PDF
	/// </summary>
	/// <param name="PdfToSearchablePdfStream"></param>
	private void OptimizePdf(MemoryStream PdfToSearchablePdfStream)
	{

		Document doc = new(PdfToSearchablePdfStream);

		GoToAction action = new(new XYZExplicitDestination(1, 0, 0, 1.5)); // Managing Zoom: 1 = 100%

		doc.OpenAction = action;

		OptimizationOptions optimizationOptions = new OptimizationOptions
		{
			LinkDuplcateStreams = true, // Property name is spelled this way in the Aspose.PDF API
			RemoveUnusedObjects = true, // This helps
			AllowReusePageContent = true,
			CompressObjects = true,
			UnembedFonts = true
		};

		optimizationOptions.ImageCompressionOptions.ResizeImages = true;
		optimizationOptions.ImageCompressionOptions.MaxResolution = _podService.GetPdfMaxResolution();  //240
		optimizationOptions.ImageCompressionOptions.CompressImages = true;
		optimizationOptions.ImageCompressionOptions.Encoding = ImageEncoding.Unchanged;
		optimizationOptions.ImageCompressionOptions.ImageQuality = _podService.GetPdfQuality(); //20
		optimizationOptions.ImageCompressionOptions.Version = Aspose.Pdf.Optimization.ImageCompressionVersion.Fast;

		foreach (var page in doc.Pages)
		{
			foreach (var annotation in page.Annotations)
			{
				annotation.Flatten();
			}

		}
		if (doc.Form.Fields.Count() > 0)
		{
			foreach (var item in doc.Form.Fields)
			{
				item.Flatten();
			}
		}

		doc.OptimizeResources(optimizationOptions);

		//doc.Optimize();

		doc.Save(PdfToSearchablePdfStream);
	}

@Gpatil

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-804

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Hi @asad.ali, is there any update on this one?

@Gpatil

We are afraid that the ticket has not yet been fully investigated. We will inform you as soon as we have an update. Please give us some time.

Hi @asad.ali

Do we have any update on this?

@Gpatil

Please try the latest release, in which PDF creation uses new logic. We hope the output will be smaller, but in any case the output PDF will contain images along with the text, so if the original PDF contained only text without images, the size will differ.