How to extract text from a PDF while ignoring hidden text?

thanhld · October 26, 2019, 3:05am

I am converting a PDF to text, but the output contains text that is not displayed in the PDF file.
I want to ignore these lines of text. How can I do that?
This is my sample PDF file:
3696-YC91_HCXHR-MITSUI_&CO(THAILAND).pdf (4.5 KB)

asad.ali · October 26, 2019, 9:35am

@thanhld

Would you kindly share the sample code snippet you are using to convert the PDF into .txt, so that we can test the scenario in our environment and address it accordingly?

thanhld · October 28, 2019, 6:53am

Thank you for your response.
I am using the sample code provided by Aspose. Here is my C# code:

using System.IO;
using System.Text;
using Aspose.Pdf.Facades;

public static void GetTextFromFile()
{
	string myDir = @"E:\Project\Example\Aspose\test-file"; // folder that contains my PDF files
	var pdfs = Directory.GetFiles(myDir, "*.pdf");
	foreach (var pdf in pdfs)
	{
		// Open input PDF
		PdfExtractor pdfExtractor = new PdfExtractor();
		pdfExtractor.TextSearchOptions.IgnoreShadowText = true;
		pdfExtractor.BindPdf(pdf);

		// Use parameterless ExtractText method
		pdfExtractor.ExtractText();

		MemoryStream tempMemoryStream = new MemoryStream();
		pdfExtractor.GetText(tempMemoryStream);

		string text = "";
		// Specify Unicode encoding type in StreamReader constructor
		using (StreamReader streamReader = new StreamReader(tempMemoryStream, Encoding.Unicode))
		{
			streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
			text = streamReader.ReadToEnd();
		}
		File.WriteAllText(Path.ChangeExtension(pdf, "txt"), text);
	}
}

asad.ali · October 28, 2019, 3:18pm

@thanhld

The sample code you provided was in C#. Would you please share the Java code you are using?

thanhld · October 29, 2019, 1:14am

Sorry, my mistake. It is C# code.

asad.ali · October 29, 2019, 12:33pm

@thanhld

We are checking it and will get back to you shortly.

asad.ali · October 30, 2019, 4:06pm

@thanhld

We have tested the scenario in our environment and did not notice any issue when using Aspose.PDF for .NET 19.10 with the following code snippet:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(dataDir + "3696-YC91_HCXHR-MITSUI_&_CO_(THAILAND) (1).pdf");
Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
File.WriteAllText(dataDir + "testPDF.txt", textAbsorber.Text);

For your reference, the output .txt file is also attached. Would you kindly look at it and see whether there is any issue with it? If you observe an anomaly, please share a screenshot with us and we will assist you further.

testPDF.zip (563 Bytes)

thanhld · October 31, 2019, 1:37am

I am sending you a screenshot in which I point out the difference between the PDF file and the output text file.
pdf-problem.png (72.6 KB)

asad.ali · October 31, 2019, 9:18am

@thanhld

Thanks for sharing the screenshot.

We can now see that the invisible text is also being extracted by the API. Would you kindly share how this text was added to the PDF? Or was it already present and made invisible by applying a white or transparent color?

thanhld · October 31, 2019, 9:36am

Here is how my users created the hidden text field. They use pdfescape.com to edit the PDF, and I don’t know exactly how it works.
Presentation1.zip (6.4 MB)

asad.ali · October 31, 2019, 5:55pm

@thanhld

We have logged an investigation ticket as PDFNET-47203 in our issue tracking system. We will investigate this scenario further and keep you posted on the status of the ticket. Please be patient and allow us a little time.

thanhld · November 1, 2019, 1:29am

Okay, thank you!