Spaces are improperly embedded inside of words when cut/pasted from document

tpirkle.meditract · June 5, 2015, 10:23am

We have programs that OCR PDF documents for our clients. After the OCR process, the text in the document is correct, both visually, and when cutting and pasting from the document.

Another process in our system uses Aspose PDF to extract the text for searching (after the OCR process above has been completed). When this is done, the text of the document has embedded spaces as shown in the attached document. We have verified that the embedded spaces come from the Aspose process. We are using Aspose PDF 10.4. Please help us resolve this.

codewarior · June 8, 2015, 4:37am

Hi Tony,

Thanks for contacting support.

I have tested the scenario and observed that, when extracting text from the PDF document, blank space characters are embedded inside various words. I have logged the problem in our issue tracking system as PDFNEWNET-38828. We will investigate this issue in detail and will keep you updated on the status of a correction. We apologize for the inconvenience.

codewarior · June 8, 2015, 5:45am

Hi Tony,

For testing purposes, I have used the following code snippet.

[C#]

// Open the document
Document pdfDocument = new Document("c:/pdftest/Doc+with+problem.pdf");
System.Text.StringBuilder builder = new System.Text.StringBuilder();
// String to hold the extracted text
string extractedText = "";
foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // Create the text device
        TextDevice textDevice = new TextDevice();
        // Set text extraction options - set text extraction mode (Raw or Pure)
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        // Convert a particular page and save its text to the stream
        textDevice.Process(pdfPage, textStream);
        // Close the memory stream (ToArray still works on a closed MemoryStream)
        textStream.Close();
        // Get the text from the memory stream
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}
// Save the extracted text to a text file
File.WriteAllText("c:/pdftest/input_Text_Extracted.txt", builder.ToString());

tt.t.zhao · February 23, 2018, 6:28am

I came across the same situation. Is there any solution to this problem?

imran.rafique · February 23, 2018, 4:01pm

@tt.t.zhao,

Kindly send all details of the scenario, including source PDF and code. We will investigate your scenario and share our findings with you.

tt.t.zhao · February 26, 2018, 5:51am

Actually, I came across exactly the same situation as the original poster, who wrote:

"We have programs that OCR PDF documents for our clients. After the OCR process, the text in the document is correct, both visually, and when cutting and pasting from the document.

Another process in our system uses Aspose PDF to extract the text for searching (after the OCR process above has been completed). When this is done, the text of the document has embedded spaces as shown in the attached document. We have verified that the embedded spaces come from the Aspose process."

We are using Aspose PDF 17.11.0.0.

I have used your testing code snippet.

[C#]
// Open the document
Document pdfDocument = new Document("c:/pdftest/Doc+with+problem.pdf");
System.Text.StringBuilder builder = new System.Text.StringBuilder();
// String to hold the extracted text
string extractedText = "";
foreach (Page pdfPage in pdfDocument.Pages)
{
    using (MemoryStream textStream = new MemoryStream())
    {
        // Create the text device
        TextDevice textDevice = new TextDevice();
        // Set text extraction options - set text extraction mode (Raw or Pure)
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        textDevice.ExtractionOptions = textExtOptions;
        // Convert a particular page and save its text to the stream
        textDevice.Process(pdfPage, textStream);
        // Close the memory stream
        textStream.Close();
        // Get the text from the memory stream
        extractedText = Encoding.Unicode.GetString(textStream.ToArray());
    }
    builder.Append(extractedText);
}
// Save the extracted text to a text file
File.WriteAllText("c:/pdftest/input_Text_Extracted.txt", builder.ToString());

imran.rafique · February 26, 2018, 1:59pm

@tt.t.zhao,

We have already logged ticket ID PDFNET-38828 in our bug tracking system. It is pending analysis and has not been resolved yet. However, we recommend that each client send us their problematic document for testing purposes, so that once the root cause is fixed, the shared scenario can also be verified. Please send us your source PDF document. We await your response.

aspose.notifier · June 23, 2018, 8:46pm

The issues you have found earlier (filed as PDFNET-38828) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by asad.ali