How can I read specially decorated text through TextFragment in a PDF?

DK1_Lee · December 17, 2023, 7:29am

I am using [Aspose.PDF for Java].

I have the following string in my PDF document:
pdf.png (14.7 KB)

If I read the above text with TextFragment.getText(), it comes out garbled, as shown below.

As suggested by Goodfellow et al.[7], the loss of the discriminator
on inputs from the generator!퐷퐺is computed as the log probability
of the generator’s output being a real English sentence:
!퐷퐺=;>6¹퐷¹퐺¹Bººº
This is instead of the log probability of the generator’s output
not being a real English sentence;>6¹1−퐷¹퐺¹Bºººto make training
the generator easier in the early stages [7]. The generator is trained

Is there a way to read it properly using TextFragment?
Or, how should I read the content of the original text (in PDF) containing these sentences and process it when creating another document?
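
Editorial note: in the broken output above, the character 퐷 (U+D437) shares its low 16 bits with the mathematical italic capital D (U+1D437), and 퐺 (U+D43A) likewise matches mathematical italic capital G (U+1D43A). This suggests, though nothing in this thread confirms it, that code points from the Mathematical Alphanumeric Symbols block are being truncated to 16 bits during extraction. Under that assumption, a post-processing step along the following lines might restore those symbols. The class and method names are hypothetical, the code is not part of Aspose.PDF, and it would corrupt any genuine Hangul text in the affected range, so it is only a sketch for documents known to contain no Korean.

```java
final class MathGlyphRepair {
    // The Mathematical Alphanumeric Symbols block spans U+1D400..U+1D7FF.
    // Truncated to 16 bits, it lands on U+D400..U+D7FF, which overlaps
    // the upper end of the Hangul Syllables block.
    private static final int TRUNCATED_LOW = 0xD400;
    private static final int TRUNCATED_HIGH = 0xD7FF;
    private static final int PLANE_1_OFFSET = 0x10000;

    // Assumption: every code point in the truncated range is a damaged math
    // symbol. Real Korean text in this range would be altered as well.
    static String repair(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        extracted.codePoints().forEach(cp -> {
            if (cp >= TRUNCATED_LOW && cp <= TRUNCATED_HIGH) {
                out.appendCodePoint(cp + PLANE_1_OFFSET);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Calling MathGlyphRepair.repair(frag.getText()) on each fragment would apply the mapping. It does not address the other damage visible above, such as ¹ and º appearing in place of parentheses or ;>6 in place of "log"; those look like a separate font-encoding problem that a simple offset cannot undo.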

asad.ali · December 17, 2023, 6:10pm

@DK1_Lee

Can you please share your sample PDF document along with the complete code snippet that you are using? We need to test the scenario in our environment and address it accordingly.

DK1_Lee · December 17, 2023, 10:47pm

Thanks for the reply.
I got this from https://arxiv.org/pdf/2111.15166v1.pdf.
Thank you.

asad.ali · December 18, 2023, 12:40am

@DK1_Lee

Thanks for sharing the sample file. Would you please also share the sample code snippet for our reference so that we can use it to replicate the issue that you are facing?

DK1_Lee · December 18, 2023, 12:46am

The code is simple (almost the same as one of your code examples).

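import com.aspose.pdf.Document;
import com.aspose.pdf.MarkupParagraph;
import com.aspose.pdf.MarkupSection;
import com.aspose.pdf.PageMarkup;
import com.aspose.pdf.ParagraphAbsorber;
import com.aspose.pdf.TextFragment;
import java.io.File;
import java.util.List;

public class ExtractParagraphs {
    public static void main(String[] args) {
        File file = new File("2111.15166v1.pdf");
        Document doc = new Document(file.getPath());

        // Instantiate ParagraphAbsorber and run it over the whole document
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.visit(doc);

        // One PageMarkup per page, each holding the sections found on it
        List<PageMarkup> mlist = absorber.getPageMarkups();
        for (int i = 0; i < mlist.size(); i++) {
            PageMarkup markup = mlist.get(i);

            List<MarkupSection> slist = markup.getSections();
            for (int j = 0; j < slist.size(); j++) {
                MarkupSection section = slist.get(j);
                List<MarkupParagraph> plist = section.getParagraphs();

                // Collect the text of every paragraph in this section
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < plist.size(); k++) {
                    MarkupParagraph paragraph = plist.get(k);
                    for (List<TextFragment> line : paragraph.getLines()) {
                        for (TextFragment frag : line) {
                            sb.append(frag.getText());
                        }
                        sb.append("\r\n");
                    }
                }
                sb.append("\r\n");
                System.out.println(sb.toString());
            }
        }
    }
}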

asad.ali · December 18, 2023, 1:54pm

@DK1_Lee

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-43399

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

DK1_Lee · January 27, 2024, 2:19pm

I’ve been stuck here for 40 days. This issue is pretty basic to extracting text, as you know, so I guess I can’t avoid it as long as I use this module.
I was expecting a quick response since this is also handled by the Public Domain module, but it seems to be taking a long time.
What’s going on?

asad.ali · January 28, 2024, 12:36am

@DK1_Lee

We apologize for the trouble this issue may have caused. The issue was investigated and traced to an internal component of the API. The text handling engine of the API needs further investigation. Please note that an issue may look basic, but its fix may involve massive changes in the existing modules of the API, and it could take a certain amount of time for us to get it fixed.

Nevertheless, we have recorded your concerns and have updated the ticket information accordingly. We will surely consider them and as soon as we make some progress towards ticket resolution, we will update you. Please spare us some time.

DK1_Lee · January 29, 2024, 11:20am

In case it helps you in your testing, here is the material I tested.
Note that the results are slightly different between the 23.11 and 23.12 versions.

Test input material: GAN-2111.15166.pdf (631.6 KB)
Test result comparison: TEST-RESULTS(PDFJAVA-43399).docx (48.1 KB)

I hope to see your results soon…

asad.ali · January 29, 2024, 7:16pm

@DK1_Lee

Sure, the ticket information has been updated accordingly. We will surely share updates with you as soon as we have some.

DK1_Lee · March 2, 2024, 3:15pm

I’m embarrassed to even ask my boss to wait more.
None of the versions released over the past 4 months (23.11 to 24.2) handles this issue (PDFJAVA-43399) correctly.
I will send you the results of my tests on the current version (24.2).

We have not made any progress on this issue and it is causing a lot of problems in our schedule.
Please don’t delay any further.

TEST-RESULTS(PDFJAVA-43399)-24.2.docx (51.3 KB)

asad.ali · March 2, 2024, 5:43pm

@DK1_Lee

Please accept our humble apologies for the delay and the inconvenience you may have been facing due to this issue. Please note that we can only prioritize issues to some extent in the free support model. Otherwise, they are resolved on a first-come, first-served basis, and the resolution time depends upon the number of issues as well as the complexity of the issue itself.

Nevertheless, the ticket has already been escalated in light of your concerns, and its investigation is underway. The nature of the issue is complex, and several sub-tasks have been created in our issue management system that need to be closed to implement the fix. Some of these sub-tasks are still open and are being worked on. As soon as we are able to ship the fix for this issue, we will inform you in this forum thread. We highly appreciate your patience and understanding in this regard.

DK1_Lee · March 3, 2024, 3:41pm

Thanks for the ticket.
In this case, if there was no bug, there would be no need for a ticket.

This is a text extraction bug ticket, not for a special add-on for our business.
Shouldn’t it be prioritized according to its content?

Note) In the file I attached above, I’ve included a little example of how it was handled correctly in another package.

asad.ali · March 3, 2024, 9:48pm

@DK1_Lee

Would you please specify that other package so that we can check from that perspective as well?

Sometimes, issues are related to specific types of PDF documents and are resolved only for those PDF files. Please note that PDF is a dynamic and massive file format in terms of its structure and complexity. Even two seemingly identical PDF documents can differ in their internal structures. The ticket was logged to investigate the PDF document and add support in the API for text extraction from this kind of PDF.

DK1_Lee · March 4, 2024, 2:37am

The other package (Maven coordinates):

groupId: com.itextpdf
artifactId: itextpdf
version: 5.5.13.3

    public static void main(String[] args) throws IOException {
        File file = new File("GAN-2111.15166.pdf");
        PdfReader reader = new PdfReader(file.getPath());
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            System.out.println(strategy.getResultantText());
        }
        reader.close();
    }

I think Aspose has more expertise in the PDF format, so if you have any recommendations, I will follow them. I know it’s hard, but every time you solve a problem, you’re moving up a level. Cheers!

asad.ali · March 4, 2024, 9:38am

@DK1_Lee

Thanks for providing the requested information and sharing your kind feedback. We have updated the ticket with the details you provided. We are afraid that we are not in a position to share any recommendations or workarounds at the moment because the ticket is still under investigation. Various API processing modules are being worked on in order to get this issue sorted out. As soon as we have updates worth mentioning, we will let you know via this forum thread. Your patience is highly appreciated in this regard.

DK1_Lee · April 25, 2024, 11:21am

I don’t know if you have the willpower to solve the problem of extracting the text I posted in this thread.
Even if it is fixed in this version, I’ve been waiting 4.5 months for it.
Hopefully it will be fixed in this version.

Also, after purchasing your package, my business has been held up by the bug, and the only thing I’ve done is report the bug.
I ask for the same treatment for this package license.
I hope you can resolve both the bug issue and the license issue.

I don’t know what to report to my boss.
I hope everything works out…

P.S. Please note that the PDF file we used to report the bug in this thread is not specially created by us, it is an example that can be easily found on the internet.

asad.ali · April 25, 2024, 6:19pm

@DK1_Lee

As shared earlier, the issue is still under investigation, and we are afraid that it is taking longer than expected because we need to investigate various modules of the API’s text extraction engine. The original task depends upon several sub-tasks which were created to address the issue internally.

We do apologize for the inconvenience and delay you have been facing. However, we are trying our best to deal with this issue and incorporate its fix. We will inform you once we have further updates about the ticket resolution. Please spare us some time.

DK1_Lee · May 2, 2024, 11:57am

Hello.
I bought the package and never got to use it, but your kindness has brought me this far, hasn’t it?
However, I don’t think apologies alone can carry this any further. I was trying to talk to you in business terms, but it seems that you are still responding only with courtesy.
It has now been about 5 months since you first asked me to keep waiting.
(I’ve said this earlier in this thread, but all you can do is ask me to wait.)

I guess there’s a line that needs to be drawn in business.
If you ask me to wait, I have to ask someone else to wait, and I feel like I’ve reached a line where I can’t do that anymore.

If you can’t come up with a business-like way to do this, I’m going to demand it.

asad.ali · May 2, 2024, 4:25pm

@DK1_Lee

We do understand your concerns and the severity of the issue for you. As we have been sharing, the issue has been under investigation. Please note that we have been working continuously on resolving the issue attached here. There are 5 sub-tasks which were created after the investigation and, out of them, 2 are already resolved.

Please note that Aspose.PDF is a massive API and has hundreds of modules with dependencies on each other. In order to resolve this issue, we have to make core changes in our text extraction engine as well as other dependent modules. These are the main reasons that the ticket is taking more time than expected. Along with this issue, we are also working on other requests and enhancements in parallel with this ticket.

Nevertheless, your concerns have been raised internally and the ticket has also been escalated to a certain level of priority. We will inform you as soon as we have more updates about the resolution ETA.