Parsing highlighted text issues

Hello Team,


We are unable to capture the correct quadrilateral points of highlighted text when the tagged content is:


· In a multi-column layout: the highlighted text does not seem to be captured in the correct sequence.


· Tightly spaced (not enough line spacing): the highlights seem to overlap, so words on the previous line also fall within the bounds of the next line. This results in duplicates.


Please see the attachment for screenshots.

Hi Ismail,


Thank you for contacting support. Your query is about the Aspose.Words API, so we are moving this thread to the relevant forum (Aspose.Words Support Forum), where a colleague can pick it up and assist you with this scenario.

Hi,


Thanks for your inquiry. To ensure a timely and accurate response, please attach the following resources here for testing:

  • Your input Word document.
  • Please share your desired output.
  • Please attach the output document that shows the undesired behavior.
  • Please create a standalone console application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you have these pieces of information ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip them and click the ‘Reply’ button, which takes you to the reply page. At the bottom of that page you can add attachments to your post by clicking the ‘Add/Update’ button.

Actually, we are using Aspose.Pdf, and the issue occurs when parsing a PDF document. The screenshots are from a PDF document.

Please see the PDF attachments.

Hi there,

Thanks for sharing the details. Your query is related to the Aspose.Pdf API, so I am moving this forum thread to the Aspose.Total forum. My colleagues will answer your query shortly.

Hi Ismail,


We would appreciate it if you could share your sample code here as well. It will help us investigate and address your issue precisely.

Best Regards,

Please see the attachment.

Hi Ismail,


Thanks for sharing your sample code. Unfortunately, I am unable to test it because some references are missing, e.g. the PDFTag namespace and the AddExtractedTag() method. I would appreciate it if you could share a working sample console project, so that we can test the scenario and guide you accordingly.

We are sorry for the inconvenience.

Best Regards,

I didn’t think you wanted completely working code (you had asked for sample code). I will put a simple console app together and send it to you by tomorrow.

Hi Ismail,


We are sorry for the confusion caused.

Please take your time to share a sample application, so that we can test the scenario in our environment.

I have found some issues in how the code was implemented by our offshore team that resulted in duplicated highlighted text. I have solved that issue, but I am running into another one: I am not able to extract just the highlighted text within a given line. Please see the code below for the implementation. I was not able to upload the console app.


using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

namespace PDFParser
{
    class Program
    {
        static void Main(string[] args)
        {
            // Apply the license before using the API
            Aspose.Pdf.License license = new Aspose.Pdf.License();
            license.SetLicense("Aspose.Pdf.lic");

            // Open document
            string pdfPath = "20 pages 20 tags perf testing.pdf";
            Document pdfDocument = new Document(pdfPath);
            Console.WriteLine("Reading " + pdfPath);

            foreach (Page page in pdfDocument.Pages)
            {
                foreach (Annotation annotation in page.Annotations)
                {
                    if (annotation.AnnotationType == AnnotationType.Highlight)
                    {
                        HighlightAnnotation highlight = (HighlightAnnotation)annotation;
                        Aspose.Pdf.Rectangle rect = highlight.Rect;

                        // Restrict text extraction to the annotation's bounding rectangle
                        TextAbsorber absorber = new TextAbsorber();
                        absorber.TextSearchOptions.LimitToPageBounds = true;
                        absorber.TextSearchOptions.Rectangle = rect;

                        page.Accept(absorber);

                        // This does not limit the result to only the highlighted text.
                        // Is there a way to ignore words that are not highlighted?
                        // Please look at the first highlighted text in the sample PDF used here:
                        // it should stop at "setting."
                        string extractedText = absorber.Text;
                    }
                }
                Console.WriteLine(".....");
            }
        }
    }
}
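
For context on why the code above over-captures: a highlight annotation's Rect is the bounding box of the whole highlight, whereas the quadrilateral points mentioned at the start of this thread describe each highlighted line run separately. A multi-line highlight that starts or ends mid-line therefore has a Rect that also covers words that are not highlighted. The following is only a rough sketch of one possible workaround, not Aspose.Pdf's own mechanism. The Pt, Word and HighlightFilter types are hypothetical and are not part of Aspose.Pdf. How word boxes and quad points are obtained from the library is not shown, and the sketch assumes horizontal, unrotated text. It keeps a word only when the center of its box falls inside one of the quads, which also avoids picking up words from a neighbouring line whose box merely overlaps the quad.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper types for illustration; NOT Aspose.Pdf classes.
public sealed class Pt
{
    public double X { get; }
    public double Y { get; }
    public Pt(double x, double y) { X = x; Y = y; }
}

public sealed class Word
{
    public string Text { get; }
    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }

    public Word(string text, double left, double bottom, double right, double top)
    {
        Text = text; Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // Center of the word's bounding box
    public Pt Center => new Pt((Left + Right) / 2, (Bottom + Top) / 2);
}

public static class HighlightFilter
{
    // Assumption: text is horizontal and unrotated, so each quad can be
    // reduced to its axis-aligned bounds. This makes the test independent
    // of the order in which the four corner points are stored.
    public static bool Contains(Pt[] quad, Pt p)
    {
        double minX = quad.Min(q => q.X), maxX = quad.Max(q => q.X);
        double minY = quad.Min(q => q.Y), maxY = quad.Max(q => q.Y);
        return p.X >= minX && p.X <= maxX && p.Y >= minY && p.Y <= maxY;
    }

    // Keeps the words whose center lies inside at least one quad,
    // preserving the order of the input sequence.
    public static List<string> WordsInside(IEnumerable<Word> words, IEnumerable<Pt[]> quads)
    {
        List<Pt[]> quadList = quads.ToList();
        return words
            .Where(w => quadList.Any(q => Contains(q, w.Center)))
            .Select(w => w.Text)
            .ToList();
    }
}
```

Testing the word's center rather than box overlap is a deliberate choice for tightly spaced lines: a word on the previous line can overlap the next line's quad by a few points, but its center normally stays outside it.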

Hi Ismail,


Thanks for sharing additional information along with the source code. I have tested your sample code with the PDF documents shared above, TwoColumnIssue.pdf and lineSpacingIssue.pdf, but was unable to find any annotations in these files. We would appreciate it if you could share your source document “20 pages 20 tags perf testing.pdf”. It will help us replicate your issue and address it precisely.

We are sorry for the inconvenience.

Best Regards,

Sorry, I forgot to attach the file yesterday. Please see the attachment now.

Hi Ismail,


Thanks for sharing the source document. I have tested the scenario and observed the reported issue, so I have logged ticket PDFNET-42363 in our issue tracking system for further investigation and rectification. We will keep you updated on the resolution progress within this forum thread.

We are sorry for the inconvenience.

Best Regards,

Hi Ismail,


Thanks for your inquiry. I am afraid your issue is still pending investigation, as we have only recently noticed it. As soon as we make significant progress towards resolving the issue, we will notify you within this forum thread.

We are sorry for the inconvenience.

Best Regards,

Can you give me a time frame? We are trying to decide whether we should go with Aspose or another product.

Hi Ismail,


Thanks for contacting support.

I am afraid that we cannot share a time frame for now, as we have only just been notified about the issue. Other issues that were logged before yours are also pending in the queue, and the relevant team is working on resolving them. I am sure they will soon plan a fix for the logged issue as per their development schedule. Your patience in this matter is greatly appreciated. Please allow us a little time.

We are sorry for the inconvenience.


Best Regards,

Hi Ismail,


Thanks for your inquiry. I am afraid that the earlier logged issue is still pending review. The product team has been busy providing fixes for high-priority reported issues. Please note that issues are resolved on a first-come, first-served basis, as we believe it is the fairest policy for everyone. Please be patient and bear with us in this matter. We will certainly share updates with you as soon as we have some.

We are sorry for the inconvenience.


Best Regards,