Text search in PDF page gets stuck

Hi, we are trying to find the coordinates of text on a PDF page. Sometimes the search gets stuck. We tried running the code block in a separate thread with a timeout, but after the timeout the thread is still stuck. It seems the Aspose code does not respond to thread interruption.
Please suggest how to kill the thread.

The issue with the following code is that the executor thread gets stuck at page.accept(). When the timeout is reached and the future is cancelled, the thread is interrupted, but it does not stop its current task and stays stuck in that method. Even after the executor is shut down and the awaitTermination period is over, the thread is not killed. After 5 minutes the executor thread moves past page.accept() and resumes the rest of the code. The issue seems similar to this one (java - future.cancel does not work - Stack Overflow).
Please suggest how page.accept() could take thread interruption into account, or alternative ways to speed up the text search. Our goal is to process the text search quickly within a defined timeout.
Can we do something like this (Set timeout on save|Documentation)?
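The behaviour described above matches how Future.cancel(true) works in general: it only sets the worker thread's interrupt flag, and a task that never checks that flag keeps running. The following is a minimal sketch of that effect, not Aspose code. The class CancelDemo and its two stand-in tasks are hypothetical; the busy loop that ignores interrupts merely plays the role of a call such as page.accept() that does not poll for interruption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class CancelDemo {

    /** Returns true if the worker thread terminated shortly after cancel(true). */
    static boolean workerStopsAfterCancel(Runnable task) {
        // Daemon thread, so a stuck worker cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cancel-demo-worker");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // sets the interrupt flag, nothing more
        } catch (ExecutionException e) {
            // The task failed on its own; nothing to cancel.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        executor.shutdownNow(); // interrupts the worker once more
        try {
            return executor.awaitTermination(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Stand-in for an uninterruptible call: spins until 'release' is set. */
    static Runnable ignoringInterrupts(AtomicBoolean release) {
        return () -> {
            while (!release.get()) {
                Thread.onSpinWait(); // never looks at the interrupt flag
            }
        };
    }

    /** Cooperative task: spins until its thread is interrupted. */
    static Runnable honoringInterrupts() {
        return () -> {
            while (!Thread.currentThread().isInterrupted()) {
                Thread.onSpinWait();
            }
        };
    }
}
```

With the first task, awaitTermination times out exactly as in the logs above; with the second, the worker exits as soon as the flag is set. Only code that cooperates with interruption can be stopped this way.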

Code -

// Runs the search on a worker thread so the caller can give up after a timeout.
public CoordinateInfo fetchCoordinatesWithTimeout(String text, Page page, Integer pageNumber) {
    ExecutorService executorService = Executors.newSingleThreadExecutor();
    Future<CoordinateInfo> coordinateInfoFuture = executorService.submit(new Callable<CoordinateInfo>() {
        public CoordinateInfo call() {
            try {
                return fetchCoordinates(text, page, pageNumber);
            } catch (Exception e) {
                logger.error("Parallel Thread - Exception While fetching coordinates for text : {} -> {}", text, e);
            }
            return null;
        }
    });

    CoordinateInfo finalCoordinateInfo = null;
    long timeout = documentAnnotationServiceProperties.getAsposeDefaultTimeout();
    logger.info("Timeout for fetch Coordinates set as {} for text {}", timeout, text);
    try {
        finalCoordinateInfo = coordinateInfoFuture.get(timeout, TimeUnit.SECONDS);
    } catch (Exception ex) {
        coordinateInfoFuture.cancel(true);
        executorService.shutdownNow();
        try {
            if (executorService.awaitTermination(10, TimeUnit.SECONDS)) {
                logger.info("Termination of thread successful");
            } else {
                logger.error("Thread not terminated");
            }
        }catch (InterruptedException e){
            logger.error("Thread interrupted");
        }
        logger.error("Error occurred while fetching coordinates due to : ", ex);
        logger.error(ExceptionUtils.getStackTrace(ex));
    }
    return finalCoordinateInfo;
}

public static CoordinateInfo fetchCoordinates(String text, Page page, Integer pageNumber) {
    // Build the search pattern for the requested text
    String rgx = someFuncTogetRegex(text);
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(rgx);
    TextSearchOptions textSearchOptions = new TextSearchOptions(true);
    textSearchOptions.setLogTextExtractionErrors(true);
    textSearchOptions.setIgnoreShadowText(true);
    textSearchOptions.setIgnoreResourceFontErrors(true);
    textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
    LOGGER.info("Before getting fragments from page at pageNumber {} and text {}", pageNumber, text);
    // thread stuck in next line
    page.accept(textFragmentAbsorber);
    LOGGER.info("After getting fragments from page at pageNumber {} and text {}", pageNumber, text);
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
    // set coordinateInfo from textFragmentCollection
    return coordinateInfo;
}
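On the question of how to kill the thread: Java offers no safe way to force-stop a thread that ignores interruption, but an operating-system process can always be killed. One possible workaround, sketched below under stated assumptions, is to run the search in a child process and destroy it when the timeout expires. The class HardTimeout and the method runWithHardTimeout are hypothetical names; the command to launch (for example a small Java main that opens the PDF and prints the coordinates) is left to the caller and is not part of any Aspose API. Redirect.DISCARD requires Java 9 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

class HardTimeout {

    /**
     * Runs 'command' as a child process.
     * Returns true if it exited within the timeout, false if it had to be killed.
     */
    static boolean runWithHardTimeout(List<String> command, long timeout, TimeUnit unit) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            if (process.waitFor(timeout, unit)) {
                return true; // finished on its own
            }
            process.destroyForcibly(); // hard kill, no cooperation needed
            process.waitFor();         // wait until the kill has taken effect
            return false;
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The trade-off is the cost of starting a process and reloading the document for each search, so this is likely to pay off only if several searches are batched per child process. A real worker would also write its results to a file or pipe that the parent reads, which this sketch omits.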

@manmohansirionlabs

Would you please share your sample PDF document along with the code snippet that you are using to fetch the text coordinates? We will test the scenario in our environment and address it accordingly.

The issue does not occur with any specific PDF. It generally shows up when a large batch of PDFs is processed sequentially.
Code used to fetch text coordinates:

package com.sirionlabs.annotation.util;

import com.aspose.pdf.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

public class test {

private static final Logger logger = LoggerFactory.getLogger(test.class);
public void fetchCoordinatesFromDocument(){
    Document document = new Document("/home/abc/annotation.pdf");
    PageCollection pages = document.getPages();
    List<TextWithPageNo> textsWithPageNos = someFunc();

    for(TextWithPageNo textsWithPageNo : textsWithPageNos){
        Page page = pages.get_Item(textsWithPageNo.pageno);
        for(String text : textsWithPageNo.text){
            fetchCoordinatesWithTimeout(text, page, textsWithPageNo.pageno);
        }

    }
}



    private CoordinateInfo fetchCoordinatesWithTimeout(String text, Page page, Integer pageNumber) {
    ExecutorService executorService = Executors.newSingleThreadExecutor();
    Future<CoordinateInfo> coordinateInfoFuture = executorService.submit(new Callable<CoordinateInfo>() {
        public CoordinateInfo call() {
            try {
                return fetchCoordinates(text, page, pageNumber);
            } catch (Exception e) {
                logger.error("Parallel Thread - Exception While fetching coordinates for text : {} -> {}", text, e);
            }
            return null;
        }
    });

    CoordinateInfo finalCoordinateInfo = null;
    long timeout = 10;
    logger.info("Timeout for fetch Coordinates set as {} for text {}", timeout, text);
    try {
        finalCoordinateInfo = coordinateInfoFuture.get(timeout, TimeUnit.SECONDS);
    } catch (Exception ex) {
        coordinateInfoFuture.cancel(true);
        executorService.shutdownNow();
        try {
            if (executorService.awaitTermination(10, TimeUnit.SECONDS)) {
                logger.info("Termination of thread successful");
            } else {
                logger.error("Thread not terminated");
            }
        }catch (InterruptedException e){
            logger.error("Thread interrupted");
        }
        logger.error("Error occurred while fetching coordinates due to : ", ex);
    }
    return finalCoordinateInfo;
}

private CoordinateInfo fetchCoordinates(String text, Page page, Integer pageNumber) {
    CoordinateInfo coordinateInfo = new CoordinateInfo();
    List<Coordinate> coordinateList = new ArrayList<>();
    String rgx = someFuncTogetRegex(text);
    if (rgx == null || rgx.isEmpty()) {
        return null;
    }
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(rgx);
    TextSearchOptions textSearchOptions = new TextSearchOptions(true);
    textSearchOptions.setLogTextExtractionErrors(true);
    textSearchOptions.setIgnoreShadowText(true);
    textSearchOptions.setIgnoreResourceFontErrors(true);
    textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
    logger.info("Before getting fragments from page at pageNumber {} and text {}", pageNumber, text);
    // thread gets stuck in this line
    page.accept(textFragmentAbsorber);
    logger.info("After getting fragments from page at pageNumber {} and text {}", pageNumber, text);
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
    if(textFragmentCollection.size()==0){
        return null;
    }
    for (TextFragment textFragment : textFragmentCollection) {
        textFragment.getSegments().forEach(textSegment -> {
            coordinateInfo.setPageNumber(pageNumber);
            Coordinate coordinate = new Coordinate();
            coordinate.setX1(textSegment.getRectangle().getLLX());
            coordinate.setX2(textSegment.getRectangle().getURX());
            coordinate.setY1(textSegment.getRectangle().getLLY());
            coordinate.setY2(textSegment.getRectangle().getURY());
            coordinateList.add(coordinate);
        });
        coordinateInfo.setCoordinateList(coordinateList);
    }
    if (coordinateInfo == null || coordinateInfo.getCoordinateList() == null || coordinateInfo.getCoordinateList().isEmpty()) {
        return null;
    }
    return coordinateInfo;
}

class TextWithPageNo {
    List<String> text;
    Integer pageno;
}

class Coordinate {
    private double x1;
    private double x2;
    private double y1;
    private double y2;

    public double getX1() {
        return x1;
    }

    public void setX1(double x1) {
        this.x1 = x1;
    }

    public double getX2() {
        return x2;
    }

    public void setX2(double x2) {
        this.x2 = x2;
    }

    public double getY1() {
        return y1;
    }

    public void setY1(double y1) {
        this.y1 = y1;
    }

    public double getY2() {
        return y2;
    }

    public void setY2(double y2) {
        this.y2 = y2;
    }
}

class CoordinateInfo implements Serializable {
    private Integer pageNumber;
    private List<Coordinate> coordinateList;

    public Integer getPageNumber() {
        return pageNumber;
    }

    public void setPageNumber(Integer pageNumber) {
        this.pageNumber = pageNumber;
    }

    public List<Coordinate> getCoordinateList() {
        return coordinateList;
    }

    public void setCoordinateList(List<Coordinate> coordinateList) {
        this.coordinateList = coordinateList;
    }
}

}

@manmohansirionlabs

We have logged an investigation ticket as PDFJAVA-42624 in our issue tracking system for further analysis of this case. We will look into the details and keep you posted on the status of the ticket's resolution. Please be patient and give us some time.

We are sorry for the inconvenience.

Hi @asad.ali,
Do you have any update on this issue?

@manmohansirionlabs

The ticket was logged in our issue tracking system recently and is pending review. We will investigate and resolve it on a first-come, first-served basis and let you know as soon as we make progress towards a fix. Please be patient and give us some time.

We are sorry for the inconvenience.