Java code to read a "Marathi" (Indian local language) PDF, store it in MySQL, and retrieve it

Hi,
I am developing a project in Java that reads data from a PDF in Marathi (an Indian local language) and formats that data, i.e. only the required fields will be stored in the database, e.g.
name of voter, address, age (we can use split() or any other String function for this).
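
As a rough illustration of the split() idea, here is a minimal sketch. The comma-separated line layout ("name, address, age") and the VoterRecord class are hypothetical; the real voter list will almost certainly need its own parsing rules.

```java
public class VoterLineParser {

    /** Hypothetical holder for the three fields mentioned above. */
    static final class VoterRecord {
        final String name;
        final String address;
        final int age;

        VoterRecord(String name, String address, int age) {
            this.name = name;
            this.address = address;
            this.age = age;
        }
    }

    /** Parses one line of the assumed form "name, address, age". */
    static VoterRecord parse(String line) {
        // Split on commas, swallowing any spaces around them
        String[] parts = line.split("\\s*,\\s*");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields but got " + parts.length);
        }
        return new VoterRecord(parts[0].trim(), parts[1].trim(),
                Integer.parseInt(parts[2].trim()));
    }

    public static void main(String[] args) {
        // Devanagari sample text written with \u escapes so the source file encoding does not matter
        VoterRecord r = parse("\u0930\u093E\u092E , \u092A\u0941\u0923\u0947 , 42");
        System.out.println(r.name + " | " + r.address + " | " + r.age);
    }
}
```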

When a user searches by name, all the information about him/her will be displayed. I tried to read the data from the PDF using UTF-8. It shows output, but not in the proper format,
i.e. some Marathi words with stray characters in between them. I want to store clean “Marathi” data in MySQL and retrieve it as well.
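
For the MySQL side, a minimal JDBC sketch could look like the one below. The table name, column sizes and command-line arguments are assumptions made for illustration; it also assumes the MySQL Connector/J driver is on the classpath. The point it shows is that both the table (utf8mb4) and the connection (characterEncoding=UTF-8) should be told to use Unicode, otherwise Devanagari text may be stored as question marks.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class VoterStore {

    // Hypothetical table layout; utf8mb4 can hold any Unicode text, including Devanagari
    static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS voter ("
            + "id INT AUTO_INCREMENT PRIMARY KEY, "
            + "name VARCHAR(200) NOT NULL, "
            + "address VARCHAR(500), "
            + "age INT"
            + ") CHARACTER SET utf8mb4";

    // Builds a Connector/J URL that asks the driver to exchange text as UTF-8
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    static void insert(Connection con, String name, String address, int age) throws SQLException {
        String sql = "INSERT INTO voter (name, address, age) VALUES (?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setString(2, address);
            ps.setInt(3, age);
            ps.executeUpdate();
        }
    }

    static void printByName(Connection con, String name) throws SQLException {
        String sql = "SELECT name, address, age FROM voter WHERE name = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " | "
                            + rs.getString("address") + " | " + rs.getInt("age"));
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 4) {
            System.out.println("usage: VoterStore <host> <database> <user> <password>");
            return;
        }
        try (Connection con = DriverManager.getConnection(jdbcUrl(args[0], args[1]), args[2], args[3]);
             Statement st = con.createStatement()) {
            st.executeUpdate(CREATE_TABLE);
            String name = "\u0930\u093E\u092E"; // sample Devanagari name
            insert(con, name, "\u092A\u0941\u0923\u0947", 42);
            printByName(con, name);
        }
    }
}
```

Using PreparedStatement parameters rather than string concatenation keeps the Marathi text out of the SQL string itself, so the driver handles the encoding.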

As an initial step, I tried the following code to display “Marathi” data in the console. After that I will store it in MySQL and then display it. But the output shows only some Marathi words and some symbols.

The project also requires a “Marathi” keyboard, i.e. the user will enter data in Marathi and will get “Marathi” output.

Note: I also changed the default encoding in Eclipse by pressing Ctrl+Enter. Encoding: UTF-8
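
On the console encoding point, a small sketch may help separate encoding problems from extraction problems. It assumes the console itself can display UTF-8; the class and method names are made up for the example. It prints through an explicitly UTF-8 PrintStream and shows that a byte round trip is only safe when the same charset is named on both sides.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8ConsoleDemo {

    // Encodes to UTF-8 and decodes with the same charset, so the text survives unchanged
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The word "Marathi" in Devanagari, written with \u escapes
        String marathi = "\u092E\u0930\u093E\u0920\u0940";

        // Wrap System.out so the bytes sent to the console are always UTF-8
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(marathi);

        // Each of these Devanagari characters takes 3 bytes in UTF-8
        out.println(marathi.getBytes(StandardCharsets.UTF_8).length); // prints 15

        // Safe round trip: the charset is named for both encoding and decoding
        out.println(roundTrip(marathi).equals(marathi)); // prints true
    }
}
```

By contrast, new String(text.getBytes("UTF-8")) decodes with the platform default charset, which on many Windows setups is not UTF-8 and can by itself produce stray symbols.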

The following is the code I tried as a first step:
---------------------------------------------------------------------------------------------------------------------
import java.io.IOException;
import java.io.PrintStream;

// iText imports
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class iTextReadDemo {
    public static void main(String[] args) {
        try {
            // Print through a UTF-8 stream so the console receives UTF-8 bytes
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            PdfReader reader = new PdfReader("D://Vikram//Workspace//Projects//Election//List.pdf");
            int pageCount = reader.getNumberOfPages();
            out.println("This PDF has " + pageCount + " pages.");
            // iText page numbers start at 1
            for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
                // Extract the text of the current page
                String page = PdfTextExtractor.getTextFromPage(reader, pageNo);
                out.println("Page Content:\n\n" + page + "\n\n");
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}






Hi Vikram,


Thanks for contacting support.

As per my understanding, your code is based on iText and not on Aspose.Pdf for Java. However, Aspose.Pdf for Java supports extracting text from PDF documents, and during my testing with Aspose.Pdf for Java 4.6.0 I noticed that the Marathi text is extracted properly. For your reference, I have also attached the file containing the extracted text.

[Java]

// Open document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/List.pdf");

// Create TextAbsorber object to extract text
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

// Accept the absorber for all the pages (page numbers start at 1)
for (int counter = 1; counter <= pdfDocument.getPages().size(); counter++) {
    pdfDocument.getPages().get_Item(counter).accept(textAbsorber);
}

// Get the extracted text
String extractedText = textAbsorber.getText();

// Print through a UTF-8 stream rather than re-decoding the bytes with the platform charset
new java.io.PrintStream(System.out, true, "UTF-8").println("Page Content:\n\n" + extractedText + "\n\n");

// Create a writer that encodes the file as UTF-8, independent of the platform default
java.io.Writer writer = new java.io.OutputStreamWriter(
        new java.io.FileOutputStream("c:/pdftest/Extracted_text.txt"), "UTF-8");

// Write the extracted text to the file
writer.write(extractedText);

// Close the stream
writer.close();

Hi, as per my discussion with Mr.Liaz on Banclie chat, I am sending you the input and output files.

Hi Vikram,

Thanks for sharing the details. We have managed to reproduce the reported issue and logged it in our bug tracking system as PDFNEWJAVA-34131 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.

We are sorry for the inconvenience caused.

Best Regards,

Hi Vikram,


Thanks for following up on live chat. I am afraid that, as we noticed your issue only recently, it is pending investigation in the queue along with other priority tasks. As soon as the investigation is completed, we will be in a good position to share an ETA with you.

We are sorry for the inconvenience caused.

Best Regards,

Hi again,
Have you found a solution for reading a Marathi PDF from Java?

Hi Vikram,


Thanks for your inquiry. I am afraid your reported issue is still not resolved due to other priority tasks. However, we have requested our development team to investigate the issue and provide an ETA at their earliest. As soon as we get feedback, we will update you via this forum thread.

Thanks for your patience and cooperation.

Best Regards,

Hi,
I am Vikram Hiraman Gore (Java Developer at Rapportsoft Consulting and Technology).

I posted a query earlier. This reply is just a reminder that I am developing a Java project which reads Marathi PDF files, stores the data in MySQL and then retrieves it. Everything works, but for composite (paired) characters it shows two or three separate half characters. How can I read a Marathi PDF cleanly? I used Aspose.jar (license period). I searched a lot on the net and found claims that there is no encoding for composite (paired) characters in some Indic Devanagari scripts. Is that so? My requirement is very urgent and the whole project is blocked because of it. Can you suggest what I should do to read a "Marathi" PDF from Java? Currently I am using "UTF-8" encoding. If it reads Marathi properly, we will buy the jar.
Thanking you in advance.
Sincerely,
Vikram Hiraman Gore.
(Rapportsoft Consulting and Technology, Pune)
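
On the composite character question: Unicode does not give most Devanagari conjuncts a single precomposed code point. A conjunct is encoded as a sequence of consonant + virama (U+094D) + consonant, and the font joins them when rendering. A hedged way to check what the extractor actually returned is to dump the code points, as in the sketch below (class and method names are invented for the example). If the sequence is correct, the "half characters" may be a display or font issue; if the dump shows mostly non-Devanagari code points, one possible cause is that the PDF font does not map its glyphs to Unicode.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    // Lists every code point of the text in U+XXXX notation
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    // Counts how many code points fall in the Unicode Devanagari block
    static long countDevanagari(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.DEVANAGARI)
                .count();
    }

    public static void main(String[] args) {
        // The conjunct "ksha": KA + VIRAMA + SSA, three code points drawn as one shape
        String ksha = "\u0915\u094D\u0937";
        System.out.println(describe(ksha));        // prints [U+0915, U+094D, U+0937]
        System.out.println(countDevanagari(ksha)); // prints 3
    }
}
```

Running describe() on a short piece of the extracted text and comparing it with the same word typed on a Marathi keyboard should show whether the two strings really differ or only look different on screen.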

Hi Vikram,


Thanks for your inquiry. We have recorded your concern and shared it with the respective team. I am afraid our development team is currently working on other priority issues, as we serve on a first come, first served basis. However, we have requested them to investigate it at their earliest and share an ETA. We will notify you as soon as we get feedback.

Thanks for your patience and cooperation.

Best Regards,