Hi,
I am developing a project in Java that reads data from a PDF in Marathi (an Indian regional language) and formats it, i.e. only the required fields will be stored in a database, e.g.
name of voter, address, age (we can use split() or any other String function for this).
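As a rough illustration of the split() idea, here is a minimal sketch. It assumes each extracted record is one line of the form "name, address, age" separated by commas; that layout, the class name VoterRecordParser and the Voter holder are hypothetical, since the real layout of the extracted text is not shown here. The sample values are written as Unicode escapes so the source compiles regardless of file encoding.

```java
public class VoterRecordParser {

    /** Simple holder for the three fields mentioned above (hypothetical type). */
    public static final class Voter {
        public final String name;
        public final String address;
        public final int age;

        Voter(String name, String address, int age) {
            this.name = name;
            this.address = address;
            this.age = age;
        }
    }

    /**
     * Parses one record. Assumption: fields are comma-separated in the
     * order name, address, age. Real PDF output may need a different rule.
     */
    public static Voter parse(String line) {
        String[] parts = line.split(",", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields but got " + parts.length);
        }
        return new Voter(parts[0].trim(), parts[1].trim(), Integer.parseInt(parts[2].trim()));
    }

    public static void main(String[] args) {
        // Sample name and address in Devanagari, written as Unicode escapes
        Voter voter = parse("\u0930\u093E\u092E, \u092A\u0941\u0923\u0947, 30");
        System.out.println(voter.name + " | " + voter.address + " | " + voter.age);
    }
}
```

Trimming each field keeps stray spaces around the separators out of the database columns.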
When a user searches by name, all information about that voter will be displayed. I tried to read data from the PDF using UTF-8. It shows output, but not in the proper format,
i.e. some Marathi words with stray characters in between them. I want to store clean “Marathi” data in MySQL and also retrieve it.
I tried the following code to display “Marathi” data in the console as an initial step. After that I will store it in MySQL and then display it. But the output shows only some Marathi words and some symbols.
The project also requires a “Marathi” keyboard, i.e. the user will enter data in Marathi and get “Marathi” output.
Note: I also changed the default encoding in Eclipse by pressing Ctrl+Enter. Encoding: UTF-8.
Following is the code I tried as a first step:
---------------------------------------------------------------------------------------------------------------------
import java.io.IOException;
import java.io.PrintStream;
// iText imports
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class iTextReadDemo {
    public static void main(String[] args) {
        try {
            // Print through a UTF-8 stream instead of re-decoding the text with the platform default charset
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            PdfReader reader = new PdfReader("D://Vikram//Workspace//Projects//Election//List.pdf");
            int pageCount = reader.getNumberOfPages();
            out.println("This PDF has " + pageCount + " pages.");
            for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
                // Extract the current page (the loop previously read page 1 every time)
                String page = PdfTextExtractor.getTextFromPage(reader, pageNo);
                out.println("Page Content:\n\n" + page + "\n\n");
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
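One pitfall worth checking when Devanagari text prints as symbols: encoding a String to UTF-8 bytes and then decoding those bytes with a different charset (for example the platform default on Windows) corrupts the text. The sketch below only illustrates that effect and one way to print through a UTF-8 stream; the class name Utf8ConsoleDemo and the helper roundTrip are hypothetical, and it does not address what the PDF library itself extracts.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8ConsoleDemo {

    /** Encodes text as UTF-8 bytes, then decodes those bytes with the given charset. */
    public static String roundTrip(String text, Charset decodeWith) {
        return new String(text.getBytes(StandardCharsets.UTF_8), decodeWith);
    }

    public static void main(String[] args) throws Exception {
        // The word "Marathi" in Devanagari, written as Unicode escapes
        String marathi = "\u092E\u0930\u093E\u0920\u0940";

        // Decoding with the same charset gives back the original text
        System.out.println(roundTrip(marathi, StandardCharsets.UTF_8).equals(marathi));

        // Decoding with a single-byte charset turns every UTF-8 byte into its own character
        System.out.println(roundTrip(marathi, StandardCharsets.ISO_8859_1).length());

        // Printing through an explicit UTF-8 stream avoids the platform default charset;
        // the console itself must also be set to display UTF-8
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(marathi);
    }
}
```

If the text is still wrong after this, the problem is more likely in the extraction step than in the console encoding.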
Hi Vikram,
// Open document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/List.pdf");
// Create TextAbsorber object to extract text
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
for(int counter =1; counter<=pdfDocument.getPages().size();counter++)
{
// Accept the absorber for all the pages
pdfDocument.getPages().get_Item(counter).accept(textAbsorber);
// Print the text absorbed so far; no byte round trip, which could corrupt it on a non-UTF-8 platform
System.out.println("Page Content:\n\n"+textAbsorber.getText()+"\n\n");
}
// Get the extracted text
String extractedText = textAbsorber.getText();
// Create a UTF-8 writer and open the file (FileWriter would use the platform default charset)
java.io.Writer writer = new java.io.OutputStreamWriter(new java.io.FileOutputStream("c:/pdftest/Extracted_text.txt"), "UTF-8");
// Write the extracted text to the file
writer.write(extractedText);
// Close the stream
writer.close();
Hi, as per my discussion with Mr.Liaz on Banclie chat, I am sending you the input and output files.
Hi Vikram,
Thanks for sharing the details. We have managed to reproduce the reported issue and logged it in our bug tracking system as PDFNEWJAVA-34131 for further investigation and resolution. We will notify you via this thread as soon as it is resolved.
We are sorry for the inconvenience caused.
Best Regards,
Hi Vikram,
Hi again…
Have you got a solution for reading a Marathi PDF from Java?
Hi Vikram,
Thanks for your inquiry. I am afraid your reported issue is still not resolved due to other priority tasks. However, we have requested our development team to investigate the issue and provide an ETA at their earliest. As soon as we get feedback, we will update you via this forum thread.
Thanks for your patience and cooperation.
Best Regards,
I am Vikram Hiraman Gore (Java Developer at Rapportsoft consulting and Technology).
I already posted a query before. This reply is only to remind you that I am developing a Java project which reads Marathi PDF files, stores the data in MySQL and then retrieves it. All is fine, but composite/paired characters show up as two or three separate (half) characters. How can I achieve clean reading of a Marathi PDF? I used Aspose.jar (license period). I searched a lot on the net and found that there is no encoding for composite (paired) characters in some Indic Devanagari scripts. Is that right? My requirement is very urgent and the whole project is stopped because of it. Can you suggest what to do for reading a "Marathi" PDF from Java? Currently I am using "UTF-8" encoding. If it reads Marathi properly then we will buy the JAR.
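On the composite-character question, some background may help narrow the problem. Unicode does not give Devanagari conjuncts a single code point of their own. A conjunct is stored as consonant + virama (U+094D) + consonant, and the font joins them when the text is displayed. UTF-8 can therefore store such text without loss, provided the extracted String holds that sequence. The sketch below only inspects the code points of a String; the class name ConjunctDemo and the helper codePoints are hypothetical, and it does not change what a PDF library extracts.

```java
public class ConjunctDemo {

    /** Lists the code points of a string as space-separated "U+XXXX" labels. */
    public static String codePoints(String text) {
        StringBuilder labels = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (labels.length() > 0) {
                labels.append(' ');
            }
            labels.append(String.format("U+%04X", cp));
        });
        return labels.toString();
    }

    public static void main(String[] args) {
        // The conjunct "ksha": KA (U+0915) + VIRAMA (U+094D) + SSA (U+0937)
        String ksha = "\u0915\u094D\u0937";
        System.out.println(codePoints(ksha)); // prints "U+0915 U+094D U+0937"
    }
}
```

Running codePoints on the extracted text shows whether the virama sequences are present. If they are, the half characters are probably a display or font issue; if they are missing or in the wrong order, the problem lies in the extraction from the PDF.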
Hi Vikram,