Cannot read conjunct words in Hindi using PDF Java packages

Hi Team,


I want to parse Hindi (an Indian language) words from a PDF in Java. I am using the trial pack of pdf java.

Please help me with conjunct words like vyavhare, sindhudurg, shambhuraj, dattatray, etc.

If this problem is resolved, I will be able to convince my company to purchase the pdf java pack.

Please find the attached pdf which I am trying to parse.

Please help me.

Please mail me on shivraj159@gmail.com
+919011075932


Hi Shivraj,


Thanks for contacting support.

Regarding your requirement to parse words, could you please share some further details, i.e. whether you are retrieving the words or trying to replace them? If possible, please share the code snippet so that we can test the scenario at our end. We apologize for the inconvenience.

As discussed, my code follows. I get output, but some Hindi conjunct words are not showing.

Please check http://103.23.150.75/searchpdf/pdf/A070/A0700015.pdf

The extracted output does not match the PDF. Please check.

package com.Vikram;

import java.sql.*;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.lowagie.text.pdf.PdfReader;
//import com.aspose and sub files after download

public class Test {

    public static void main(String[] args) throws Exception
    {
        // Load the MySQL JDBC driver; the URL suffix asks the driver to use UTF-8
        Class.forName("com.mysql.jdbc.Driver");
        String jdbc="jdbc:mysql://localhost/Hindi_test?user=root&password=root";
        String uc="&useUnicode=true&characterEncoding=UTF-8";

        // Apply the Aspose.Pdf license from the license file
        com.aspose.pdf.License license= new com.aspose.pdf.License();
        try
        {
            license.setLicense("C://rapport//loksabha//work//JARS//Aspose.lic");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }

        String fileName = "";

        for(int i = 2480005; i < 2480259 ; i++)
        {
            Connection con= DriverManager.getConnection(jdbc+uc) ;
            java.sql.PreparedStatement ps= con.prepareStatement("insert into voter values(?,?,?,?,?,?,?,?)");
            fileName = "A" +i+".pdf";
            System.out.println("reading file… "+fileName);
            com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("C://rapport//loksabha//pgms//248//"+fileName);
            com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
            // The iText reader is used only to obtain the page count
            PdfReader reader = new PdfReader("C://rapport//loksabha//pgms//248//"+fileName);
            int count = reader.getNumberOfPages();
            System.out.println(count);
            // Pages are 1-indexed; this loop stops before the last page
            for(int counter =1; counter<= (count-1);counter++)
            {
                pdfDocument.getPages().get_Item(counter).accept(textAbsorber);
                //System.out.println("\n======================================\n"+new String(textAbsorber.getText().getBytes("UTF-8"))+"\n\n ==============================================================================");
            }
            String extractedText = textAbsorber.getText();
            // Write and re-read the text as UTF-8 so the Devanagari characters
            // do not depend on the platform default charset
            java.io.Writer writer = new java.io.OutputStreamWriter(new java.io.FileOutputStream("C://Extracted_text.doc"), "UTF-8");
            writer.write(extractedText);
            writer.close();
            FileInputStream fstream = new FileInputStream("C://Extracted_text.doc");
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            String strLine,Assembly="",yadi_bhag="",Voting_place="";
            String n1[],n2[],n3[];
            String name1="",name2="",name3="",sn1="",mdd1="",nm1="",sn2="",mdd2="",nm2="",sn3="",mdd3="",nm3="";
            String ID1="",ID2="",ID3="",ID="";
            String sr="";
            String s1="",s2="",s3="";
            int l1,l2;
            int t1=0,t2=0,t3=0,flag=0,flag1=0;

            while ((strLine = br.readLine()) != null)
            {
                Pattern pattern=Pattern.compile("[a-zA-Z]");
                Matcher m=pattern.matcher(strLine);
                Pattern patternd=Pattern.compile("[0-9]");
                Matcher md=patternd.matcher(strLine);
                t1=0;
                // Header line: fills Assembly and yadi_bhag (done once per file)
                if(strLine.contains("मतदार संघ")&&flag!=1)
                {
                    int tt=strLine.indexOf("मांक");
                    tt+=6;
                    Assembly =strLine.substring(0,4);
                    System.out.print(Assembly);
                    yadi_bhag =strLine.substring(tt);
                    System.out.println(yadi_bhag);
                    flag=1;
                    //strLine=br.readLine();
                }
                // Voting place: assembled from fixed columns of this and the following lines
                if(strLine.contains("माकं व नावं")&&!(strLine.contains("लोकसभा"))&&flag1!=1)
                {
                    int tt=strLine.indexOf("नावं");
                    tt+=6;
                    Voting_place =strLine.substring(35,45);
                    if((strLine = br.readLine()) != null)
                        Voting_place+=strLine.substring(35,45);
                    strLine=br.readLine();
                    Voting_place+=strLine.substring(26);
                    strLine=br.readLine();
                    strLine=br.readLine();
                    Voting_place+=strLine.substring(0);
                    System.out.println(Voting_place);
                    flag1=1;
                }

                // Line with serial numbers and voter IDs for up to three columns
                if(m.find()
                        && !strLine.contains("गायब")
                        && !strLine.contains("Photo")
                        && !strLine.contains("Available")
                        && !strLine.contains("समािव")
                        && !strLine.contains("वगळ")
                        && !strLine.contains("मूळ यादी")
                        && !strLine.contains("एकू ण"))
                {
                    System.out.println("strLine all:"+strLine);
                    if(strLine.length()>=8)
                        s1=strLine.substring(2,8);
                    else s1=" ";
                    System.out.println("s1 all:"+s1);
                    if(strLine.length()>=52)
                        s2=strLine.substring(47,52);
                    else s2=" ";
                    if(strLine.length()>=96)
                        s3=strLine.substring(91,96);
                    else s3=" ";
                    if(strLine.length()>=28)
                        ID1=strLine.substring(10,28);
                    else if (strLine.length()>=20)
                        ID1=strLine.substring(10,20);
                    else ID1=" ";
                    if(strLine.length()>=72)
                        ID2=strLine.substring(55,72);
                    else if (strLine.length()>=65)
                        ID2=strLine.substring(55,65);
                    else ID2=" ";
                    if(strLine.length()>=116)
                        ID3=strLine.substring(99,116);
                    else if (strLine.length()>=109)
                        ID3=strLine.substring(99,109);
                    else ID3=" ";
                    System.out.println("S1 = "+s1+ " s2 ="+s2+ " s3 ="+s3);
                    System.out.println("ID1 = "+ID1+ " ID2 ="+ID2+ " ID3 ="+ID3);
                    //System.out.println(strLine);
                }

                // Line with up to three names, read from fixed columns
                if(strLine.contains("पुणर्"))
                {
                    if(strLine.length()>=38)
                        name1=strLine.substring(16,38);
                    else
                        name1=strLine.substring(16);
                    if(name1.contains(":"))
                        name1=name1.substring(1);
                    name1=name1.trim();
                    n1=name1.split(" ");
                    //System.out.println("Name1 = "+name1);
                    if(strLine.length()>=81)
                        name2=strLine.substring(60,81);
                    else if(strLine.length()>=78)
                        name2=strLine.substring(60,78);
                    else if(strLine.length()>=75)
                        name2=strLine.substring(60,75);
                    else if(strLine.length()>=71)
                        name2=strLine.substring(60,71);
                    else name2=" ";
                    if(name2.contains(":"))
                        name2=name2.substring(1);
                    name2=name2.trim();
                    if(strLine.length()>=104)
                    {
                        name3=strLine.substring(103);
                    }
                    else
                        name3=" ";
                    if(name3.contains(":"))
                        name3=name3.substring(2);
                    name3=name3.trim();
                    System.out.println(name1);
                    System.out.println(name2);
                    System.out.println(name3);
                    n1= name1.split(" ");
                    n2= name2.split(" ");
                    n3= name3.split(" ");
                    // Reset the name parts so values from an earlier line do not carry over
                    nm1=mdd1=sn1=nm2=mdd2=sn2=nm3=mdd3=sn3=" ";
                    // Split order: surname, first name, then one or two middle-name tokens
                    if(n1.length>3)
                    {
                        sn1=n1[0];
                        nm1=n1[1];
                        mdd1=n1[2];
                        mdd1+=n1[3];
                    }
                    else if(n1.length>2)
                    {
                        sn1=n1[0];
                        nm1=n1[1];
                        mdd1=n1[2];
                    }
                    else if(n1.length>1)
                    {
                        sn1=n1[0];
                        nm1=n1[1];
                    }
                    else nm1=mdd1=sn1=" ";
                    if(n2.length>3)
                    {
                        sn2=n2[0];
                        nm2=n2[1];
                        mdd2=n2[2];
                        mdd2+=n2[3];
                    }
                    else if(n2.length>2)
                    {
                        sn2=n2[0];
                        nm2=n2[1];
                        mdd2=n2[2];
                    }
                    else if(n2.length>1)
                    {
                        sn2=n2[0];
                        nm2=n2[1];
                    }
                    else nm2=mdd2=sn2=" ";
                    if(n3.length>3)
                    {
                        sn3=n3[0];
                        nm3=n3[1];
                        mdd3=n3[2];
                        mdd3+=n3[3];
                    }
                    else if(n3.length>2)
                    {
                        sn3=n3[0];
                        nm3=n3[1];
                        mdd3=n3[2];
                    }
                    else if(n3.length>1)
                    {
                        sn3=n3[0];
                        nm3=n3[1];
                    }
                    else nm3=mdd3=sn3=" ";
                    System.out.println("Name1 "+nm1 + " Middle1= "+mdd1+ " Surname1= "+sn1);
                    System.out.println("Name2 "+nm2 + " Middle2= "+mdd2+ " Surname2= "+sn2);
                    System.out.println("Name3 "+nm3 + " Middle3= "+mdd3+ " Surname3= "+sn3);

                    // Insert one row per voter column (three voters per line)
                    ps.setString(1,s1);
                    ps.setString(2,nm1);
                    ps.setString(3,mdd1);
                    ps.setString(4,sn1);
                    ps.setString(5,ID1);
                    ps.setString(6,Assembly);
                    ps.setString(7,yadi_bhag);
                    ps.setString(8,Voting_place);
                    ps.executeUpdate();
                    System.out.println("--------------Yadi bhag "+yadi_bhag);
                    System.out.println("---------------Assembly "+Assembly);
                    System.out.println("---------------Voting Place "+Voting_place);
                    ps.setString(1,s2);
                    ps.setString(2,nm2);
                    ps.setString(3,mdd2);
                    ps.setString(4,sn2);
                    ps.setString(5,ID2);
                    ps.setString(6,Assembly);
                    ps.setString(7,yadi_bhag);
                    ps.setString(8,Voting_place);
                    ps.executeUpdate();
                    ps.setString(1,s3);
                    ps.setString(2,nm3);
                    ps.setString(3,mdd3);
                    ps.setString(4,sn3);
                    ps.setString(5,ID3);
                    ps.setString(6,Assembly);
                    ps.setString(7,yadi_bhag);
                    ps.setString(8,Voting_place);
                    ps.executeUpdate();
                }
            }
            System.out.println("Successfully done … "+fileName);
            in.close();
            ps.close();
            con.close();
        }
    }
}
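
When extracted Devanagari text is written to a file and read back, as the program above does, the charset used on both sides matters, because the platform default may not be UTF-8. The following is only a minimal sketch of a UTF-8 round trip using the standard library; the class `Utf8RoundTrip`, its methods, and the Devanagari sample words are illustrative assumptions and not part of any PDF library. It does not address how the PDF library itself extracts conjuncts.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class Utf8RoundTrip {

    // Writes the text as UTF-8 bytes, independent of the platform default charset.
    static void write(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole file back, decoding the bytes as UTF-8.
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("extracted", ".txt");
        // Sample words chosen for illustration; they contain conjuncts.
        String sample = "सिंधुदुर्ग\nदत्तात्रय";
        write(tmp, sample);
        // True when the text survives the round trip unchanged.
        System.out.println(sample.equals(read(tmp)));
        Files.delete(tmp);
    }
}
```

If the round trip preserves the text but conjuncts are still missing, the loss is happening earlier, during extraction from the PDF.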

Hi Shivraj,


Thanks for sharing the code snippet.

I have tested the scenario and can reproduce the same problem. To have it corrected, I have logged it as PDFNEWJAVA-34138 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and allow us a little time. We are sorry for this inconvenience.

Team,


When will you respond to me?

It has been too long.

Please provide a solution.

Waiting for your response.

Hi Shivraj,


As we have only recently been able to reproduce this issue, the development team requires a little time to investigate and determine the cause of the problem.
Furthermore, please note that you have reported the issue under the normal/free support forum, where, as a rule, issues are resolved on a first-come, first-served basis; problems reported under the Enterprise or Priority support model take precedence in resolution over issues under the normal/free support model.

Nevertheless, as soon as we have made some definite progress towards its resolution, we would be more than happy to update you with the status of the correction.

Hi ShivRaj

I was searching for PDF generation in Hindi and other Indian languages with Aspose and found your thread describing the same issue.
I just wanted to know whether you were able to create a PDF with complex (conjunct) Hindi words,
e.g. Kranti, Durg, etc.

Could you please let me know if you got it working with Aspose PDF Java?

Thanks in advance for the help


Regards
Manish D Sharma

Hi Manish,


Thanks for your inquiry. Aspose.Pdf supports most languages. Please try Kranti and Durg with Aspose.Pdf for Java; it should work. If you find any issue, please let us know so that we can look into it and try our best to resolve it as soon as possible.

Best Regards,

Hi Team,

Normal words are displayed correctly, but ligatures are not supported.
Could you please point me to a link with example code or an API that has been tested with complex characters or ligatures in any Indian language?

Thanks
Manish
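
For readers unfamiliar with the term: in Unicode, a Devanagari conjunct such as the "kr" in Kranti is stored as consonant + virama (U+094D) + consonant, and the joined glyph is produced by font shaping at display time. The sketch below, which uses only the standard library, counts such sequences so that a source string can be compared with extracted or rendered text. The class `ConjunctCheck` and the Devanagari spellings are illustrative assumptions, and the pattern is a simplification that ignores nukta forms and ZWJ/ZWNJ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any PDF library.
final class ConjunctCheck {

    // Consonant (U+0915..U+0939) + virama (U+094D), followed by another consonant.
    // The lookahead lets each junction in a longer cluster be counted separately.
    private static final Pattern CONJUNCT =
            Pattern.compile("[\\u0915-\\u0939]\\u094D(?=[\\u0915-\\u0939])");

    // Counts consonant+virama+consonant junctions in the text.
    static int countConjuncts(String text) {
        Matcher m = CONJUNCT.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed spellings of Kranti and Durg, plus a word without conjuncts.
        String[] samples = { "क्रांति", "दुर्ग", "नाम" };
        for (String s : samples) {
            System.out.println(s + " -> " + countConjuncts(s));
        }
    }
}
```

A lower count in the extracted text than in the source suggests that viramas or consonants were dropped or reordered along the way.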

Hi Manish,


If you are using Aspose.Pdf for Java in a scenario where fonts other than the standard Acro fonts are required, you need to specify the path where the font files are located. Please try the following code lines before performing any PDF creation or manipulation activity.

[Java]

// Getting the list of standard font directories in different OS

java.util.List list = com.aspose.pdf.Document.getLocalFontPaths();

// Setting the user list for standard font directories

list.add("c:/windows/fonts/");


If you still face any problem, please share the font files and details of the scenario in which you are using the API, i.e. adding text to a PDF, adding a text stamp to a PDF, creating a PDF file from scratch, etc.
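
Before adding a directory to the font path list, it can help to confirm that the directory exists and actually contains font files. This is a minimal standard-library sketch under that assumption; the class `FontDirCheck` and its methods are hypothetical and do not call Aspose.Pdf, and the default path is the one from the snippet above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only.
final class FontDirCheck {

    // True for common font file extensions, compared case-insensitively.
    static boolean isFontFile(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".ttf") || lower.endsWith(".otf") || lower.endsWith(".ttc");
    }

    // Returns the sorted names of font files directly inside dir,
    // or an empty list when dir is missing or is not a directory.
    static List<String> listFontFiles(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(FontDirCheck::isFontFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "c:/windows/fonts/");
        List<String> fonts = listFontFiles(dir);
        System.out.println(dir + ": " + fonts.size() + " font file(s)");
        fonts.forEach(System.out::println);
    }
}
```

An empty result for the directory being added would indicate that the path, rather than the library, is the first thing to check.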

Once the PDF was created, we checked the font used (Arial Unicode MS); the font had been successfully added to the PDF and was shown in the PDF font properties.

So is it possible that, after PDF creation, the PDF font properties show the embedded font name (Arial Unicode MS) and yet the font was not available during rendering?


Hi Manish,


Thanks for sharing your feedback. I am afraid we cannot share our findings at the moment, as the issue is pending investigation. As soon as our development team completes the investigation, we will be in a good position to share our findings/ETA with you. We will keep you updated on the progress of the issue resolution via this forum thread.

Thanks for your patience and cooperation.

Best Regards,

manishdsharma:
Once the PDF is created we checked the font used (Arial Unicode MS), the font was successfully added to PDF and shown in PDF font properties.
So is it possible that post PDF creation, the PDF font Properties is showing the font name embedded (Arial Unicode MS) and still during rendering, the font was not available.

Hi Manish,

If I understand correctly, you mean that when viewing the PDF properties, the font name/information is displayed under the Fonts tab, but the contents are not rendered in the embedded font. Please note that this file information is stored in the PDF metadata, whereas displaying the contents in the respective font is the job of the PDF viewing application (Adobe Reader, etc.).