Extract Doc data (statement by statement)

MehulJSheth · November 13, 2011, 10:36am

Hi,

I want to process Word document data so that it is extracted statement by statement while its formatting is preserved.

For example,
Source Word doc:

This is demo file…

First Line
Second LIne
End of Demo file

I want to extract each statement (not each line, because a statement might span more than one line), process it, and save it to another file.
Say:

This is demo file…

Modified First Line
Modified Second LIne
End of Demo file

In this case the numbered list is preserved, and all fonts, styles and everything else should be preserved too. If this is possible, can you please give me some direction or some sample code for it?

awais.hafeez · November 13, 2011, 11:38pm

Hi Mehul,

Thanks for your inquiry.

Upon loading a Word document, Aspose.Words generates a rich DOM (Document Object Model) that lets you parse and modify the loaded document programmatically. For an overview of Aspose.Words features, I would suggest reading the following page:

https://docs.aspose.com/words/net/product-overview/

Secondly, yes: when a document with a numbered list is loaded, Aspose.Words preserves its fonts and styles. For code samples and to get started with lists, I would suggest the following links:

https://reference.aspose.com/words/net/aspose.words.lists/list/

https://reference.aspose.com/words/net/aspose.words.lists/listtemplate/

I hope this helps.

Best Regards,

MehulJSheth · November 14, 2011, 12:03am

Hi Awais,

Thanks for the reply. I have already gone through the document model generated by Aspose. The example I gave was just to explain my concern; it’s not that I only want to work with lists. Let me explain briefly.

Say I have a large Word doc, 100-200 pages long or maybe more.
I want to create a new document that is a copy of the source doc but with a few modifications: I might change the value of one of the bullets, the title of some paragraph, or something like that.

So what I need is to process the source doc statement by statement while preserving all the formatting. If I change something in a fetched statement, it should affect only the text and not the indentation or format, and I then want to save the modified statement to the new doc.

doc 1                  doc 2

my doc1                mydoc2

first                  1. first
second                 2. second
this is in italic      this is also in italic

If possible, can you please share some code that performs this kind of functionality, i.e. reading from a doc while preserving all formatting and saving to another doc with modifications?

MehulJSheth · November 14, 2011, 12:10am

Hi Awais,

To add further, the link you suggested shows how to create a list. What I want is this: when I am reading a statement from the source doc, is it possible to know whether that statement has a bullet or any other property, and if it does, to preserve it while changing only the text value?

There is a FormattedText property in the Interop API that allows copying a statement with its format, but when I change the text value it loses the format and I am not able to save it with the bullet, number, etc. I want to preserve that.

awais.hafeez · November 14, 2011, 1:25am

Hi,

Thanks for the additional information. You can load the input document from a stream and get an exact in-memory copy of it by using the following code snippet:

// Open the stream. Read only access is enough for Aspose.Words to load a document.
Stream stream = File.OpenRead(@"C:\test\demofile.docx");
// Load the entire document into memory.
Document docCopy = new Document(stream);
// You can close the stream now, it is no longer needed because the document is in memory.
stream.Close();

You can only modify the document once it is loaded into memory. Moreover, bulleted list items are represented by Paragraph nodes in the DOM. You can identify and modify the content of list items by using the following code:

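DocumentBuilder builder = new DocumentBuilder(docCopy);
// Collect every paragraph in the document, at any depth.
NodeCollection paras = docCopy.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph para in paras)
{
    // Only list items (bulleted or numbered paragraphs) are touched.
    if (para.ListFormat.IsListItem)
    {
        // The index is looked up in the first section's body, so this assumes
        // the list item is a direct child of that body.
        builder.MoveToParagraph(docCopy.FirstSection.Body.Paragraphs.IndexOf(para), 0);
        // Inserts the text at the start of the paragraph, before the existing text.
        builder.Write("Modified paragraph");
    }
}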

Also, please see the following link for a description of IsListItem property:

https://reference.aspose.com/words/net/aspose.words.lists/listformat/islistitem/
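
If the goal is to replace a list item's text rather than insert in front of it, one option is to edit the paragraph's runs directly. The following is only a sketch, not official guidance: `ListTextEditor`, `ReplaceText` and `PrefixListItems` are hypothetical names, and collapsing each paragraph to its first run (so the new text takes that run's font) is an assumption that discards any mixed formatting inside the paragraph.

```csharp
using Aspose.Words;

// Hypothetical helper, not part of Aspose.Words.
public static class ListTextEditor
{
    // Replaces a paragraph's text in place. The Paragraph node itself is kept,
    // so its list numbering, style and indentation are not touched.
    public static void ReplaceText(Paragraph para, string newText)
    {
        if (para.Runs.Count == 0)
        {
            // No run to reuse: add a new one, which has no direct font formatting.
            para.AppendChild(new Run(para.Document, newText));
            return;
        }

        // Reuse the first run so the new text keeps that run's font settings.
        para.Runs[0].Text = newText;

        // Remove the remaining runs, last to first, so the indexes stay valid.
        for (int i = para.Runs.Count - 1; i > 0; i--)
            para.Runs[i].Remove();
    }

    // Puts a prefix in front of the text of every list item in the document.
    public static void PrefixListItems(Document doc, string prefix)
    {
        // ToArray() takes a snapshot, so the loop does not depend on the live collection.
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true).ToArray())
        {
            if (!para.ListFormat.IsListItem)
                continue;

            // GetText() ends with an end-of-paragraph character; strip it.
            string oldText = para.GetText().TrimEnd('\r', '\f', '\a');
            ReplaceText(para, prefix + oldText);
        }
    }
}
```

Calling `ListTextEditor.PrefixListItems(docCopy, "Modified ")` before saving would rewrite only the list items and leave every other paragraph as loaded.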

If we can help you with anything else, please feel free to ask.

Best Regards,

MehulJSheth · November 14, 2011, 3:46am

Hi Awais,

Thanks for the post again. That helped a lot, but I still have a few doubts. As mentioned in my previous post, a source document may contain anything supported by Word. My concern is this: I can easily check whether each para is a list item using IsListItem, which is perfectly fine, but how should I check for the others, like font type and style? Everything needs to be preserved. In that case I’ll have to put an if-else for each of them.

As mentioned earlier, suppose I use para.Range.FormattedText to copy to another doc, so it preserves all formatting for that para. Now I just want to replace the para text with something else while preserving the format, but when I do para.Range.Text = "modified", the format is not preserved.

So instead of creating a clone using a stream, what I want is: open the source doc, read a statement (not a line, because a statement can be made up of multiple lines), copy it to the target doc using FormattedText, then get the modified string and replace the text just copied. So do I need to check for lists, paragraph titles and all those properties? Is there any other way to achieve this? Is there a way to find out the formatting of a sentence as well?

awais.hafeez · November 14, 2011, 6:30am

Hi,

Thanks for your request.

You don’t need to explicitly check the font types or styles yourself; Aspose.Words keeps the content styles in the output document as they were in the original document, so there should be no difference in font styles.

Also, to access the font properties of the text in a Paragraph, please see the following link and try using the Font class:

https://reference.aspose.com/words/net/aspose.words/font/
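
As a rough illustration of reading that information, the sketch below reports a paragraph's style, list status and the font settings of each of its runs. `FormatInspector` and `Describe` are hypothetical names chosen for this example, and only a handful of the available formatting properties are shown.

```csharp
using System.Text;
using Aspose.Words;

// Hypothetical helper, not part of Aspose.Words.
public static class FormatInspector
{
    // Builds a short text report of a paragraph's formatting.
    public static string Describe(Paragraph para)
    {
        var report = new StringBuilder();

        // Paragraph-level information: style name, list membership, alignment.
        report.AppendLine("style=" + para.ParagraphFormat.StyleName
            + ", listItem=" + para.ListFormat.IsListItem
            + ", alignment=" + para.ParagraphFormat.Alignment);

        // Character-level information lives on each run's Font object.
        foreach (Run run in para.Runs)
        {
            Aspose.Words.Font font = run.Font;
            report.AppendLine("  run \"" + run.Text + "\": font=" + font.Name
                + ", size=" + font.Size
                + ", bold=" + font.Bold
                + ", italic=" + font.Italic);
        }

        return report.ToString();
    }
}
```

Printing `FormatInspector.Describe(para)` for each paragraph of a loaded document gives a quick view of what formatting the DOM already carries, without any per-property if-else checks.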

Moreover, in case you are getting any formatting/styling issues, could you please create and attach here a simple application and input document that would enable us to reproduce the issue on our side?

Please let us know if you need more information. We are always glad to help you.

Best Regards,

MehulJSheth · November 16, 2011, 3:57am

Hi Awais,

Thanks for the reply. I was unable to attach a file here, so I’ve uploaded sample docs to a website. Can you please download them? They will give a clear idea of what I want.
http://www.mediafire.com/?we8j5jpj21ia5pf

There are 2 doc files in the zip file: source.doc and destination.doc. I want to know whether it is possible, using the Aspose.Words API, to get the formatting and structure information for pages, characters, tables, lines, images, etc. so that I can modify the content and produce the same structure and format in the destination doc. If your API supports that feature, then we are ready to purchase both the Word and PDF products.

If you have got the idea of what I want, can you please guide me on how to retrieve content paragraph by paragraph or statement by statement from source.doc and create destination.doc with modified text but preserved formatting? Please let me know if you need any other information.

MehulJSheth · November 16, 2011, 4:18am

Hi,

I also wanted to know which languages it supports. I mean Unicode, English, Asian languages, etc.

Thanks

awais.hafeez · November 16, 2011, 6:12am

Hi,

Thanks for the additional information.

Firstly, please note that DocumentExplorer is a very useful tool that lets you easily see the entire document structure. You can find DocumentExplorer in the folder where you installed Aspose.Words, e.g. C:\Program Files (x86)\Aspose\Aspose.Words for .NET\Demos\CSharp\DocumentExplorer\bin\DocumentExplorer.exe.

Secondly, yes: having investigated your sample docs, I can confirm that Aspose.Words supports all the features you requested in your previous post.

Moreover, to clarify further, I would like to share the following code snippet, which modifies the contents of a table cell:

Document doc = new Document(@"c:\test\Source.doc");
// Take the first table in the document and pick one of its cells.
NodeCollection tables = doc.GetChildNodes(NodeType.Table, true);
Table table = (Table)tables[0];
Cell cell = table.Rows[3].Cells[1];
// Remove the cell's existing first paragraph.
cell.Paragraphs.RemoveAt(0);
// Build a replacement run with explicit font settings.
Run run = new Run(doc);
run.Text = "Some modification I want to do";
Aspose.Words.Font font = run.Font;
font.Size = 12;
font.Bold = false;
font.Color = System.Drawing.Color.Red;
font.Name = "Verdana";
// Wrap the run in a new paragraph with explicit paragraph formatting.
Paragraph para = new Paragraph(doc);
para.Runs.Add(run);
ParagraphFormat paragraphFormat = para.ParagraphFormat;
paragraphFormat.FirstLineIndent = 8;
paragraphFormat.Alignment = ParagraphAlignment.Justify;
paragraphFormat.KeepTogether = true;
// Add the new paragraph to the cell and save the result.
cell.Paragraphs.Add(para);
doc.Save(@"c:\test\Destination.doc");

Also, Aspose.Words does support all languages.

If we can help you with anything else, please feel free to ask.

Best Regards,

MehulJSheth · November 16, 2011, 10:32pm

Hi Awais,

Thanks again for the reply. That was really helpful. I went through DocumentExplorer and it’s nice; it cleared up a few of my doubts. After going through all of this I came up with my own idea of how to go further. Please let me know if I am missing something.

1. Open the source doc.
2. Read paragraphs one by one. (But how will I get format information about each paragraph? Your previous post shows how to create a new one. I want to know the font style and size, and whether it’s a bullet or a shape (image), because both bullets and shapes come under Paragraph, as I learned from DocumentExplorer.exe. Also, how will I know their layout info? It would be really helpful if you could give a sample like the one you gave for creating a new paragraph in your earlier post.)
3. Modify the paragraph if needed, based on the requirement.
4. Write to destination.doc, applying all the formats read in step 2 so that destination.doc has the same layout.

One more thing: you said that Aspose supports all languages, but when I tried to create a new doc in another language, it showed boxes instead of text in that language.

I tried to create the same doc using Interop, where I was able to see the text in the language I used, so I guess there is no font issue; the font is already installed on the system.

Again thanks a lot for your support so far.

MehulJSheth · November 17, 2011, 12:36am

Hi Awais,

Also, please let me know how to add a shape (a normal shape like a rectangle or flow chart element, or an image) to a paragraph. I just want to copy those things from the source doc to the destination, but I couldn’t find out how to do it: how to detect that a node in the source is a shape and how to add it to the destination doc.

From DocumentExplorer I could see that images and other shapes are stored as the Shape type, so I was able to detect them. But how do I retrieve a shape from the source doc and put it into the destination doc? I don’t think we can create a Shape object in Aspose.

awais.hafeez · November 17, 2011, 1:14am

Hi,

Thanks for your inquiry. Please note that a Paragraph can contain many Runs of text, and each Run can have different formatting (font size, family). I would suggest visiting the following link for more details about the Run class:

https://reference.aspose.com/words/net/aspose.words/run/

Moreover, you can modify the text of a Paragraph by using the Text property of its Runs.

Also, please visit the following API link for details about the Shape class:

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net/aspose.words.drawing.shape.html

Also, please attach here your input document, i.e. the one with content in a different language, for testing. I will investigate the issue on my side and provide you with more information.

I hope this helps.

Best Regards,

MehulJSheth · November 17, 2011, 1:56am

Hi Awais,

Thanks for the reply. I understand that a paragraph can contain text in the form of Run.Text. My doubts are about these structures:

Paragraph -> Run -> Text
Paragraph -> Shape
Paragraph -> Shape -> Paragraph -> Run -> Text

These are the three paragraph structures I found in my source doc. My concern is this: if I encounter a shape and want to copy it to the target document without modifying anything, is that possible? For example, if there is a drawing object or an image in the source doc, I just want to copy it to the target doc unchanged.

Is copying a paragraph possible or not?
Paragraph p = new Paragraph(targetDoc);
p = sourceDoc.Sections[0].Body.Paragraphs[0];
I tried this, but it throws an exception saying the node was created from a different document.
If this is not possible, is there any way to simply copy a paragraph? In some cases I don’t want to modify paragraphs from the source doc, so I will just read them from the source doc and paste them into the target doc.

It would help me if you could give some code for this, just to get an idea.
You can refer to the source.doc and destination.doc files I sent you earlier to get an idea of the two issues mentioned above.

Regarding language translation: just to check whether it works, I created the sample code below.

string fileName = @"C:\testing.doc";
Document source = new Document(fileName);
Document tt = new Document();
tt.RemoveAllChildren();
// Create section for target doc
Section section = new Section(tt);
tt.AppendChild(section);
// Create body for target doc
Body body = new Body(tt);
section.AppendChild(body);

Paragraph p1 = new Paragraph(tt);
body.AppendChild(p1);
Run r = new Run(tt);
r.Text = "some text in other language";
p1.AppendChild(r);

Console.ReadLine();
tt.Save(@"C:\testing2.doc");

SourceFile Content:

The Service Location Protocol (SLP) is a service discovery protocol that allows computers and other devices to find services in a LAN without prior configuration. Since the current implementation of SLP is limited to LAN and is not scable to internet, we offer an implementation that provides similar functionality as that of SLP in a cloud that helps client to select the most appropriate service.

Destination File Content:

सेवा स्थान प्रोटोकॉल (SLP) एक सेवा डिस्कवरी प्रोटोकॉल है कि अनुमति देता है (Instead of this I am getting boxes, which should not happen: when I use the Interop API, it shows the same text in the language I translated to, as shown here.)

So it’s able to convert but not able to display…

One more thing: even though I am using the entire paragraph range, it returns only the first line of the paragraph.

Console.WriteLine(source.Sections[0].Body.Paragraphs[1].Range.Text);

This should return the entire paragraph, but it gives only the first line: The Service Location Protocol (SLP) is a service discovery protocol that allows

awais.hafeez · November 17, 2011, 4:39am

Hi,

Thanks for your request. You can perform repeated import of nodes by using the NodeImporter class; for more details, please visit the following link:

https://reference.aspose.com/words/net/aspose.words/nodeimporter/

Moreover, please note that each Paragraph should end with \r (ParagraphBreak) and you can view paragraph endings by enabling the ‘Display Paragraph Marks’ option in MS Word.
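
One way to check whether text that looks like a single paragraph is actually stored as several paragraphs is to print each Paragraph node separately. This is a minimal sketch under that assumption; `ParagraphDump` and `BodyParagraphTexts` are hypothetical names, and only the first section's body is inspected.

```csharp
using System.Collections.Generic;
using Aspose.Words;

// Hypothetical helper, not part of Aspose.Words.
public static class ParagraphDump
{
    // Returns the text of each paragraph in the first section's body,
    // one list entry per Paragraph node.
    public static List<string> BodyParagraphTexts(Document doc)
    {
        var texts = new List<string>();
        foreach (Paragraph para in doc.FirstSection.Body.Paragraphs)
        {
            // GetText() includes the end-of-paragraph character; '\r' and '\f' are stripped here.
            texts.Add(para.GetText().TrimEnd('\r', '\f'));
        }
        return texts;
    }
}
```

If the sentence in question shows up spread over several entries of the returned list, each visual line in the source is its own paragraph, which would explain why a single paragraph's Range.Text returns only part of it.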

Please let us know if you need more information. We are always glad to help you.

Best Regards,

MehulJSheth · November 17, 2011, 5:25am

Hi Awais,

Thanks a lot for the reply. NodeImporter helped a lot. Can you please elaborate on the paragraph break?

I have written some code that tries to copy only those paragraphs that contain a shape, but I am not getting the correct output. I guess the shape is overlapping with the previous one, and that this is because of the paragraph break. Can you please look into it?

for (int i = 0; i < source.Sections.Count; i++)
{

    for (int l = 0; l < source.Sections[i].Body.ChildNodes.Count; l++)
    {
        Node n = source.Sections[i].Body.ChildNodes[l];

        if (n.NodeType.ToString().Equals("Paragraph"))
        {
            Paragraph p = (Paragraph)n;
            if (p.HasChildNodes)
            {
                for (int m = 0; m < p.ChildNodes.Count; m++)
                {
                    if (p.ChildNodes[m].NodeType.ToString().Equals("Shape"))
                    {
                        Paragraph para = new Paragraph(tt);
                        body.AppendChild(para);
                        NodeImporter ni = new NodeImporter(source, tt, ImportFormatMode.KeepSourceFormatting);
                        Node ins = ni.ImportNode(p.ChildNodes[m], true);
                        para.InsertAfter(ins, null);
                        Console.WriteLine("Child:" + p.ChildNodes[m].NodeType);
                    }
                }
            }

        }
        else if (n.NodeType.ToString().Equals("Table"))
        {
            Console.WriteLine(l + ":Table");
        }
    }
}

Thanks for the help.

awais.hafeez · November 17, 2011, 7:15am

Hi,

Thanks for your request. I would suggest the following code snippet, which worked for me, to export to the target document only those paragraphs of the source document that contain Shape nodes:

Document source = new Document(@"c:\test\source.doc");
Document destination = new Document();
// One importer is reused for every node copied from source to destination.
NodeImporter ni = new NodeImporter(source, destination, ImportFormatMode.KeepSourceFormatting);
Body destBody = destination.FirstSection.Body;
NodeCollection paragraphs = source.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph paragraph in paragraphs)
{
    // Copy only paragraphs that contain at least one Shape node.
    if (paragraph.GetChildNodes(NodeType.Shape, true).Count > 0)
    {
        // Import the paragraph with all its children, then append it to the target body.
        Node node = ni.ImportNode(paragraph, true);
        destBody.AppendChild(node);
    }
}
destination.Save(@"c:\test\destination.doc");

Moreover, to get a better understanding of Paragraphs, I would suggest visiting the following API link:

https://reference.aspose.com/words/net/aspose.words/paragraph/

I hope this helps.

Best Regards,

MehulJSheth · November 17, 2011, 10:20pm

Hi Awais,

Thanks a lot for your reply. That helped me a lot. I will go through the link you provided and start working right away. If you don’t mind, I’ll keep posting queries as I come across them.

Thanks,
Mehul

awais.hafeez · November 17, 2011, 11:14pm

Hi Mehul,

Thank you for considering Aspose.Words. You are welcome to ask as many questions as you need. Please let us know if you need more assistance; we will be glad to help you.

Best Regards,

MehulJSheth · November 18, 2011, 12:05am

Hi Awais,

Thanks for the help. I am stuck on another issue. As I mentioned earlier, I am interested in working only with the text from the source doc and don’t want to deal with formatting. My source doc is not fixed; it can change, so I have to make my code general enough to fit all kinds of docs.

Each time I fetch text from the source doc, I don’t want its formatting to be disturbed. For example, suppose I have a table with 4 rows and 2 columns that has a background color, font styles and so on, and also some paragraphs in different colors. I just want to work with the text while preserving the format, background color information and everything else.

One way is to read the formatting info for every node (whether the text is bold, italic or underlined, whether a cell has a background color, etc.) and then apply the same format when writing to the destination doc. But I would need to check for every possible kind of formatting, as I don’t know what my source doc may contain.

Is there any other way to achieve this? Maybe something like
dest_cell.format = source_cell.format
dest_para.format = source_para.format
dest_run.font = source_run.font
Something similar to NodeImporter, but with an option to change the text of the node.
The node may be a shape containing text, a table with text, or something else.

Another way I am thinking of: copy all nodes one by one to the destination doc using NodeImporter, and just replace run.Text for each node once it’s appended to the body of destination.doc.

That would take care of all the formatting, font size, style, etc. and allow me to work only on the text, so I wouldn’t need to check for all those formatting details.
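
A rough sketch of that last idea, for illustration only: rather than copying node by node, it loads the source document, rewrites the text of every Run in place, and saves the result under a new name. `TextOnlyRewrite`, `RewriteRuns` and `CopyWithModifiedText` are hypothetical names. The sketch assumes each piece of text can be transformed run by run, which does not hold when a statement is split across several runs, and it also touches runs that hold field codes, so a real implementation would need to handle both cases.

```csharp
using System;
using Aspose.Words;

// Hypothetical helper, not part of Aspose.Words.
public static class TextOnlyRewrite
{
    // Applies 'transform' to the text of every run in the document, including
    // runs inside tables and shapes. Formatting stays on the nodes and is not touched.
    public static void RewriteRuns(Document doc, Func<string, string> transform)
    {
        // ToArray() takes a snapshot, so the loop does not depend on the live collection.
        foreach (Run run in doc.GetChildNodes(NodeType.Run, true).ToArray())
            run.Text = transform(run.Text);
    }

    // Loads the source file, rewrites its text, and saves it as the destination file.
    public static void CopyWithModifiedText(string sourcePath, string destinationPath,
                                            Func<string, string> transform)
    {
        Document doc = new Document(sourcePath);
        RewriteRuns(doc, transform);
        doc.Save(destinationPath);
    }
}
```

For example, `TextOnlyRewrite.CopyWithModifiedText(@"c:\test\source.doc", @"c:\test\destination.doc", text => text.ToUpperInvariant())` would produce a destination file with upper-cased text; the paths are the ones used earlier in this thread and the transform is a placeholder.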