Split a DOCX document and convert the parts to PDF

dlalwani · November 13, 2013, 9:40pm

I have a Word document (DOCX) with some data on each page. I want to split this document based on some criteria (the value of a particular field in a line) and convert the parts to PDF files.
Is this possible using Aspose? If yes, please advise how.

tahir.manzoor · November 15, 2013, 2:27am

Hi there,

Thanks for your inquiry. Please share the following details for investigation purposes:

Please attach your input Word document.
Please share the criteria by which you want to split the document.
Please attach your target Word document showing the desired behavior. You can use Microsoft Word to create it. I will investigate how you expect your final document to be generated.

As soon as you get these pieces of information to us, we will provide more information about your query along with code.

dlalwani · November 17, 2013, 5:05pm

Hi Tahir,

Please see the attached Word file. This is the source file.

You can see the “Fleet No” field value in this file. I need to split this file into one PDF file for each “Fleet No”.

The attached PDF file is the output for one fleet, 12291. I need to be able to split the file the same way for every “Fleet No” in it.

This would need to run in a Citrix environment on Windows Server 2008 R2.

Let me know if this is feasible using Aspose.

Thanks
Dhiraj

tahir.manzoor · November 18, 2013, 5:30am

Hi Dhiraj,

Thanks for your inquiry. In your case, you need to find the pages on which the fleet number appears. Once you have the page numbers, please use the PageFinder utility to extract those pages from the document.

The Aspose.Words.Layout namespace provides classes that give access to information such as which page a particular document element is on, and where on that page it is positioned, once the document has been formatted into pages. Please see the details here:
https://reference.aspose.com/words/net/aspose.words.layout/layoutcollector/
https://reference.aspose.com/words/net/aspose.words.layout/layoutenumerator/

I have attached the PageFinder code with this post. Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "ShieldInsurance.docx");
LayoutCollector layoutCollector = new LayoutCollector(doc);
LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
PageNumberFinder pageFinder = new PageNumberFinder(doc);
string fleetno = "12291";
var collection = doc.GetChildNodes(NodeType.Paragraph, true);
int pageStart = 0, pageEnd = 0;
Boolean blnFirst = true;
foreach (Paragraph para in collection)
{
    if (para.ToString(SaveFormat.Text).Trim().Contains(fleetno))
    {
        var renderObject = layoutCollector.GetEntity(para);
        layoutEnumerator.Current = renderObject;
        // RectangleF location = layoutEnumerator.Rectangle;
        int page = layoutEnumerator.PageIndex;

        if (blnFirst)
        {
            pageStart = page;
            blnFirst = false;
        }
        pageEnd = page;
    }
}
// Set up the document which pages will be copied to. Remove the empty section. 
Document dstDoc = new Document();
PageNumberFinder finder = new PageNumberFinder(doc);
// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);
// Remove any nodes on pages that are outside our desired page range.
ArrayList sectionsToRemove = new ArrayList();
sectionsToRemove.AddRange(finder.RetrieveAllNodesOnPages(pageEnd + 1, doc.PageCount + 1, NodeType.Section));
sectionsToRemove.AddRange(finder.RetrieveAllNodesOnPages(0, pageStart, NodeType.Section));
foreach (Section section in sectionsToRemove)
    section.Remove();
doc.Save(MyDir + "Out.docx");

dlalwani · November 18, 2013, 10:33pm

Hi Tahir,
Thanks for your assistance on this.
I tried adding the given code, but it is not producing output containing only the pages for that fleet.
I tried three fleets, 12291, 12296 and 22309, and got the same file in all cases…
Attached are the generated files. I realized that you are creating “dstDoc” but it is never used. Could that be the issue?
The code I used to save as PDF is something like this:

doc.Save(MyDir + "Out_" + fleetno + ".pdf", SaveFormat.Pdf);

Do you know how to get only the pages for the specific fleet into the output file? Thanks again for your help.
Regards
Dhiraj

tahir.manzoor · November 19, 2013, 8:19am

Hi Dhiraj,

Thanks for your inquiry. In case you are using an older version of Aspose.Words, I would suggest you please upgrade to the latest version (v13.10.0) from here:
https://releases.aspose.com/words/net

In your case, I suggest you please use the PageSplitter to achieve your requirement. Please check the PageSplitter sample from the offline samples pack.

Please see the following highlighted modified code with PageSplitter. I have attached the PageSplitter code and output Pdf files with this post for your kind reference. Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "ShieldInsurance.docx");
LayoutCollector layoutCollector = new LayoutCollector(doc);
LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
PageNumberFinder pageFinder = new PageNumberFinder(doc);
string fleetno = "22309";
var collection = doc.GetChildNodes(NodeType.Paragraph, true);
int pageStart = 0, pageEnd = 0;
Boolean blnFirst = true;
foreach (Paragraph para in collection)
{
    if (para.ToString(SaveFormat.Text).Trim().Contains(fleetno))
    {
        var renderObject = layoutCollector.GetEntity(para);
        layoutEnumerator.Current = renderObject;

        int page = layoutEnumerator.PageIndex;
        if (blnFirst)
        {
            pageStart = page;
            blnFirst = false;
        }
        pageEnd = page;
    }
}
// This will build layout model and collect necessary information.
doc.UpdatePageLayout();
// Split nodes in the document into separate pages.
PageSplitter.DocumentPageSplitter splitter = new PageSplitter.DocumentPageSplitter(layoutCollector);
Document finalDoc = splitter.GetDocumentOfPageRange(pageStart, pageEnd);
finalDoc.Save(MyDir + "22309.pdf");

dlalwani · November 20, 2013, 9:55pm

Hi Tahir,

Thanks for your assistance so far.

I have tried the code you gave. I need to split the file for all the fleets in it.

When I deploy this code on our Citrix server, it takes more than 2 hours and is still running. I could see that some fleets took more than 20 minutes to search and save, e.g. fleet 23401.

The process has not reported an error, but it has been running for hours.

I have attached the actual log file (PARIS_2013112111270_TCLDKL.html.txt; this is actually an HTML file, so remove the .txt extension to see its content) and the code function file (buildShieldPDFsNew.docx).

Can you please advise whether something can be done to make it faster? Is there a memory issue when looping over big chunks, given that the first few files saved very quickly? Or do I need to release the local variables created inside the loop, and if so, how?

Thanks
Dhiraj

dlalwani · November 20, 2013, 9:57pm

Hi Tahir,

Attaching the code function file (buildShieldPDFsNew.docx) here; I missed it in my last post.

Thanks
Dhiraj

dlalwani · November 20, 2013, 11:22pm

Hi Tahir,

One more question: what if one fleet appears in several separate places, e.g. some fleets I found on pages 10, 14 and 20 (i.e. not contiguous)? How will the splitter work in that case?

Thanks in advance for your help.

Regards
Dhiraj

dlalwani · November 21, 2013, 12:07am

Hi Tahir,
After adding some logging, I found that doc.UpdatePageLayout() takes around 10 seconds for each fleet (when I tried with just 14 fleets). Is there any workaround for this, or any way to make it faster?
Thanks Again
Dhiraj

tahir.manzoor · November 22, 2013, 12:20am

Hi Dhiraj,

Please accept my apologies for the late response.

*dlalwani:

When I deploy this code on our Citrix server, it takes more than 2 hours and is still running. I could see that some fleets took more than 20 minutes to search and save, e.g. fleet 23401.*

I tested the scenario on Windows 7 for fleet 23401 and could not reproduce the issue. If you are using an older version of Aspose.Words, please upgrade to the latest version (v13.10.0) from https://releases.aspose.com/words/net and let us know how it goes on your side.

*dlalwani:

One more question: what if one fleet appears in several separate places, e.g. some fleets I found on pages 10, 14 and 20 (i.e. not contiguous)? How will the splitter work in that case?*

In this case, get the page numbers as shown in the previous code example and save each page as a separate document. Once you have a separate document for each page (e.g. 10, 14 and 20), join them using the Document.AppendDocument method.
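
As a rough illustration of the first step (collecting the page numbers before extracting anything), the sketch below removes repeated page numbers and groups the rest into contiguous ranges. It is plain C# with no Aspose.Words dependency; PageRanges and Group are hypothetical names introduced here, not part of any Aspose API. Each resulting range could then be handed to the GetDocumentOfPageRange method of the PageSplitter sample used earlier in this thread, and the pieces joined with AppendDocument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of Aspose.Words): groups page numbers into
// contiguous ranges so that each range can be extracted with a single call.
public static class PageRanges
{
    public static List<(int Start, int End)> Group(IEnumerable<int> pages)
    {
        var ranges = new List<(int Start, int End)>();

        // Distinct() drops repeats (the search text may occur several times
        // on one page); OrderBy() puts neighbouring pages next to each other.
        foreach (int page in pages.Distinct().OrderBy(p => p))
        {
            if (ranges.Count > 0 && ranges[ranges.Count - 1].End == page - 1)
            {
                // This page directly follows the current range: extend it.
                var last = ranges[ranges.Count - 1];
                ranges[ranges.Count - 1] = (last.Start, page);
            }
            else
            {
                // Gap before this page: start a new single-page range.
                ranges.Add((page, page));
            }
        }
        return ranges;
    }

    public static void Main()
    {
        // Example page numbers, with page 3 found twice.
        foreach (var r in Group(new[] { 2, 3, 3, 4, 18, 216 }))
            Console.WriteLine(r.Start + "-" + r.End);
        // prints 2-4, 18-18 and 216-216 on separate lines
    }
}
```

Grouping first means a page is never appended twice, and a run of adjacent pages needs only one extraction call instead of one per page.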

*dlalwani:

After adding some logging, I found that doc.UpdatePageLayout() takes around 10 seconds for each fleet (when I tried with just 14 fleets). Is there any workaround for this, or any way to make it faster?

Can you please advise whether something can be done to make it faster? Is there a memory issue when looping over big chunks, given that the first few files saved very quickly?*

I noticed that you are getting the fleet numbers from a database. Please create a simple standalone, runnable application (for example, a console application project) that demonstrates the problem on your side. Please include a DataTable with the fleet numbers that are causing the problem.
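
One option that may be worth trying for the per-fleet cost discussed above is to scan the paragraphs once and build a fleet-to-pages index, instead of rescanning every paragraph for each fleet. The sketch below is plain C# with no Aspose.Words dependency. FleetIndex and Build are hypothetical names, and the "Fleet No:" label followed by whitespace and digits is an assumption based on the search string used in this thread. In real code, the (text, page) pairs would come from the paragraph text and the page index obtained through the layout classes, as in the earlier examples. Whether this helps depends on where the time is actually spent: it only removes the repeated scan, not the layout work itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.Words): builds a fleet -> pages
// index from (paragraph text, page number) pairs in a single pass.
public static class FleetIndex
{
    // Assumption: each fleet header looks like "Fleet No:<whitespace><digits>".
    private static readonly Regex FleetLabel = new Regex(@"Fleet No:\s*(\d+)");

    public static Dictionary<int, SortedSet<int>> Build(
        IEnumerable<KeyValuePair<string, int>> paragraphs)
    {
        var index = new Dictionary<int, SortedSet<int>>();
        foreach (var p in paragraphs)
        {
            Match m = FleetLabel.Match(p.Key);
            if (!m.Success)
                continue; // paragraph carries no fleet label

            int fleetId = int.Parse(m.Groups[1].Value);
            if (!index.TryGetValue(fleetId, out SortedSet<int> pages))
            {
                pages = new SortedSet<int>();
                index[fleetId] = pages;
            }
            // SortedSet keeps pages unique and in ascending order.
            pages.Add(p.Value);
        }
        return index;
    }

    public static void Main()
    {
        // Made-up sample data for illustration only.
        var sample = new[]
        {
            new KeyValuePair<string, int>("Fleet No:\t12291", 2),
            new KeyValuePair<string, int>("Some other text", 2),
            new KeyValuePair<string, int>("Fleet No:\t12291", 18),
            new KeyValuePair<string, int>("Fleet No:\t12296", 5),
        };
        foreach (var entry in Build(sample))
            Console.WriteLine(entry.Key + ": " + string.Join(",", entry.Value));
    }
}
```

With such an index, the fleet loop only needs a dictionary lookup per fleet, and a fleet that is missing from the index can be logged as not found without touching the document.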

dlalwani · November 24, 2013, 7:44pm

Hi Tahir,
Thanks for your response.
I am trying to use AppendDocument as you advised, but it is saving a blank document.
I used the code below for fleet 12291, which is on pages 2, 3, 4, 18 and 216. However, the merged document is saved blank. Any idea?

foreach (DataRow drFleet in dsFleets.Tables[0].Rows)
{
    curFleetId = Convert.ToInt32(drFleet["fleet_id"]);
    Logger.LogInfo("Starting Extract to find Shield Insurance file for Fleet : " + curFleetId);
    // Start Fleet loop to extract file for each fleet
    string fleetno = "Fleet No:\t" + curFleetId;
    var collection = doc.GetChildNodes(NodeType.Paragraph, true);
    int pageStart = 0, pageEnd = 0;
    Boolean blnFirst = true;
    Document finalDocMerged = new Document();
    foreach (Paragraph para in collection)
    {
        if (curFleetId == 12291)
        {
            if (para.ToString(SaveFormat.Text).Trim().Contains(fleetno))
            {
                var renderObject = layoutCollector.GetEntity(para);
                layoutEnumerator.Current = renderObject;
                int page = layoutEnumerator.PageIndex;
                // if (blnFirst)
                if (blnFirst)
                {
                    finalDocMerged.Save(ConfigurationSettings.AppSettings["ShieldSplitFilePath"].ToString().Trim() + (txtWeekNo.IntVal + 200000).ToString() + "\\" + "ShieldInsurance_" + (txtWeekNo.IntVal + 200000).ToString() + "_" + curFleetId + ".pdf", SaveFormat.Pdf);
                    pageStart = page;
                    blnFirst = false;
                }
                pageEnd = page;
                // Save document for current page
                Logger.LogInfo("##### Found Fleet : " + curFleetId + " in Shield File in : " + page + "#####");
                // Split nodes in the document into separate pages.
                PageSplitter.DocumentPageSplitter splitter = new PageSplitter.DocumentPageSplitter(layoutCollector);
                Logger.LogInfo("&&&&& Created Page Splitter for Fleet : " + curFleetId + " and Page No : " + page + " &&&&&");
                Document finalDoc = splitter.GetDocumentOfPageRange(page, page);
                Logger.LogInfo("@@@@@ Created final doc Fleet : " + curFleetId + " and Page No : " + page + "@@@@@");
                finalDoc.Save(ConfigurationSettings.AppSettings["ShieldSplitFilePath"].ToString().Trim() + (txtWeekNo.IntVal + 200000).ToString() + "\\" + "tmp_ShieldInsurance_" + (txtWeekNo.IntVal + 200000).ToString() + "_" + curFleetId + ".pdf", SaveFormat.Pdf);
                finalDocMerged.AppendDocument(finalDoc, ImportFormatMode.UseDestinationStyles);
                finalDocMerged.Save(ConfigurationSettings.AppSettings["ShieldSplitFilePath"].ToString().Trim() + (txtWeekNo.IntVal + 200000).ToString() + "\\" + "ShieldInsurance_" + (txtWeekNo.IntVal + 200000).ToString() + "_" + curFleetId + ".pdf", SaveFormat.Pdf);
                Logger.LogInfo("====== Saved File for Fleet : " + curFleetId + " and Page No : " + page + "======");
            }
        }
    }
    if (pageStart > 0 && pageEnd > 0)
    {
        // This will build layout model and collect necessary information.
        // doc.UpdatePageLayout();
        // Logger.LogInfo("--- Updated Page Layout for Fleet : " + curFleetId + " ---");
        shieldFleetCount = shieldFleetCount + 1;
    }
    else
    {
        Logger.LogInfo("*********** NOT Found Fleet : " + curFleetId + " in Shield File.***********");
    }
    fleetCount = fleetCount + 1;
    progressBar1.PerformStep();
    progressBar1.Update();
}
// End For Fleet loop

dlalwani · November 24, 2013, 8:22pm

Hi Tahir,
Attached here is the file for my last post, where I can see fleet 12291 on pages 3, 4, 5, 18 and 216. It is saving a blank final document when I use AppendDocument. The files save correctly individually for each page, but I need them all in one final PDF… Please advise as soon as possible.
Thanks

tahir.manzoor · November 25, 2013, 4:40am

Hi Dhiraj,

Thanks for your inquiry. Please use the following code example to achieve your requirements and let us know if you have any more queries.

Document doc = new Document(MyDir + "ShieldInsurance.docx");
LayoutCollector layoutCollector = new LayoutCollector(doc);
LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
PageNumberFinder pageFinder = new PageNumberFinder(doc);
string fleetno = "12291";
var collection = doc.GetChildNodes(NodeType.Paragraph, true);
Document finalDocument = new Document();
finalDocument.RemoveAllChildren();
foreach (Paragraph para in collection)
{
    if (para.ToString(SaveFormat.Text).Trim().Contains(fleetno))
    {
        var renderObject = layoutCollector.GetEntity(para);
        layoutEnumerator.Current = renderObject;
        int page = layoutEnumerator.PageIndex;
        // Split nodes in the document into separate pages.
        PageSplitter.DocumentPageSplitter splitter = new PageSplitter.DocumentPageSplitter(layoutCollector);
        Document tempDoc = splitter.GetDocumentOfPage(page);
        finalDocument.AppendDocument(tempDoc, ImportFormatMode.KeepSourceFormatting);
    }
}
finalDocument.Save(MyDir + "12291.pdf");
finalDocument.Save(MyDir + "12291.docx");

dlalwani · June 30, 2014, 12:49am

Hi Tahir/Aspose Team,

We had this conversation late last year. Since then we have been using Aspose.Words with the code you suggested to split the Word file into separate PDFs.

The program has been running well so far. But the last few times it ran the split, the file generated for the first fleet, 12289, did not get all of its pages from the source file. Please see the attached source Word file (ShieldInsurance(2).docx) and the split .pdf file generated (incorrectly) for fleet 12289. Also attached is the PDF file for fleet 12296, which was generated correctly.

The problem is that the source Word file has 3 pages for fleet (search keyword) 12289, but when the program runs it grabs only one page. It works fine for all other fleets, e.g. 12296 was generated correctly with 2 pages.

Below is the code currently in use.

private void buildShieldPDF()
{
    try
    {
        Aspose.Words.License license = new Aspose.Words.License();
        license.SetLicense("Aspose.Words.lic");
        Document doc = new Document(ConfigurationSettings.AppSettings["FileLocationPath"].ToString().Trim() + "ShieldInsurance.docx");
        LayoutCollector layoutCollector = new LayoutCollector(doc);
        LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
        PageNumberFinder pageFinder = new PageNumberFinder(doc);
        int curFleetId;
        // Get Fleets List
        string sql;
        int fleetId = 0;
        int fleetCount;
        int shieldFleetCount;
        fleetCount = 0;
        shieldFleetCount = 0;
        sql = "SQL to get fleet list from table…";
        IDataAdapter daFleets = AppConfig.IngresProvider.GetDataAdapter(sql, AppConfig.IngresConnection);
        DataSet dsFleets = new DataSet();
        daFleets.Fill(dsFleets);
        Logger.LogInfo("No. of rows found : " + dsFleets.Tables[0].Rows.Count);
        progressBar1.Minimum = 0;
        progressBar1.Maximum = dsFleets.Tables[0].Rows.Count;
        progressBar1.Step = 1;
        if (dsFleets.Tables[0].Rows.Count > 0)
        {
            progressBar1.Visible = true;
            foreach (DataRow drFleet in dsFleets.Tables[0].Rows)
            {
                curFleetId = Convert.ToInt32(drFleet["fleet_id"]);
                Logger.LogInfo("Starting Extract to find Shield Insurance file for Fleet : " + curFleetId);
                // Start Fleet loop to extract file for each fleet
                string fleetno = "Fleet No:\t" + curFleetId;
                var collection = doc.GetChildNodes(NodeType.Paragraph, true);
                int pageStart = 0, pageEnd = 0;
                Boolean blnFirst = true;
                Document finalDocMerged = new Document();
                finalDocMerged.RemoveAllChildren();
                foreach (Paragraph para in collection)
                {
                    if (para.ToString(SaveFormat.Text).Trim().Contains(fleetno))
                    {
                        var renderObject = layoutCollector.GetEntity(para);
                        layoutEnumerator.Current = renderObject;
                        int page = layoutEnumerator.PageIndex;
                        if (blnFirst)
                        {
                            pageStart = page;
                            blnFirst = false;
                        }
                        pageEnd = page;
                        // Split nodes in the document into separate pages.
                        PageSplitter.DocumentPageSplitter splitter = new PageSplitter.DocumentPageSplitter(layoutCollector);
                        Document tempDoc = splitter.GetDocumentOfPage(page);
                        finalDocMerged.AppendDocument(tempDoc, ImportFormatMode.KeepSourceFormatting);
                    }
                }
                if (pageStart > 0 && pageEnd > 0)
                {
                    shieldFleetCount = shieldFleetCount + 1;
                    finalDocMerged.Save(ConfigurationSettings.AppSettings["ShieldSplitFilePath"].ToString().Trim() + (txtWeekNo.IntVal + 200000).ToString() + "\\" + "ShieldInsurance_" + (txtWeekNo.IntVal + 200000).ToString() + "_" + curFleetId + ".pdf");
                }
                else
                {
                    Logger.LogInfo("*********** NOT Found Fleet : " + curFleetId + " in Shield File.***********");
                }
                fleetCount = fleetCount + 1;
            }//End For Fleet loop
            MessageBox.Show("Shield Insurance File Splitted Successfully for " + shieldFleetCount + " fleets");
        }
        else
        {
            progressBar1.Visible = false;
            MessageBox.Show("No Fleet found to extract Shield Insurance File.");
        }
    }
    catch (Exception e)
    {
        progressBar1.Visible = false;
        MessageBox.Show("Some Error Occured in splitting Shield Insurance File(s)");
    }
}

tahir.manzoor · June 30, 2014, 12:09pm

Hi there,

Thanks for your inquiry.

I suggest you please use the latest version of Aspose.Words for .NET 14.5.0. Please use the following modified code. Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "ShieldInsurance.docx");
LayoutCollector layoutCollector = new LayoutCollector(doc);
LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
// Split nodes in the document into separate pages.
DocumentPageSplitter splitter = new DocumentPageSplitter(layoutCollector);
string fleetno = "12289";
var collection = doc.GetChildNodes(NodeType.Paragraph, true);
Document finalDocument = new Document();
finalDocument.RemoveAllChildren();
foreach (Paragraph para in collection)
{
    if (para.ToString(SaveFormat.Text).Trim().Contains(fleetno))
    {
        var renderObject = layoutCollector.GetEntity(para);
        layoutEnumerator.Current = renderObject;
        int page = layoutEnumerator.PageIndex;
        Document tempDoc = splitter.GetDocumentOfPage(page);
        finalDocument.AppendDocument(tempDoc, ImportFormatMode.KeepSourceFormatting);
    }
}
finalDocument.Save(MyDir + "12289.pdf");
finalDocument.Save(MyDir + "12289.docx");