Duplicate content extracted from PPT table

CBedard · July 1, 2009, 11:56am

Hello,

In the attached PPT document, we have two slides with one table each. When we extract content from these tables, we get a lot of duplicate text.

The first table seems to be a matrix of 6 columns x 17 rows. Visually it has only 5 columns (we suspect that there is a merged/hidden column at index 2), so we get duplicate content from this hidden column. More strikingly, whenever a row contains empty/merged cells, we get duplicate text from these “empty” cells.

Our program simply visits each item in the table by its row/column coordinates, as shown in the accompanying .DOC file.

Is this a bug, or is there a way to better navigate a table by considering individual cell properties to skip irrelevant ones?
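
To illustrate the kind of cell-skipping being asked about, here is a minimal, library-independent sketch in Python. It does not use the Aspose.Slides API: the Cell class and extract_unique_cell_text function are hypothetical, and it rests on the unconfirmed assumption that a merged cell is returned as the same object at every coordinate it covers. Under that assumption, remembering which cell objects have already been visited is enough to drop the repeats.

```python
class Cell:
    """Hypothetical stand-in for a table cell; not an Aspose.Slides class."""

    def __init__(self, text):
        self.text = text


def extract_unique_cell_text(grid):
    """Walk the grid by row/column and return each cell's text only once.

    Assumption: a merged cell appears as the same object at several
    coordinates, so object identity can be used to detect repeats.
    """
    seen = set()
    texts = []
    for row in grid:
        for cell in row:
            if cell is None or id(cell) in seen:
                continue  # skip gaps and cells that were already visited
            seen.add(id(cell))
            if cell.text:  # ignore cells with no text
                texts.append(cell.text)
    return texts


# A 2 x 3 grid whose first two columns in row 0 are one merged cell.
merged = Cell("Header")
grid = [
    [merged, merged, Cell("A")],
    [Cell("B"), Cell(""), Cell("C")],
]
print(extract_unique_cell_text(grid))
```

If the library instead returns distinct objects for each coordinate of a merged cell, identity alone will not work, and some other cell property would be needed to tell the copies apart.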

alcrus · July 1, 2009, 12:21pm

Hello,

If you only need to extract text without changing any of the table’s properties, it’s better to cast the Table to GroupShape and iterate over all the shapes inside it: find every Rectangle and extract its text. This should also be much faster. Cells are usually saved in the PPT format in reverse order, so you should start iterating from the last shape.
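
The traversal described above can be sketched roughly as follows. This is a library-independent model in Python, not Aspose.Slides code: the Shape class, its kind and text fields, and the extract_table_text function are all hypothetical names standing in for the group's child shapes. It only shows the idea of walking the children from last to first and keeping the text of the rectangles.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """Hypothetical child shape of a table group; not an Aspose.Slides class."""

    kind: str       # "rectangle" for a cell, anything else for borders etc.
    text: str = ""  # text carried by the shape, empty if none


def extract_table_text(shapes):
    """Collect text from rectangle shapes, iterating from the last shape.

    Assumption (from the reply above): cells are stored in reverse order,
    so reversing the list yields them in reading order.
    """
    texts = []
    for shape in reversed(shapes):
        if shape.kind != "rectangle":
            continue  # border lines and other shapes carry no cell text
        if shape.text:
            texts.append(shape.text)
    return texts


# Children as they might be stored: last cell first, with border lines mixed in.
shapes = [
    Shape("rectangle", "D"),
    Shape("line"),
    Shape("rectangle", "C"),
    Shape("rectangle", ""),
    Shape("rectangle", "B"),
    Shape("line"),
    Shape("rectangle", "A"),
]
print(extract_table_text(shapes))
```

Because each cell is one rectangle in the group, a merged cell is visited once rather than once per coordinate it covers, which is why this route avoids the duplicates.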