Slow performance while extracting items from PST

zach.evans · July 31, 2019, 7:46pm

java version “1.8.0_181”
Java™ SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot™ 64-Bit Server VM (build 25.181-b13, mixed mode)

Aspose Email for Java 19.6

We are using Aspose Email for Java to extract items from a PST as MSGs, saving them to disk with the PersonalStorage.saveMessageToFile API. We have encountered a few cases where some PSTs take a long time to save items. We are seeing averages of around 1 second per item, with a maximum as high as 20 seconds. We do not believe this is due to disk IO, as IOPS are pretty low. We did notice high load averages on a single CPU core. These PSTs range from 10GB to 30GB and contain items in the low hundreds of thousands.

I’m aware that large PSTs can have fragmentation issues and that the index in the PST can become less than optimised. I’m wondering if there might be some improvements to be had in these cases. Does Aspose Email provide a “recovery” mechanism, or some other mechanism in the API, that might improve performance? We noticed that running Outlook ScanPST on these PSTs before extracting the items as MSGs significantly improves performance. For comparison, we are able to extract items from these PSTs using Outlook Redemption without these performance issues. We are unable to share the PST files that exhibit this issue, but we are attempting to produce a synthetic example we can share.

We are using much the same code as in this example, except that we call PersonalStorage.saveMessageToFile instead of instantiating a MapiMessage. We observed that the MapiMessage approach was also slow, and that PersonalStorage.saveMessageToFile was slightly faster.

In the meantime, is there any guidance you can offer to help improve performance here? We appreciate the help.

Adnan.Ahmad · August 1, 2019, 12:41am

@tucker.barbour,

Can you please share the source file along with sample code so that we can investigate further and help you out?

russ.nichols · August 16, 2019, 3:52pm

Just wanted to share our experience on a similar case.

We parallelized traversal of the PST by concurrently running the API
MessageInfoCollection messageInfoCollection = folder.GetContents(firstMessage, count);
for subsets of the PST (i.e. extracting 1000 message IDs at a time),
but for fragmented PSTs we noticed that when accessing the last messages in a folder,
the API would still touch the entire PST.

For example, we noticed 10K+ seeks into a 30GB PST to get the last 1000 messages of a folder that
had 180000 messages in total.

So our problem would be solved if there were a way to avoid seeking through the whole PST and
just jump to the offset where the requested messages are located.
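
The chunked traversal described above comes down to splitting a folder's message count into (start, count) pairs, one per GetContents call. A minimal sketch of that split, assuming the folder's total is known up front; the class and method names are hypothetical and not part of the Aspose API:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkRanges {
    /**
     * Splits the indices 0..total-1 into {startIndex, count} pairs of at most
     * chunkSize items each. Each pair is meant to be passed to a call such as
     * folder.GetContents(firstMessage, count).
     */
    public static List<int[]> split(int total, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < total; start += chunkSize) {
            // The final chunk may be shorter than chunkSize.
            ranges.add(new int[] { start, Math.min(chunkSize, total - start) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The folder from the post: 180000 messages read 1000 at a time.
        List<int[]> ranges = split(180000, 1000);
        int[] last = ranges.get(ranges.size() - 1);
        System.out.println(ranges.size() + " ranges, last = (" + last[0] + ", " + last[1] + ")");
    }
}
```

For the 180000-message folder this yields 180 ranges, the last one starting at index 179000, which is the request that triggered the 10K+ seeks.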

mudassir.fayyaz · August 16, 2019, 7:07pm

@russ.nichols,

I have observed the information you shared. Could you please provide a working sample project, along with the source file and a snapshot verifying the requirements? We will be able to investigate the issue further on our end once we have the requested information.

zach.evans · October 22, 2019, 12:56pm

I implemented a parallelised traversal of the PST, which showed a significant performance improvement, though I’m still not sure why. Basically, for each folder we calculate a set of offsets, each representing a range of items in that folder. We then allocate each offset to a thread. The thread opens a new file handle to the PST, via PersonalStorage.fromFile, and seeks to the provided offset via Folder.getContents(index, size). Again, I’m not sure why this results in faster execution, but I imagine opening a new file handle and seeking immediately to the offset we’re interested in helps in some way.
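
The approach above might be structured roughly as follows. This is only a sketch under stated assumptions: Handle and HandleFactory are hypothetical stand-ins (not Aspose types) for a PST handle opened per task via PersonalStorage.fromFile and a range read via getContents(index, size), and opening one handle per range rather than per thread is also an assumption. An in-memory fake is used so the example runs without a PST.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExtract {
    /** Hypothetical per-task handle; a real one would wrap its own PersonalStorage. */
    public interface Handle extends AutoCloseable {
        /** Saves the items in [start, start + count) and returns how many were saved. */
        int saveRange(int start, int count) throws Exception;
        @Override void close();
    }

    /** Hypothetical factory; a real one would open the PST file anew on each call. */
    public interface HandleFactory {
        Handle open() throws Exception;
    }

    /** Splits one folder of `total` items into ranges and processes them on a fixed pool. */
    public static int extract(HandleFactory factory, int total, int chunkSize, int threads)
            throws Exception {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int start = 0; start < total; start += chunkSize) {
                final int from = start;
                final int count = Math.min(chunkSize, total - start);
                Callable<Integer> task = () -> {
                    // Each task opens and closes its own handle; nothing is shared.
                    try (Handle handle = factory.open()) {
                        return handle.saveRange(from, count);
                    }
                };
                results.add(pool.submit(task));
            }
            int saved = 0;
            for (Future<Integer> result : results) {
                saved += result.get(); // rethrows any task failure
            }
            return saved;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory fake: reports the requested count as "saved".
        HandleFactory fake = () -> new Handle() {
            @Override public int saveRange(int start, int count) { return count; }
            @Override public void close() { }
        };
        System.out.println(extract(fake, 2500, 1000, 4)); // prints 2500
    }
}
```

Keeping the handle local to each task means no state is shared between threads, which matches the description of every thread opening its own file handle.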

mudassir.fayyaz · October 22, 2019, 4:58pm

@tucker.barbour,

I have observed your comments. It seems that you have devised a mechanism that is working fine on your end, and you have shared your experience with us and others for future reference.