Spraying Word document in HPCC THOR

Tue Apr 01, 2014 8:31 am

Dear Team,

We are looking at the following business case:
- Upload/spray multiple Word documents (unstructured data) into a distributed system
- Provide a UI for the feature below
- Searching for a word or sentence should return the matching section/paragraph from the sprayed documents

My idea is to use HPCC. As far as I understand HPCC, I can spray the uploaded data to THOR, build an index on it there, and then write a Roxie query to provide the search functionality.
Please confirm whether my understanding is correct.

My question is whether HPCC supports uploading/spraying Word documents.
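
To make the index-then-query idea above concrete, here is a minimal sketch in plain Python rather than ECL. It is not HPCC code: the function names (`tokenize`, `build_index`, `search`) are made up for illustration, and it assumes each document has already been split into plain-text paragraphs. It builds a word-to-paragraph index and returns the paragraphs that contain the query as a contiguous phrase.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(paragraphs):
    """Map each word to the set of paragraph positions that contain it."""
    index = defaultdict(set)
    for position, paragraph in enumerate(paragraphs):
        for word in tokenize(paragraph):
            index[word].add(position)
    return index


def search(query, paragraphs, index):
    """Return paragraphs containing the query words as a contiguous phrase."""
    words = tokenize(query)
    if not words:
        return []
    # Use the index to narrow down to paragraphs that contain every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Then confirm the words appear next to each other, in order.
    phrase = " " + " ".join(words) + " "
    return [
        paragraphs[i]
        for i in sorted(candidates)
        if phrase in " " + " ".join(tokenize(paragraphs[i])) + " "
    ]
```

In this sketch the index only narrows the candidate set; the phrase check is a second pass over the surviving paragraphs, so a sentence search never scans the whole collection.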
Posts: 9
Joined: Wed Oct 30, 2013 9:22 am

Tue Apr 01, 2014 1:47 pm

I did a quick test on my machine. Although HPCC does not complain when uploading and spraying Word documents, the resulting dataset (read back from the sprayed file) contains a lot of unnecessary junk data that you don't want.

If possible, can you convert the Word documents to plain-text (.txt) files and work with those instead?

Posts: 66
Joined: Wed Oct 05, 2011 10:09 am

Tue Apr 01, 2014 2:31 pm

Before you upload Word documents, first convert them to plain text (Word can do that for you).

Alternatively, you can find C++ code on the web that converts Word documents to plain text, and wrap it in an ECL function using BEGINC++ and ENDC++ (see the ECL Language Reference for the details).
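
If the conversion is done before spraying, a small script may be enough. The following is a rough Python sketch, not an HPCC feature, and it only handles .docx files (which are ZIP archives holding the body text in word/document.xml); legacy binary .doc files need a different tool. The function names are hypothetical, and headers, footers, footnotes, and tables are not treated specially.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_paragraphs(source):
    """Return the non-empty paragraphs of a .docx file as plain strings.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def docx_to_text_file(source, target):
    """Write one paragraph per line, ready to be sprayed as a text file."""
    with open(target, "w", encoding="utf-8") as out:
        for paragraph in docx_to_paragraphs(source):
            out.write(paragraph + "\n")
```

Writing one paragraph per line keeps paragraph boundaries intact, which helps if the search should later return the matching paragraph.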
Posts: 260
Joined: Mon May 07, 2012 6:23 pm

Wed Dec 31, 2014 5:26 am

I have about 10 files that are related by a 3-field composite key (roughly 45 bytes in total). In the RDBMS world, I would be inclined to convert the natural key into a numeric surrogate key to reduce the footprint and, hopefully, speed up sorts and joins. The 3 key fields make up 20-30% of the total data size.
Last edited by faarisuman on Wed Jan 14, 2015 5:33 am, edited 1 time in total.
Posts: 1
Joined: Wed Dec 31, 2014 5:23 am

Wed Dec 31, 2014 8:10 pm


That sounds like a reasonable approach to use in HPCC, too.
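
For illustration only, here is the surrogate-key idea as a small Python sketch rather than ECL. The function name, the field names in the usage note, and the `sk` column are all hypothetical. Each distinct composite key gets the next integer, and the records are rewritten to carry only that integer.

```python
def assign_surrogate_keys(records, key_fields):
    """Replace a composite natural key with a compact integer surrogate.

    Returns (mapping, slim_records): `mapping` goes from the natural-key
    tuple to its integer, and `slim_records` are copies of the input
    records with the key fields dropped and an "sk" field added.
    """
    mapping = {}
    slim_records = []
    for record in records:
        natural_key = tuple(record[field] for field in key_fields)
        # First sighting of a key gets the next integer; repeats reuse it.
        surrogate = mapping.setdefault(natural_key, len(mapping) + 1)
        slim = {k: v for k, v in record.items() if k not in key_fields}
        slim["sk"] = surrogate
        slim_records.append(slim)
    return mapping, slim_records
```

Called as `assign_surrogate_keys(rows, ("region", "account", "period"))`, it returns the lookup table plus the slimmed rows. The same mapping has to be applied to every related file so that joins on `sk` still line up.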


Community Advisory Board Member
Posts: 1606
Joined: Wed Oct 26, 2011 7:40 pm
