Spraying Word document in HPCC THOR

Tue Apr 01, 2014 8:31 am

Dear Team,

We are looking at the following business case:
- Upload/spray multiple Word documents (unstructured data) into a distributed system
- Provide a UI with the following feature:
  - Searching for a word or sentence should return the matching section/paragraph from the sprayed documents

My idea is to use HPCC. As far as I understand HPCC, I can spray the uploaded data into Thor and build an index on it there, then build a query in Roxie to provide the search functionality.
Please confirm whether my understanding is correct.

My question is whether HPCC supports uploading/spraying Word documents.
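
The index-then-search idea described in this post can be illustrated outside HPCC with a small Python sketch. It is not ECL and does not use any HPCC API; it only shows the shape of a paragraph-level inverted index, assuming the documents have already been converted to plain text. The names `build_index`, `search`, and the `docs` dictionary are hypothetical.

```python
import re
from collections import defaultdict


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(text):
    """Lowercase the text and return its alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of (doc_name, paragraph_no) where it occurs.

    `docs` is a hypothetical dict of {document name: plain text}.
    """
    paragraphs = {}
    index = defaultdict(set)
    for name, text in docs.items():
        for n, para in enumerate(split_paragraphs(text)):
            paragraphs[(name, n)] = para
            for word in tokenize(para):
                index[word].add((name, n))
    return index, paragraphs


def search(query, index, paragraphs):
    """Return (key, paragraph) pairs whose paragraph contains the query.

    A one-word query matches on that word; a multi-word query must
    appear as a consecutive phrase (ignoring case and punctuation).
    """
    words = tokenize(query)
    if not words:
        return []
    # Paragraphs that contain every query word, in any order.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    phrase = " " + " ".join(words) + " "
    hits = []
    for key in sorted(candidates):
        normalized = " " + " ".join(tokenize(paragraphs[key])) + " "
        if phrase in normalized:
            hits.append((key, paragraphs[key]))
    return hits
```

In this sketch, `build_index` plays the role of the indexing step and `search` plays the role of the query step; the index narrows the candidates, and the phrase check then confirms the word order.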
rajesh.dorairaj
 
Posts: 9
Joined: Wed Oct 30, 2013 9:22 am

Tue Apr 01, 2014 1:47 pm

I did a quick test on my machine. Even though HPCC does not complain while uploading and spraying Word documents, the populated dataset (after reading the sprayed file) contains a lot of unnecessary junk data that you don't want.

If possible, can you convert the Word documents into .txt files and work on those instead?

Regards,
Sameer
sameermsc
 
Posts: 66
Joined: Wed Oct 05, 2011 10:09 am

Tue Apr 01, 2014 2:31 pm

Before you upload Word documents, first convert them to plain text (Word can do that for you).

Or, on the web, you can find C++ code that converts Word documents to plain text. You can then make an ECL function from this code by wrapping it with BEGINC++ and ENDC++ (please read the ECL Language Reference for the details).
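
As a rough illustration of the conversion step (in Python rather than the C++ mentioned above, and run before spraying rather than inside ECL), the sketch below pulls the paragraph text out of a .docx file using only the standard library. It assumes the newer .docx format, which is a ZIP archive whose main body lives in `word/document.xml`; it does not handle the older binary .doc format, and it ignores headers, footnotes, tabs, and line breaks. The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as plain strings."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def docx_to_text(path):
    """Join the paragraphs with blank lines, ready to save as a .txt file."""
    return "\n\n".join(docx_to_paragraphs(path))
```

Writing the result of `docx_to_text` to a .txt file gives plain text that keeps paragraph boundaries, which is what a paragraph-level search would need.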
tlhumphrey2
 
Posts: 260
Joined: Mon May 07, 2012 6:23 pm

Wed Dec 31, 2014 5:26 am

I have about 10 files that are related by a 3-field composite key (roughly 45 bytes in total). In the RDBMS world I would be inclined to convert the natural key into a numeric surrogate key to reduce the footprint and, hopefully, improve sorts and joins. The 3 key fields constitute 20-30% of the total data size.

Last edited by faarisuman on Wed Jan 14, 2015 5:33 am, edited 1 time in total.
faarisuman
 
Posts: 1
Joined: Wed Dec 31, 2014 5:23 am

Wed Dec 31, 2014 8:10 pm

faari,

That sounds like a reasonable approach to use in HPCC, too.
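
For readers unfamiliar with the surrogate-key idea, here is a minimal single-process sketch in Python (not ECL). It assigns one small integer to each distinct composite key and replaces the key fields in every record with that integer. The field names and the `assign_surrogate_keys` helper are hypothetical, and on a distributed cluster the numbering step would have to be done differently.

```python
def assign_surrogate_keys(records, key_fields):
    """Replace the composite natural key in each record with an integer id.

    Returns (key_map, rewritten), where key_map maps each distinct
    natural-key tuple to its surrogate id (numbered from 1 in first-seen
    order) and rewritten holds the records with the key fields removed
    and a "sid" field added.
    """
    key_map = {}
    rewritten = []
    for rec in records:
        natural = tuple(rec[f] for f in key_fields)
        # Reuse the id if this key was seen before, else take the next one.
        sid = key_map.setdefault(natural, len(key_map) + 1)
        rest = {k: v for k, v in rec.items() if k not in key_fields}
        rewritten.append({"sid": sid, **rest})
    return key_map, rewritten
```

The same `key_map` would have to be applied to every file that shares the key, so that joins on `sid` give the same matches as joins on the original three fields.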

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1619
Joined: Wed Oct 26, 2011 7:40 pm

