Spraying Word document in HPCC THOR
Dear Team,
We have the following business case:
- Upload/spray multiple Word documents (unstructured data) into a distributed system
- Provide a UI for the following feature:
- Searching for a word or sentence should return the matching section/paragraph from the sprayed documents
My idea is to use HPCC. As far as I understand HPCC, I can spray the uploaded data and build an index on it in Thor, then build a query in Roxie to provide the search functionality.
Please confirm whether my understanding is correct.
My question is whether HPCC supports uploading/spraying Word documents.
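To make the search requirement concrete, here is a minimal sketch in plain Python (not ECL, and not how Thor or Roxie implement indexing) of a paragraph-level index: each word maps to the paragraphs that contain it, and a query returns the paragraphs where the words appear as a phrase. The function names and the blank-line paragraph rule are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines (an assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def build_index(docs):
    """Map each lowercase word to the set of (doc_name, paragraph_no) containing it."""
    index = defaultdict(set)
    paragraphs = {}
    for name, text in docs.items():
        for i, para in enumerate(split_paragraphs(text)):
            paragraphs[(name, i)] = para
            for word in re.findall(r"\w+", para.lower()):
                index[word].add((name, i))
    return index, paragraphs


def search(query, index, paragraphs):
    """Return (key, paragraph) pairs whose text contains the query words as a phrase."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    # Candidate paragraphs must contain every query word.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    phrase = " " + " ".join(words) + " "
    results = []
    for key in sorted(hits):
        normalized = " " + " ".join(re.findall(r"\w+", paragraphs[key].lower())) + " "
        # Keep only paragraphs where the words occur consecutively, in order.
        if phrase in normalized:
            results.append((key, paragraphs[key]))
    return results
```

In HPCC terms, the index build would roughly correspond to the Thor step and the search function to the Roxie query, but that mapping is only an analogy.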
- rajesh.dorairaj
- Posts: 9
- Joined: Wed Oct 30, 2013 9:22 am
I did a quick test on my machine. Although HPCC does not complain when uploading and spraying Word documents, the dataset populated from the sprayed file contains a lot of unnecessary junk data that you don't want.
If possible, can you convert the Word documents into text files and work with those instead?
Regards,
Sameer
- sameermsc
- Posts: 66
- Joined: Wed Oct 05, 2011 10:09 am
Before you upload Word documents, first convert them to plain text (Word can do that for you).
Alternatively, you can find C++ code on the web that converts Word documents to plain text, and you can make an ECL function from this code by wrapping it with BEGINC++ and ENDC++ (please read the ECL Language Reference for the details).
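As one possible way to do the conversion before spraying, here is a minimal sketch in Python using only the standard library. It assumes the files are in the newer .docx format (a zip archive containing word/document.xml); legacy binary .doc files are not handled. The function names are hypothetical, and writing one paragraph per line is an assumption made so that each line could later be treated as one record.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as a list of strings."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join all w:t nodes.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def docx_to_text_file(docx_path, txt_path):
    """Write one paragraph per line to txt_path; return the paragraph count."""
    paragraphs = docx_to_paragraphs(docx_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write("\n".join(paragraphs))
    return len(paragraphs)
```

This only covers simple documents; headers, footnotes, and text boxes live elsewhere in the archive or nest differently and would need extra handling.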
- tlhumphrey2
- Posts: 260
- Joined: Mon May 07, 2012 6:23 pm
I have about 10 files that are related by a 3 field composite key (roughly 45 bytes total). In the RDBMS world I would be inclined to convert the natural key into a numeric surrogate key to reduce the footprint and hopefully improve sorts and joins. The 3 key fields constitute 20-30% of the total data size.
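The surrogate-key idea can be sketched in plain Python as follows (this is not ECL, and the field names are hypothetical). Each distinct combination of the three natural key fields gets a small integer, and every record then carries that integer instead of the wide composite key. The sketch runs in a single process; assigning IDs across a distributed cluster would need additional coordination that is not shown here.

```python
def assign_surrogate_keys(records, key_fields):
    """Replace the composite natural key in each record with a numeric surrogate.

    Returns (new_records, mapping), where mapping goes from the natural key
    tuple to its surrogate ID. IDs start at 1 in order of first appearance.
    """
    mapping = {}
    new_records = []
    for rec in records:
        natural = tuple(rec[f] for f in key_fields)
        # Reuse the existing ID for a known key, otherwise take the next one.
        surrogate = mapping.setdefault(natural, len(mapping) + 1)
        slim = {k: v for k, v in rec.items() if k not in key_fields}
        slim["sk"] = surrogate
        new_records.append(slim)
    return new_records, mapping
```

The same mapping would have to be applied to all of the related files so that joins on the surrogate still line up.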
Last edited by faarisuman on Wed Jan 14, 2015 5:33 am, edited 1 time in total.
- faarisuman
- Posts: 1
- Joined: Wed Dec 31, 2014 5:23 am
faari,
That sounds like a reasonable approach to use in HPCC, too.
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1576
- Joined: Wed Oct 26, 2011 7:40 pm