Thu Jul 19, 2018 5:18 am
Login Register Lost Password? Contact Us


FROMJSON and Abstract fields

Comments and questions related to the Enterprise Control Language

Mon Apr 02, 2018 1:00 pm Change Time Zone

Hello, I'm sorry if this has been posted before, but when searching for answers I found no results which matched exactly what I want.

I'm trying to load a json dataset which contains many nested structures inside it, more specifically the Yelp challenge dataset (https://www.yelp.com/dataset/challenge).

A sample of a record would be something like the following:

Code: Select all
{
   "business_id":"FYWN1wneV18bWNgQjJ2GNg",
   "name":"Dental by Design",
   "neighborhood":"",
   "address":"4855 E Warner Rd, Ste B9",
   "city":"Ahwatukee",
   "state":"AZ",
   "postal_code":"85044",
   "latitude":33.3306902,
   "longitude":-111.9785992,
   "stars":4.0,
   "review_count":22,
   "is_open":1,
   "attributes":{
      "AcceptsInsurance":true,
      "ByAppointmentOnly":true,
      "BusinessAcceptsCreditCards":true
   },
   "categories":[
      "Dentists",
      "General Dentistry",
      "Health & Medical",
      "Oral Surgeons",
      "Cosmetic Dentists",
      "Orthodontists"
   ],
   "hours":{
      "Friday":"7:30-17:00",
      "Tuesday":"7:30-17:00",
      "Thursday":"7:30-17:00",
      "Wednesday":"7:30-17:00",
      "Monday":"7:30-17:00"
   }
}



The issue I'm currently facing is: I can't seem to be able to load the nested hours unless I specifically specify the 'day' of the week for them.

I'd like something like this:
Code: Select all
htest := RECORD
  STRING weekday;
  STRING hours;
END;

lcategory := RECORD
  STRING category;
END;

namesRec := RECORD 
  UNSIGNED2 EmployeeID{xpath('EmpID')}; 
  STRING10 Firstname{xpath('FName')}; 
  STRING10 Lastname{xpath('LName')};
  DATASET(htest) hHours{xpath('hours')};
  DATASET(lcategory) NCategory{xpath('categories')};
  SET OF STRING SSCategory{xpath('categories')};


And I'd expect the days such as 'monday', 'tuesday', etc to come under the field 'weekday', while the hours would come under the field 'hours'.

But, while this syntax does compile, I get empty fields as a result.

Is it possible to load data like this/ what should be changed to achieve that effect?

For more info: I've verified that I can successfully retrieve the hour for a specific day by having a RECORD structure such as:
Code: Select all
test := RECORD
  STRING friday{xpath('Friday')};
END;


But that would render me with a field for each day of the week, instead of just a 'weekdays' field, which is something I'd not want.
This issue would be even bigger when trying to map a field such as 'attributes', which basically contains a multitude of different possible keys.


Thanks,
Felipe.
fgbulsoni
 
Posts: 2
Joined: Mon Apr 02, 2018 12:34 pm

Tue Apr 03, 2018 10:40 am Change Time Zone

Felipe,

Can you please attach a file containing the first half dozen or so records from the actual JSON file?

Thanks,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1366
Joined: Wed Oct 26, 2011 7:40 pm

Tue Apr 03, 2018 11:30 am Change Time Zone

Hey Richard, thank you for the reply!

And sure, here's a file containing the first half dozen records of the dataset:
business_sample.txt
First half dozen records from business sample
(5.7 KiB) Downloaded 21 times


Thanks,
Felipe
fgbulsoni
 
Posts: 2
Joined: Mon Apr 02, 2018 12:34 pm

Wed Apr 04, 2018 1:43 pm Change Time Zone

Felipe,

Not every XML or JSON file is going to be easily definable in ECL, and this is one of those. The primary mission of any XML or JSON file processing is to extract the relevant data from the file and get it into a format that is native to HPCC and can work efficiently. Here's my code that does that:
Code: Select all
IMPORT Std;
//start by getting the most data possible directly from the JSON file
// and defining the max sizes of all the fields it's possible to determine:
Layout := RECORD
  STRING22 business_id{xpath('business_id')};
  STRING40 name{xpath('name')};
  STRING40 address{xpath('address')};
  STRING1  neighborhood{xpath('neighborhood')};
  STRING20 city{xpath('city')};
  STRING2  state{xpath('state')};
  STRING5  postal_code{xpath('postal_code')};
  STRING12 latitude{xpath('latitude')};
  STRING12 longitude{xpath('longitude')};
  STRING4  stars{xpath('stars')};
  STRING4  review_count{xpath('review_count')};
  STRING1  is_open{xpath('is_open')};
  STRING11 hours1{xpath('hours/Monday')};
  STRING11 hours2{xpath('hours/Tuesday')};
  STRING11 hours3{xpath('hours/Wednesday')};
  STRING11 hours4{xpath('hours/Thursday')};
  STRING11 hours5{xpath('hours/Friday')};
  STRING11 hours6{xpath('hours/Saturday')};
  STRING11 hours7{xpath('hours/Sunday')};
  SET OF STRING SSCategory{xpath('categories')};   
  STRING   attributes;
END;

JSONds := DATASET('~rttest::json::yelp::business_sample-fixed.txt',Layout,JSON('/'));

//re-define the same logical file as a CSV DATASET
// for simple parsing of the Atttributes tag contents of each record
rec := {STRING line};
CSVds := DATASET('~rttest::json::yelp::business_sample-fixed.txt',rec,
                 CSV(SEPARATOR('')));
JustData := CSVds(line[1] NOT IN ['[',']']); //remove "empty" recs
// JustData;

//extract just the business_id and Attributes tag content:
PrjRec := RECORD
  STRING business_id;
  STRING Attributes;
END;
AttrTbl := PROJECT(JustData,
                   TRANSFORM(PrjRec,
                             SELF.business_id := LEFT.line[18..39];
                             AttrPos := Std.Str.Find(LEFT.line,'"attributes"',1);
                             CatPos  := Std.Str.Find(LEFT.line,'"categories"',1);
                             SELF.Attributes := LEFT.line[AttrPos+15 .. CatPos-1]));
AttrTbl; //business_id to join to JSON data, and Attributes to post-process

//then JOIN to the JSONds to fill in the Attributes field
AttrDS := JOIN(JSONds,AttrTbl,LEFT.business_id=RIGHT.business_id,
               TRANSFORM(Layout,SELF.Attributes := RIGHT.Attributes,SELF := LEFT));
//and write the result to disk file
OUTPUT(AttrDS,ALL);
Note that the Categories and Attributes data will require post-processing. The Categories, being a simple comma-delimited set of strings, is fairly straight-forward, as are the hours fields.

The Attributes data will require more complex parsing. It would be possible to define this all in the RECORD structure using XPATH, but that would mean first determining all the unique tag names within it for all the records, and any nested child datasets for any repeating structures (and there are some). That's why I just parsed out the content, and you can post-process that string any way you need to extract whatever data is relevant to your product.

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1366
Joined: Wed Oct 26, 2011 7:40 pm


Return to ECL

Who is online

Users browsing this forum: No registered users and 1 guest