

Spray Delimited Not Parsing Fields

Comments or questions specific to the features of ECL Watch

Mon Aug 04, 2014 2:27 pm

Hello,

I am trying to spray a CSV file. When I spray the file and look at it in ECL Watch, I see only two fields: "line" and "__fileposition__". Once I read it in through ECL and OUTPUT the dataset again, the fields are parsed. I've also noticed that the original file I spray is 9,806 bytes, while the re-sprayed file is 78,930 bytes.

Do you have any idea what could cause this behavior? Can I make it so that the fields are parsed on the first spray?

-Fred

Extra materials:
1) A sample of 20 lines from the file I'm trying to spray: 2012_01.20lines.csv.txt. I had to add the extension .txt to satisfy the forum system.
2) The ECL code I use to "re-spray" the file, which causes the output file to be parsed:

IMPORT $;
// Redistribute the tweets by a hash of id_str and write them to the logical file 'all6'.
OUTPUT(DISTRIBUTE($.tweets, HASH32($.tweets.id_str)),, 'all6');

3) A screenshot of the initial "two-field" configuration: http://i.imgur.com/0vVfBlC.png.
fmorstatter
 
Posts: 10
Joined: Thu Jun 05, 2014 8:28 pm

Tue Aug 05, 2014 12:25 pm

Hello,

When I sprayed your sample using " as the Quote character, I was able to read the fields in your sample file correctly.

If you are using the 5.0 ECL Watch, the Delimited Spray Option has a check box that says "Record Structure Present". Checking that box for the spray yielded the following result from your sample file:

RECORD
    STRING created_at;
    STRING entities_user_mentions;
    STRING entities_hashtags;
    STRING entities_urls;
    STRING favorite_count;
    STRING favorited;
    STRING filter_level;
    STRING geotagged;
    STRING id_str;
    STRING in_reply_to_screen_name;
    STRING in_reply_to_status_id_str;
    STRING in_reply_to_user_id_str;
    STRING lang;
    STRING lat;
    STRING lng;
    STRING place_country;
    STRING place_country_code;
    STRING place_full_name;
    STRING place_id;
    STRING place_name;
    STRING place_place_type;
    STRING place_url;
    STRING possibly_sensitive;
    STRING retweet_count;
    STRING retweeted;
    STRING retweeted_status_created_at;
    STRING retweeted_status_entities_user_mentions;
    STRING retweeted_status_entities_hashtags;
    STRING retweeted_status_entities_urls;
    STRING retweeted_status_favorite_count;
    STRING retweeted_status_favorited;
    STRING retweeted_status_geotagged;
    STRING retweeted_status_id_str;
    STRING retweeted_status_in_reply_to_screen_name;
    STRING retweeted_status_in_reply_to_status_id_str;
    STRING retweeted_status_in_reply_to_user_id_str;
    STRING retweeted_status_lang;
    STRING retweeted_status_lat;
    STRING retweeted_status_lng;
    STRING retweeted_status_place_country;
    STRING retweeted_status_place_country_code;
    STRING retweeted_status_place_full_name;
    STRING retweeted_status_place_id;
    STRING retweeted_status_place_name;
    STRING retweeted_status_place_place_type;
    STRING retweeted_status_place_url;
    STRING retweeted_status_possibly_sensitive;
    STRING retweeted_status_retweet_count;
    STRING retweeted_status_retweeted;
    STRING retweeted_status_source;
    STRING retweeted_status_text;
    STRING retweeted_status_truncated;
    STRING retweeted_status_user_contributors_enabled;
    STRING retweeted_status_user_created_at;
    STRING retweeted_status_user_default_profile;
    STRING retweeted_status_user_default_profile_image;
    STRING retweeted_status_user_description;
    STRING retweeted_status_user_favourites_count;
    STRING retweeted_status_user_followers_count;
    STRING retweeted_status_user_following;
    STRING retweeted_status_user_friends_count;
    STRING retweeted_status_user_geo_enabled;
    STRING retweeted_status_user_id_str;
    STRING retweeted_status_user_is_translation_enabled;
    STRING retweeted_status_user_is_translator;
    STRING retweeted_status_user_lang;
    STRING retweeted_status_user_listed_count;
    STRING retweeted_status_user_location;
    STRING retweeted_status_user_name;
    STRING retweeted_status_user_notifications;
    STRING retweeted_status_user_profile_background_color;
    STRING retweeted_status_user_profile_background_image_url;
    STRING retweeted_status_user_profile_background_image_url_https;
    STRING retweeted_status_user_profile_background_tile;
    STRING retweeted_status_user_profile_banner_url;
    STRING retweeted_status_user_profile_image_url;
    STRING retweeted_status_user_profile_link_color;
    STRING retweeted_status_user_profile_sidebar_border_color;
    STRING retweeted_status_user_profile_sidebar_fill_color;
    STRING retweeted_status_user_profile_text_color;
    STRING retweeted_status_user_profile_use_background_image;
    STRING retweeted_status_user_protected;
    STRING retweeted_status_user_screen_name;
    STRING retweeted_status_user_statuses_count;
    STRING retweeted_status_user_time_zone;
    STRING retweeted_status_user_url;
    STRING retweeted_status_user_utc_offset;
    STRING retweeted_status_user_verified;
    STRING source;
    STRING text;
    STRING truncated;
    STRING user_contributors_enabled;
    STRING user_created_at;
    STRING user_default_profile;
    STRING user_default_profile_image;
    STRING user_description;
    STRING user_favourites_count;
    STRING user_follow_request_sent;
    STRING user_followers_count;
    STRING user_following;
    STRING user_friends_count;
    STRING user_geo_enabled;
    STRING user_id_str;
    STRING user_is_translation_enabled;
    STRING user_is_translator;
    STRING user_lang;
    STRING user_listed_count;
    STRING user_location;
    STRING user_name;
    STRING user_notifications;
    STRING user_profile_background_color;
    STRING user_profile_background_image_url;
    STRING user_profile_background_tile;
    STRING user_profile_image_url;
    STRING user_profile_link_color;
    STRING user_profile_sidebar_border_color;
    STRING user_profile_sidebar_fill_color;
    STRING user_profile_text_color;
    STRING user_profile_use_background_image;
    STRING user_protected;
    STRING user_screen_name;
    STRING user_statuses_count;
    STRING user_time_zone;
    STRING user_url;
    STRING user_utc_offset;
    STRING user_verified;
    STRING field127;
END;


In versions prior to 5.0, the ECL Watch details for the sprayed sample show Field1, Field2, ... Field127 instead.

I am checking to see if there was an option in the DFUPlus utility that enabled this result.

Regards,

Bob
bforeman
Community Advisory Board Member
 
Posts: 1005
Joined: Wed Jun 29, 2011 7:13 pm

Tue Aug 05, 2014 2:35 pm

fmorstatter,
You wrote: 'When I spray the file and look at it in ECL watch, I see only two fields: "line", and "__fileposition__".'

Spray is a "dumb" operation. Its mission is to get the data onto your cluster as quickly as possible and its only real intelligence is to make sure that a single record never spans multiple nodes.

When you spray a CSV file, the Spray operation itself doesn't know or care what the field structure of the file is. Therefore, when you use ECL Watch to "View Data File", the DFU has no metadata about the field structure, which is why you see the data as just "line" and "__fileposition__".

As Bob pointed out, in 5.0 the Delimited (AKA: CSV) spray now has the option of reading the first record for the field names and giving you a RECORD structure. You can copy that structure into the ECL code that works with the data, which saves you from typing it all in. It still does not put that information into the DFU's metadata about the file, however.
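
For illustration, here is a minimal sketch of how the generated RECORD structure might be used to declare the sprayed file in ECL. The logical file name and the attribute names TweetFile and TweetRec are hypothetical, and the layout is abbreviated to the first three fields from Bob's listing. The CSV options assume a comma separator, a " quote character, and one header row, as in the sample.

```ecl
// Hypothetical logical file name; substitute the name you gave the file when you sprayed it.
TweetFile := '~thor::tweets::2012_01';

// Abbreviated layout for illustration only. In practice, paste the full RECORD
// structure that ECL Watch generated, keeping the fields in file order.
TweetRec := RECORD
    STRING created_at;
    STRING entities_user_mentions;
    STRING entities_hashtags;
    // ... remaining fields from the generated structure ...
END;

// HEADING(1) skips the header row; QUOTE('"') keeps commas inside quoted values in one field.
tweets := DATASET(TweetFile, TweetRec,
                  CSV(HEADING(1), SEPARATOR(','), QUOTE('"')));

// Look at the first few parsed records.
OUTPUT(CHOOSEN(tweets, 10));
```

The field parsing happens in this DATASET declaration rather than in the spray, which is consistent with the sprayed file showing only "line" and "__fileposition__" in ECL Watch.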

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1604
Joined: Wed Oct 26, 2011 7:40 pm

