

Creating numbers from unicode data

Questions around writing code and queries

Mon Jun 27, 2011 11:51 pm

I'm having trouble creating numeric types from Unicode data. I'm running the 3.0.0.2 VM through VMware Player on Windows 7 Pro 64-bit. I sprayed a UTF-16BE CSV file to my cluster and was able to output a record with DATA and UNICODE fields in the ECL IDE. When I changed the field types to INTEGER or REAL, only zeroes showed up in the results window. Uploading UTF-8 data gave the same result. I then sprayed an ASCII file and used it with the same ECL code, and the numbers showed up correctly in the results window. How can I convert Unicode strings to numbers? Any help would be appreciated.

Code:
// Layout that reads the sprayed Unicode file as DATA and UNICODE fields
TestRow := RECORD
    DATA    field1;
    UNICODE field2;
    UNICODE field3;
    UNICODE field4;
END;
FileTestData := DATASET('~testa', TestRow, CSV);
OUTPUT(FileTestData);


Code:
// Same layout with numeric field types, here reading the ASCII file
TestRow := RECORD
    INTEGER field1;
    REAL    field2;
    REAL    field3;
    REAL    field4;
END;
FileTestData := DATASET('~asciitextnums', TestRow, CSV);
OUTPUT(FileTestData);
andrew
 
Posts: 1
Joined: Mon Jun 27, 2011 11:37 pm

Tue Jun 28, 2011 9:23 am

The problem is that ,CSV on the DATASET definition implies that the input file uses the 8-bit Latin-1 encoding.

If you replace CSV with UTF8 it will read the input file as a UTF8 file. i.e.,

FileTestData := DATASET('~testa', TestRow, UTF8);

The system doesn't currently support reading UTF-16BE/LE or UTF-32 files directly. However, the file spray does allow you to convert between UTF-16 and UTF-8.

And please feel free to submit a feature request for directly reading utf16... it should be possible to autodetect the format in most situations.
ghalliday
Community Advisory Board Member
 
Posts: 198
Joined: Wed May 18, 2011 9:48 am

Tue Jun 28, 2011 11:55 am

Hi Andrew,

The Language Reference also mentions this:

Casting UNICODE to VARUNICODE, STRING, or DATA is allowed, while casting to any numeric type will first implicitly cast to an ASCII STRING and then cast to the target value type.

See the section on type casting on page 51 of the Language Reference Manual.
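
As a rough illustration of the casting rule quoted above, here is a minimal sketch. The names u, r, and i and the literal values are made up for the example, and it assumes the Unicode text contains only ASCII digit characters.

```ecl
// Hypothetical values for illustration only.
UNICODE u := U'123.45';

// Casting UNICODE to a numeric type goes through an implicit STRING cast first.
REAL r := (REAL)u;

// The same kind of conversion with the intermediate STRING cast written out.
INTEGER i := (INTEGER)(STRING)U'42';

OUTPUT(r);
OUTPUT(i);
```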

Regards,

Bob Foreman
robert.foreman@lexisnexis.com
 
Posts: 5
Joined: Thu Mar 31, 2011 4:39 pm

Tue Jun 28, 2011 12:52 pm

Someone pointed out to me the correct syntax is

FileTestData := DATASET('~testa', TestRow, CSV(UNICODE));

(UTF8 is currently an undocumented synonym for UNICODE in this context.)

I suspect we should support ,UTF8 as a synonym for ,CSV(UTF8). I'll investigate that.
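
Putting the pieces of this thread together, one possible approach is to read every column as UNICODE with CSV(UNICODE) and then cast to numeric types in a PROJECT. This is only a sketch under assumptions: the file has already been sprayed as UTF-8, field1 is declared UNICODE rather than DATA, and NumRow, RawData, ToNumbers, and NumData are names invented for the example.

```ecl
// Input layout: every column is read as Unicode text.
TestRow := RECORD
    UNICODE field1;
    UNICODE field2;
    UNICODE field3;
    UNICODE field4;
END;

// Hypothetical output layout with the numeric types from the original question.
NumRow := RECORD
    INTEGER field1;
    REAL    field2;
    REAL    field3;
    REAL    field4;
END;

// Read the file as Unicode instead of the default 8-bit encoding.
RawData := DATASET('~testa', TestRow, CSV(UNICODE));

// Cast each Unicode field to its numeric target type.
NumRow ToNumbers(TestRow L) := TRANSFORM
    SELF.field1 := (INTEGER)L.field1;
    SELF.field2 := (REAL)L.field2;
    SELF.field3 := (REAL)L.field3;
    SELF.field4 := (REAL)L.field4;
END;

NumData := PROJECT(RawData, ToNumbers(LEFT));
OUTPUT(NumData);
```

Keeping the raw layout as text and converting in a separate step makes it easier to inspect values that do not parse as numbers.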
ghalliday
Community Advisory Board Member
 
Posts: 198
Joined: Wed May 18, 2011 9:48 am

Wed Jul 13, 2011 12:37 pm

ghalliday wrote: And please feel free to submit a feature request for directly reading utf16... it should be possible to autodetect the format in most situations.


What is the best method to submit a feature request?

Thank you,
Todd
thildebrant
 
Posts: 18
Joined: Mon Apr 11, 2011 4:39 pm

Wed Jul 13, 2011 1:32 pm

Hi Todd,

Please submit a Feature Request via the Community Issue Tracker on this web site. You can get there from this link:

http://hpccsystems.com/support

Best regards,

Bob Foreman
bforeman
Community Advisory Board Member
 
Posts: 1005
Joined: Wed Jun 29, 2011 7:13 pm

