Reading a UTF-16 file

Questions around writing code and queries

Wed Jan 08, 2020 4:00 pm

Hi all,

I saw a thread from 2011 which stated:

The system doesn't currently support direct reading of utf16be/le, utf32 files. However the file spray does allow you to convert to/from utf16 to utf8.

And please feel free to submit a feature request for directly reading utf16


Does anyone know whether this is still the case, i.e. that there is no support for UTF-16?

I have used encoding := 'utf16le' as part of my call to FileServices.SprayVariable (it seems our version doesn't support this in STD.File.SprayDelimited, even though the docs in that environment show it).

When I then try to read it through a DATASET definition, the data is not clean. I have tried several options:

Code:
DATASET('logical_file_name', layout, CSV(HEADING(1),SEPARATOR(',')))
DATASET('logical_file_name', layout, CSV(HEADING(1),SEPARATOR(','), UNICODE))
DATASET('logical_file_name', layout, CSV(HEADING(1),SEPARATOR(','), UNICODE16))
Last edited by SChatman85 on Fri Jan 10, 2020 9:02 am, edited 1 time in total.
SChatman85
 
Posts: 14
Joined: Mon Sep 02, 2019 2:13 pm
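
The 2011 quote above mentions converting UTF-16 to UTF-8 at spray time. If that is not available on a given cluster, one fallback is to transcode the file before it is sprayed. The following is a minimal sketch in Python under that assumption; it is not something proposed in this thread, and the helper name `transcode_utf16_to_utf8` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def transcode_utf16_to_utf8(src, dst):
    """Copy a UTF-16 text file to UTF-8 (hypothetical helper, not an HPCC API)."""
    # The "utf-16" codec reads the byte order mark and picks the byte order from it.
    # newline="" disables newline translation, so CRLF and CR terminators pass through as-is.
    with open(src, "r", encoding="utf-16", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for chunk in iter(lambda: fin.read(65536), ""):
            fout.write(chunk)


# Small demonstration on a made-up two-line file with mixed CRLF and CR terminators.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input_utf16.csv"
    dst = Path(tmp) / "input_utf8.csv"
    src.write_bytes(b"\xff\xfe" + '"A","B"\r\n"1","2"\r'.encode("utf-16-le"))
    transcode_utf16_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted)
```

The resulting file could then be sprayed and read as ordinary UTF-8 delimited data, at the cost of an extra preprocessing step outside the platform.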

Wed Jan 08, 2020 5:06 pm

Hi Stewart,

In the Language Reference Manual, there is support for UTF-16 using the UNICODE field value type in the RECORD structure. The ECL Watch allows delimited spraying using a variety of UTF options. There is also support for converting to/from UNICODE formats using the FROMUNICODE and TOUNICODE functions.

Specifically what are you trying to do?

Regards,

Bob
bforeman
Community Advisory Board Member
 
Posts: 1002
Joined: Wed Jun 29, 2011 7:13 pm

Wed Jan 08, 2020 5:11 pm

Hi Bob,

Thanks for the reply. I'm just looking to read a file which is provided in the format of:

Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators

I added the encoding value as mentioned above, so the data looks fine when viewing the Sprayed file in ECL Watch - but if I put it into a dataset definition, and output it, then I get extra characters which I am assuming is the extra byte.

Octal dump of raw file:

0003760 " \0 I \0 n \0 s \0 t \0 a \0 l \0 m \0
0004000 e \0 n \0 t

ECL IDE output of the dataset:

"Instalment

Expected output: Instalment

Hope that makes sense.
SChatman85
 
Posts: 14
Joined: Mon Sep 02, 2019 2:13 pm
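
The octal dump in the post above shows every visible character followed by a `\0` byte. As an illustration only (it does not describe what the DATASET reader does internally), this Python sketch shows how UTF-16LE bytes turn into text with interleaved NUL characters when they are read one byte per character, and how they decode cleanly when read as UTF-16LE. The sample string is taken from the dump; the variable names are made up.

```python
# Build the same byte pattern as the octal dump: each ASCII character plus a zero byte.
raw = '"Instalment'.encode("utf-16-le")

# Reading one byte per character keeps every zero byte as a NUL character.
as_single_byte = raw.decode("latin-1")

# Reading two bytes per character (UTF-16LE) recovers the original text.
as_utf16 = raw.decode("utf-16-le")

print(repr(as_single_byte))
print(repr(as_utf16))
```

NUL characters are usually invisible or rendered as odd glyphs in a results grid, which would be consistent with the "extra characters" described above.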

Wed Jan 08, 2020 5:19 pm

I added the encoding value as mentioned above, so the data looks fine when viewing the Sprayed file in ECL Watch - but if I put it into a dataset definition, and output it, then I get extra characters which I am assuming is the extra byte.


What is the version of the HPCC cluster you are using? Can you look at the ECL tab of the sprayed file and see the RECORD structure generated? If you use that RECORD with your DATASET how does the OUTPUT look?

Bob
bforeman
Community Advisory Board Member
 
Posts: 1002
Joined: Wed Jun 29, 2011 7:13 pm

Wed Jan 08, 2020 6:16 pm

SChatman85,
ECL IDE output of the dataset:

"Instalment

Expected output: Instalment
Can you show us your ECL definition of the RECORD structure and DATASET declaration that produced this result, please?

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1556
Joined: Wed Oct 26, 2011 7:40 pm

Thu Jan 09, 2020 12:00 pm

Hi Bob,

1. Our Production cluster is currently running on 6.4.38.
2. The definition is showing in ECL watch as:

Code:
RECORD
    UTF8 field1;
    UTF8 field2;
    ...


I tried using this as my layout in the DATASET definition, but it made no difference.


Hi Richard.

Starting with the DATASET definition, I have been trying the following variations:

Code:
DATASET( 'logical_filename',input_lay, CSV( HEADING(1),SEPARATOR(','), TERMINATOR(['\n', '\r\n']), MAXLENGTH(40000)))

Code:
DATASET( 'logical_filename',input_lay, CSV( HEADING(1),SEPARATOR(','), TERMINATOR(['\n', '\r\n']), UNICODE, MAXLENGTH(40000)))


I have now tried the following three layout definitions:

Code:
input_lay := RECORD
    UTF8 field1;
    UTF8 field2;
    UTF8 field3;
    ...
END;


Code:
input_lay := RECORD
    STRING field1;
    STRING field2;
    STRING field3;
    ...
END;


Code:
input_lay := RECORD
    UNICODE field1;
    UNICODE field2;
    UNICODE field3;
    ...
END;
SChatman85
 
Posts: 14
Joined: Mon Sep 02, 2019 2:13 pm

Thu Jan 09, 2020 2:16 pm

Hi Stewart,

In the DATASET, the SEPARATOR and TERMINATOR options are probably not needed since you are using the Delimited spray defaults.

Try adding a locale to the UTF8 field in the RECORD (or specify the proper LOCALE in the RECORD statement itself). According to the docs:

The optional locale specifies a valid unicode locale code, as specified in ISO standards 639 and 3166 (not needed if LOCALE is specified on the RECORD structure containing the field definition).


If you are still having trouble reading the file, I would suggest submitting a JIRA with all of the details and perhaps some sample data if possible. If you are reading the data in the ECL Watch properly, but not in the ECL IDE, there could be an issue there. I would also try the ECL command line and see what your result looks like in the console.

Regards,

Bob
bforeman
Community Advisory Board Member
 
Posts: 1002
Joined: Wed Jun 29, 2011 7:13 pm

Thu Jan 09, 2020 2:59 pm

Stewart,
Octal dump of raw file:

0003760 " \0 I \0 n \0 s \0 t \0 a \0 l \0 m \0
0004000 e \0 n \0 t
Could you post a Hex Dump of this data? Your Octal dump looks to me like it's showing 2 bytes per character, with a Hex 00 as the second byte of each pair. If that's the case, then instead of defining the fields with UTF8 I'd suggest you try using UNICODE.

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1556
Joined: Wed Oct 26, 2011 7:40 pm
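
The "2 bytes per character with a zero byte" observation above can be turned into a quick check. This is a rough heuristic sketched in Python, assuming mostly ASCII content; the function `guess_utf16_byte_order` is hypothetical, and it does not attempt to tell UTF-16 from UTF-32.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order of mostly-ASCII data (hypothetical helper)."""
    # A byte order mark, when present, settles the question directly.
    if data[:2] == b"\xff\xfe":
        return "utf-16-le (BOM)"
    if data[:2] == b"\xfe\xff":
        return "utf-16-be (BOM)"
    # Otherwise count zero bytes at even and odd offsets.
    # ASCII text in little-endian UTF-16 has its zero bytes at odd offsets.
    even_zeros = sum(1 for b in data[0::2] if b == 0)
    odd_zeros = sum(1 for b in data[1::2] if b == 0)
    if odd_zeros > even_zeros:
        return "utf-16-le (guess)"
    if even_zeros > odd_zeros:
        return "utf-16-be (guess)"
    return "unknown"


# The first bytes of the octal dump: a quote, "I" and "n", each followed by a zero byte.
sample = b'"\x00I\x00n\x00'
print(guess_utf16_byte_order(sample))
```

A hex dump of the very start of the file remains the more reliable check, because a byte order mark there removes the guesswork.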

Thu Jan 09, 2020 5:02 pm

Hi Richard,

Here's a hexdump, hope this format is ok?

Code:
00000000  ff fe 22 00 44 00 41 00  54 00 45 00 5f 00 53 00  |..".D.A.T.E._.S.|
00000010  54 00 41 00 52 00 54 00  22 00 2c 00 22 00 44 00  |T.A.R.T.".,.".D.|
00000020  41 00 54 00 45 00 5f 00  45 00 4e 00 44 00 22 00  |A.T.E._.E.N.D.".|


I believe I tried to define the fields as UNICODE before I tried UTF8 on Bob's suggestion, based on what ECL Watch was showing in the record definition, as it renders fine there.
SChatman85
 
Posts: 14
Joined: Mon Sep 02, 2019 2:13 pm
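
The hexdump above starts with `ff fe`, which is the UTF-16 little-endian byte order mark. As a small check outside HPCC, this Python sketch decodes exactly the 48 bytes shown in the dump; the variable names are made up.

```python
import codecs

# The three rows of the hexdump, copied byte for byte.
hexdump_bytes = bytes.fromhex(
    "ff fe 22 00 44 00 41 00 54 00 45 00 5f 00 53 00 "
    "54 00 41 00 52 00 54 00 22 00 2c 00 22 00 44 00 "
    "41 00 54 00 45 00 5f 00 45 00 4e 00 44 00 22 00 "
)

# The file begins with the UTF-16LE byte order mark.
has_le_bom = hexdump_bytes.startswith(codecs.BOM_UTF16_LE)

# The generic "utf-16" codec consumes the BOM and uses it to choose the byte order.
header_text = hexdump_bytes.decode("utf-16")

print(has_le_bom)
print(header_text)
```

The decoded text is the start of the quoted, comma-separated header row, so the raw file itself looks like well-formed UTF-16LE and the problem is more likely in how the sprayed data is being read.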

Thu Jan 09, 2020 5:54 pm

Stewart,
Did you try setting the LOCALE in the RECORD statement? What format did you use to spray it? UTF-16, UTF-32? Perhaps it's a matter of translation.

Bob
bforeman
Community Advisory Board Member
 
Posts: 1002
Joined: Wed Jun 29, 2011 7:13 pm
