Reading a UTF-16 file
Hi all,
I saw a thread from 2011 where it stated:
The system doesn't currently support direct reading of utf16be/le, utf32 files. However the file spray does allow you to convert to/from utf16 to utf8.
And please feel free to submit a feature request for directly reading utf16
Does anyone know if this is still the case, i.e. no support for UTF-16?
I have used encoding := 'utf16le' as part of my call to FileServices.SprayVariable (it seems our version doesn't support this in STD.File.SprayDelimited, even though the docs in that environment show it).
When I then try to read it from a DATASET definition the data is not clean. I have tried several options:
- Code:
DATASET('logical_file_name', layout, CSV(HEADING(1), SEPARATOR(',')))
DATASET('logical_file_name', layout, CSV(HEADING(1), SEPARATOR(','), UNICODE))
DATASET('logical_file_name', layout, CSV(HEADING(1), SEPARATOR(','), UNICODE16))
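For reference, the spray call described above might look roughly like this. This is a sketch only: the IP, path, group, and logical file name are hypothetical, and the exact parameter list of the legacy FileServices.SprayVariable differs between platform versions, so check the Standard Library Reference for yours.

```ecl
// Hedged sketch of a UTF-16LE delimited spray using the legacy
// FileServices interface mentioned above. All names are hypothetical.
FileServices.SprayVariable(
    '10.0.0.1',                       // landing-zone IP (hypothetical)
    '/mnt/landing/instalments.csv',   // source path (hypothetical)
    ,,,,                              // take defaults for record size, separator, terminator, quote
    'mythor',                         // destination group (hypothetical)
    '~demo::instalments::utf16',      // destination logical name (hypothetical)
    encoding := 'utf16le',            // convert from UTF-16LE during the spray
    allowoverwrite := TRUE);
```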
Last edited by SChatman85 on Fri Jan 10, 2020 9:02 am, edited 1 time in total.
- SChatman85
- Posts: 17
- Joined: Mon Sep 02, 2019 2:13 pm
Hi Stewart,
In the Language Reference Manual, there is support for UTF-16 using the UNICODE field value type in the RECORD structure. The ECL Watch allows delimited spraying using a variety of UTF options. There is also support for converting to/from UNICODE formats using the FROMUNICODE and TOUNICODE functions.
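As a quick illustration of those conversion functions, here is a sketch; the encoding name string ('UTF-16LE') is an assumption and should be checked against what your platform version accepts.

```ecl
// 'Instalment' encoded as UTF-16LE bytes (compare the dumps later in
// this thread: each ASCII character is followed by a 00 byte).
d := X'49006E007300740061006C006D0065006E007400';
u := TOUNICODE(d, 'UTF-16LE');      // DATA -> UNICODE
d2 := FROMUNICODE(u, 'UTF-16LE');   // UNICODE -> DATA (round trip)
OUTPUT(u);                          // should print: Instalment
OUTPUT(d2 = d);                     // TRUE if the round trip is lossless
```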
Specifically what are you trying to do?
Regards,
Bob
- bforeman
- Community Advisory Board Member
- Posts: 1006
- Joined: Wed Jun 29, 2011 7:13 pm
Hi Bob,
Thanks for the reply. I'm just looking to read a file which is provided in the format of:
Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators
I added the encoding value as mentioned above, and the data looks fine when viewing the sprayed file in ECL Watch, but if I put it into a DATASET definition and OUTPUT it, I get extra characters, which I assume are the extra bytes.
Octal dump of raw file:
0003760 " \0 I \0 n \0 s \0 t \0 a \0 l \0 m \0
0004000 e \0 n \0 t
ECL IDE output of the dataset:
"Instalment
Expected output: Instalment
Hope that makes sense.
- SChatman85
I added the encoding value as mentioned above, and the data looks fine when viewing the sprayed file in ECL Watch, but if I put it into a DATASET definition and OUTPUT it, I get extra characters, which I assume are the extra bytes.
What is the version of the HPCC cluster you are using? Can you look at the ECL tab of the sprayed file and see the RECORD structure generated? If you use that RECORD with your DATASET how does the OUTPUT look?
Bob
- bforeman
SChatman85,
ECL IDE output of the dataset:
"Instalment
Expected output: Instalment
Can you show us your ECL definition of the RECORD structure and DATASET declaration that produced this result, please?
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Hi Bob,
1. Our Production cluster is currently running on 6.4.38.
2. The definition is showing in ECL watch as:
- Code:
RECORD
UTF8 field1;
UTF8 field2;
...
END;
I tried using this as my layout in the DATASET definition, but it made no difference.
Hi Richard.
Starting with the DATASET definition I have been trying with the following variations:
- Code:
DATASET( 'logical_filename',input_lay, CSV( HEADING(1),SEPARATOR(','), TERMINATOR(['\n', '\r\n']), MAXLENGTH(40000)))
- Code:
DATASET( 'logical_filename',input_lay, CSV( HEADING(1),SEPARATOR(','), TERMINATOR(['\n', '\r\n']), UNICODE, MAXLENGTH(40000)))
I have now tried the 3 following Layout definitions:
- Code:
input_lay := RECORD
UTF8 field1;
UTF8 field2;
UTF8 field3;
...
END;
- Code:
input_lay := RECORD
STRING field1;
STRING field2;
STRING field3;
...
END;
- Code:
input_lay := RECORD
UNICODE field1;
UNICODE field2;
UNICODE field3;
...
END;
- SChatman85
Hi Stewart,
In the DATASET, the SEPARATOR and TERMINATOR options are probably not needed since you are using the Delimited spray defaults.
Try adding a locale to the UTF8 field in the RECORD (or specify the proper Locale in the RECORD statement itself). According to the docs:
The optional locale specifies a valid unicode locale code, as specified in ISO standards 639 and 3166 (not needed if LOCALE is specified on the RECORD structure containing the field definition).
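Per the quoted passage, the locale can go in either place. A minimal sketch follows; the locale code 'en' is just an example, and the exact spelling of the locale-suffixed type should be checked against the Language Reference.

```ecl
// Hedged sketch of the two placements the docs describe.
lay1 := RECORD
    UNICODE_en field1;        // locale attached to the field type
END;

lay2 := RECORD, LOCALE('en')
    UNICODE field1;           // locale inherited from the RECORD structure
END;
```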
If you are still having trouble reading the file, I would suggest submitting a JIRA with all of the details and perhaps some sample data if possible. If you are reading the data in the ECL Watch properly, but not in the ECL IDE, there could be an issue there. I would also try the ECL command line and see what your result looks like in the console.
Regards,
Bob
- bforeman
Stewart,
Octal dump of raw file:
0003760 " \0 I \0 n \0 s \0 t \0 a \0 l \0 m \0
0004000 e \0 n \0 t
Could you post a hex dump of this data? Your octal dump looks to me like it's showing 2 bytes per character, with every second byte a hex 00. If that's the case, then instead of defining the fields with UTF8 I'd suggest you try using UNICODE.
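Richard's suggestion as a sketch: the logical name is hypothetical, the field names are taken from the header row ("DATE_START", "DATE_END") visible in the dumps, and whether the CSV UNICODE option matches your file's encoding is worth checking against your version's docs.

```ecl
// Hedged sketch: define the fields as UNICODE rather than UTF8.
in_lay := RECORD
    UNICODE date_start;       // from the "DATE_START" header column
    UNICODE date_end;         // from the "DATE_END" header column
    // ...
END;
ds := DATASET('~demo::instalments::utf16', in_lay,
              CSV(HEADING(1), SEPARATOR(','), UNICODE));
```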
HTH,
Richard
- rtaylor
Hi Richard,
Here's a hexdump, hope this format is ok?
- Code:
00000000 ff fe 22 00 44 00 41 00 54 00 45 00 5f 00 53 00 |..".D.A.T.E._.S.|
00000010 54 00 41 00 52 00 54 00 22 00 2c 00 22 00 44 00 |T.A.R.T.".,.".D.|
00000020 41 00 54 00 45 00 5f 00 45 00 4e 00 44 00 22 00 |A.T.E._.E.N.D.".|
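Decoding the first few byte pairs of that dump by hand shows where a stray leading character can come from. A sketch follows; whether TOUNICODE strips the BOM, and whether the encoding name is accepted, may depend on the platform version.

```ecl
// First bytes of the dump, read as little-endian 16-bit code units:
//   ff fe -> U+FEFF byte-order mark (BOM)
//   22 00 -> U+0022 '"'
//   44 00 -> U+0044 'D'
// If the BOM or the quoting is not consumed by the CSV reader, it
// surfaces as an extra leading character, as in the IDE output above.
hdr := X'FFFE22004400410054004500';
OUTPUT(TOUNICODE(hdr, 'UTF-16LE'));
```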
I believe I tried defining it as UNICODE before I tried UTF8 on Bob's suggestion, based on what ECL Watch was showing in the record definition, as it renders fine there.
- SChatman85
Stewart,
Did you try setting the LOCALE in the RECORD statement? What format did you use to spray it? UTF-16, UTF-32? Perhaps it's a matter of translation.
Bob
- bforeman