Efficient Value Type Usage

Architecting data structures is an art that can make a big difference in ultimate performance and data storage requirements. Despite the extensive resources available in the clusters, saving a byte here and a couple of bytes there can be important -- even in a Big Data massively parallel processing system, resources are not infinite.

Numeric Data Type Selection

Selecting the right type to use for numeric data depends on whether the values are integers or contain fractional portions (floating point data).

Integer Data

When working with integer data, you should always specify the exact size of INTEGERn or UNSIGNEDn that is required to hold the largest number possible for that particular field. This will improve execution performance and compiler efficiency because the default integer data type is INTEGER8 (also the default for Attributes with integer expressions).
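As a minimal sketch of the point above, the two definitions below differ only in whether a value type is given. The attribute names are hypothetical and chosen purely for illustration.

```ecl
// Hypothetical attribute names, for illustration only.

// No value type given: per the text above, an integer expression
// defaults to INTEGER8 (eight bytes).
MaxAgeDefault := 120;

// Explicit one-byte unsigned type; 120 fits in the range 0 to 255.
UNSIGNED1 MaxAgeSized := 120;

OUTPUT(MaxAgeDefault);
OUTPUT(MaxAgeSized);
```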

The following table lists the range of values that each size can hold, as a signed INTEGERn and as an UNSIGNEDn:

Type       Signed                                                     Unsigned
INTEGER1   -128 to 127                                                0 to 255
INTEGER2   -32,768 to 32,767                                          0 to 65,535
INTEGER3   -8,388,608 to 8,388,607                                    0 to 16,777,215
INTEGER4   -2,147,483,648 to 2,147,483,647                            0 to 4,294,967,295
INTEGER5   -549,755,813,888 to 549,755,813,887                        0 to 1,099,511,627,775
INTEGER6   -140,737,488,355,328 to 140,737,488,355,327                0 to 281,474,976,710,655
INTEGER7   -36,028,797,018,963,968 to 36,028,797,018,963,967          0 to 72,057,594,037,927,935
INTEGER8   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807    0 to 18,446,744,073,709,551,615

For example, if you have data coming in from the "outside world" where a 4-byte integer field contains values that range from zero (0) to ninety-nine (99), then it makes sense to move that data into an UNSIGNED1 field. This saves three bytes per record, which, if the dataset is fairly large (say, 10 billion records, for a saving of roughly 30 GB), can translate into considerable savings on disk storage requirements.
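The narrowing described here can be sketched as a simple PROJECT from the incoming layout to a slimmer one. The record names, the field name score, and the inline sample data are all assumptions made for illustration; real input would normally come from a file.

```ecl
// Hypothetical layouts; the field name "score" is an assumption.
InRec := RECORD
  INTEGER4 score;     // as delivered: 4 bytes, values 0 to 99
END;

SlimRec := RECORD
  UNSIGNED1 score;    // 1 byte is enough for 0 to 255
END;

// Inline sample data standing in for the incoming dataset.
incoming := DATASET([{0}, {42}, {99}], InRec);

SlimRec Narrow(InRec L) := TRANSFORM
  SELF.score := L.score;   // implicit cast from INTEGER4 to UNSIGNED1
END;

slim := PROJECT(incoming, Narrow(LEFT));

OUTPUT(slim);
OUTPUT(SIZEOF(InRec));     // size of the incoming record, in bytes
OUTPUT(SIZEOF(SlimRec));   // size of the narrowed record, in bytes
```

The narrowing is only safe when every incoming value is known to fit the smaller type, so it is worth checking the actual range of the data before choosing the field size.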

One advantage ECL has over other languages is the richness of its integer types. By allowing you to select the exact number of bytes (in the range of one to eight), you can tailor your storage requirements to the exact range of values you need to store, without wasting extra bytes.

Note that the BIG_ENDIAN forms of the integer types are meant only for defining data as it comes in from, and goes back out to, the "outside world". All integer data used internally must be in LITTLE_ENDIAN format. The BIG_ENDIAN format exists specifically for interfacing with external data sources.
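One way to keep BIG_ENDIAN at the boundary is to declare it only in the external layout and project into a default (little-endian) layout for all internal work. This is a sketch under assumptions: the record and field names are hypothetical, and the inline dataset stands in for what would normally be a read of an externally produced file.

```ecl
// Hypothetical layouts, for illustration only.
ExternalRec := RECORD
  BIG_ENDIAN UNSIGNED4 id;   // byte order as supplied by the external source
END;

InternalRec := RECORD
  UNSIGNED4 id;              // default little-endian form for internal use
END;

// Inline data standing in for an external file.
external := DATASET([{1}, {256}], ExternalRec);

// Convert once, at the point of entry.
internal := PROJECT(external,
                    TRANSFORM(InternalRec, SELF.id := LEFT.id));

OUTPUT(internal);
```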

Floating Point Data

When using floating point types, you should always specify the exact size of the REALn required to hold the largest (and/or smallest) number possible for that particular field. This will improve execution performance and compiler efficiency because REAL defaults to REAL8 (eight bytes) unless otherwise specified. REAL values are stored internally in IEEE signed floating point format; REAL4 is the 32-bit format and REAL8 is the 64-bit format.

The following table defines the number of significant digits of precision and the largest and smallest values that can be represented as REAL (floating point) values:

Type    Significant Digits      Largest Value    Smallest Positive Value
REAL4    7 (9999999)            3.402823e+038    1.175494e-038
REAL8   15 (999999999999999)    1.797693e+308    2.225074e-308
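To make the precision limits concrete, the sketch below stores the same nine-digit value in both sizes. The attribute names and the NAMED result labels are hypothetical. Because REAL4 carries only about seven significant digits, the REAL4 copy cannot hold all nine digits exactly, while the REAL8 copy can.

```ecl
// Hypothetical attribute names, for illustration only.
REAL4 asReal4 := 123456789.0;   // nine digits: more than REAL4 can hold exactly
REAL8 asReal8 := 123456789.0;   // well within REAL8's fifteen digits

OUTPUT(asReal4, NAMED('AsReal4'));
OUTPUT(asReal8, NAMED('AsReal8'));
```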

If you need more than fifteen significant digits in your calculations, then you should consider using the DECIMAL type. If all components of an expression are DECIMAL types then the result is calculated using BCD math libraries (performing base-10 math instead of floating point's base-2 math). This gives you the capability of achieving up to thirty-two digits of precision, if needed. By using base-10 math, you also eliminate the rounding issues that are common to floating point math.
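A minimal sketch of an all-DECIMAL expression follows. The attribute names and the choice of DECIMAL10_2 (ten digits, two of them after the decimal point) are assumptions for illustration. The starting values are cast from strings here so that no floating point value is involved at any step.

```ecl
// Hypothetical attribute names, for illustration only.
DECIMAL10_2 price := (DECIMAL10_2)'19.99';
DECIMAL10_2 tax   := (DECIMAL10_2)'1.65';

// Every operand is a DECIMAL, so per the text above the sum is
// computed with base-10 (BCD) math rather than floating point.
DECIMAL10_2 total := price + tax;

OUTPUT(total);
```

If any operand in the expression were a REAL, the all-DECIMAL condition described above would no longer hold, so it is worth keeping currency-style calculations in DECIMAL from end to end.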