Purpose
To define a partition scheme that can be applied to a target.
Format
/PARTITION | distribution |
where
distribution | { by_size_only | by_count_only | by_condition | by_field } |
by_size_only | size_constraint |
by_count_only | count_constraint [ROUNDROBIN maximum_size_per_round] |
size_constraint | PARTITIONSIZE [COMPRESSED] maximum_partition_size |
count_constraint | PARTITIONCOUNT number_of_partitions |
by_condition | BYCONDITION condition [,condition…] [UNSELECTEDRECORDS partition_suffix] [NOEMPTY] |
by_field | BYFIELD [size_constraint | count_constraint] |
maximum_size_per_round | maximum_size |
maximum_partition_size | maximum_size |
maximum_size | r_number RECORDS | b_number BYTES | k_number KILOBYTES | m_number MEGABYTES | g_number GIGABYTES | t_number TERABYTES |
Arguments
x_number |
A positive non-zero integer that indicates the maximum amount of data, in one of the following units, which is written either in total or in each round to each partition: records - r_number bytes - b_number kilobyes - k_number megabytes - m_number gigabytes - g_number terabytes - t_number |
number_of_partitions | A positive non-zero integer that indicates the exact number of partitions that will be created. Depending on the data and other options, some partitions may be empty. |
condition | The name of a condition defined in a /CONDITION option or a level 88 in a COBOL copybook supplied through a /DATADICTIONARY option, that is evaluated to determine whether a record gets selected into a given partition. |
partition_suffix | A character string that is the suffix to be added to the target file name. The resulting file name is used for the partition where all records that are not selected by any of the conditions are put. |
Location
This option may appear anywhere in the task definition.
Notes
The /PARTITION option is applicable to the last output file that is listed before the partition statement.
Target partitioning is not supported for XML, Indexed, Micro Focus, Tableau Data Extract, or QlikView data eXchange files.
For Apache Avro and Apache Parquet file formats, target partitioning is supported only with the following distributions (see the syntax rules above for definitions):
- by_count_only
- by_condition
- by_field, with count_constraint option only
Distribution by size only
A partition scheme based on size only is one constrained by the maximum size of each partition.
The maximum_partition_size determines the maximum amount of data that can be written to each partition. As records are not broken across partitions, the amount of data written to a given partition might be less than the maximumum amount specified. A minimum of 1 record is written to each partition.
Distribution by count only
A partition scheme based on count only is one constrained by the number of partitions.
The number_of_partitions determines the exact number of partitions that are created. As records are not broken across partitions, the amount of data written to a given partition might be less than the maximumum amount specified. Depending on the amount of data, some partitions might be empty. A minimum of 1 record is written to each partition.
If number_of_partitions is not specified, Connect ETL applies a default value, which could change between multiple executions of the task.
The maximum_size_per_round determines the amount of data that is written to a partition before moving to the next partition.
Distribution by condition
A partition scheme based on condition is one in which a condition is specified for each partition. The number of conditions determines the number of partitions. For each target record, the conditions are evaluated in sequence to determine whether the record goes into the corresponding partition. Once a condition is true for a record, subsequent conditions are not evaluated.
Records that are not selected by any of the specified conditions are dropped unless the UNSELECTEDRECORDS keyword is specified. In this case, they go into a separate partition.
Empty partitions are not created when the NOEMPTY keyword is specified.
Distribution by field
A partition scheme based on field is one in which all records with the same field values are distributed to the same partition. Field based partitioning can be used with sort and merge applications only. The sort and merge field values are used to partition the file. Field based distribution cannot be specified when the task includes the /SUMMARIZE option.
When PARTITIONCOUNT is specified, Connect ETL attempts to distribute the sets of records with equal field values such that each partition has approximately the same number of records.
When PARTITIONSIZE is specified, Connect ETL attempts to create partitions of the specified size, while keeping all records with the same field values together.
When neither PARTITIONCOUNT nor PARTITIONSIZE is specified, each partition contains the records for one set of sort or merge field values.
Compressed Size
When the COMPRESSED attribute is specified, target compressed size is considered when partitioning output files.
Examples
/OUTFILE Customers.dat STLF
/PARTITION partitionsize 200 megabytes
/OUTFILE Customers.dat STLF
/PARTITION partitioncount 20 roundrobin 250 kilobytes
/OUTFILE Customers.dat STLF
/PARTITION conditional bycondition IsCountryUSA, IsCountryNotUSA unselectedrecords
CountryUnknown noempty
/OUTFILE Customers.dat STLF
/PARTITION SortFieldBased byfield partitioncount 10
The target file Customers.dat is split into 10 partitions named Customers_1.dat, Customers_2.dat etc. where all the records with the same sort or merge field values are sent to the same partition. Connect ETL tries to distribute data such that the partitions are approximately the same size.