Imports delimited data, for example a comma-separated values file (CSV), from an input data source.
You can import delimited data from an external file, or you can connect an input node that contains delimited data.
The CSV/Delimited Input node has two default output pins and one optional output pin. Any invalid records are listed on the errors pin; all other records are listed on the output pin.
You can add a third output pin when you want to detect the CSV format of the input data. If you have added a third output pin, you can choose to only output data on the third pin without processing the records by setting the DetectOnly property to True. When you have configured the node in this way, only the third output pin will contain any data, showing the detected file information. The detected file information includes a number of fields, with the information for each input file shown on a new record.
The InputIdentifier column shows the input file number, where the first file is listed as 0.
The CharacterSetConfidence column displays a number between 0 and 100 to show how confident the system is in the chosen character set, where 100 is the most confident. If the FileCharacterSet property is not set to AutoDetect, this column will show a NULL value.
Importing delimited data from a file
- Drag a CSV/Delimited Input node onto the canvas.
- By default, the (from Filename) variant of the Data property is selected. Click the folder icon to browse to the delimited file that you want to import, for example, C:/Users/<username>/Documents/csv-example.csv on Windows, or /home/<user>/documents/csv-example.csv on Linux.

  Note: If you are using the server edition and want to use a file on your local machine, you must first upload it to the Data360 Analyze server. See Upload data files for more details.
- Run the CSV/Delimited Input node to import your data.
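The basic operation that the node performs, reading a header row and data rows from delimited text, can be sketched in plain Python. This is an illustration of the concept, not the node's implementation; an in-memory string stands in for the example file.

```python
import csv
import io

# In-memory stand-in for a file such as csv-example.csv (illustrative content;
# a real run would open the file on disk instead).
sample = ("Planet,Distance from the sun (AU),Diameter (KM)\n"
          "Mercury,0.39,4879\n"
          "Venus,0.73,12104\n")

def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

records = read_delimited(sample)
print(records[0]["Planet"])  # Mercury
```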
Importing delimited data from more than one file
In this example, you have two CSV files stored in a directory. The CSV files contain related data, but not all fields are the same. To import the contents of these two files, you can use a Directory List node to point to the directory and connect this node as an input to the CSV/Delimited Input node.
The first file, csv-example.csv, contains the following data:

| Planet | Distance from the sun (AU) | Diameter (KM) |
|---|---|---|
| Mercury | 0.39 | 4879 |
| Venus | 0.73 | 12104 |
| Earth | 1 | 12742 |
| Mars | 1.38 | 6779 |
| Jupiter | 5.2 | 139822 |
| Saturn | 9.58 | 116464 |
| Uranus | 19.22 | 50724 |
| Neptune | 30.1 | 49244 |

The second file, csv-example2.csv, contains the following data:

| Planet | Type |
|---|---|
| Mercury | Terrestrial |
| Venus | Terrestrial |
| Earth | Terrestrial |
| Mars | Terrestrial |
| Jupiter | Gas |
| Saturn | Gas |
| Uranus | Gas |
| Neptune | Gas |
- Drag a Directory List node onto the canvas.
- In the DirectoryName property, type the file path to the folder that contains the CSV files that you want to import, for example, C:/Users/<username>/Documents on Windows, or /home/<user>/documents on Linux.
- In the Pattern property, specify the file name and file type pattern against which to match the files in the directory. In this example, you want to import all CSV files from the specified directory, so you would type: *.csv

  Tip: If the data that you want to import is located in different directories, or is contained within a mixture of file types, you could use the Create Data node to list the file paths.
- Drag a CSV/Delimited Input node onto the canvas and connect it to the output of the Directory List node.
- Select the (from Filename Field) variant of the Data property.
- Type the name of the input field containing the list of files that you want to import. In this case, it is the default field name on the Directory List node: FileName
- In this example, the fields are not the same across both files and we want to include all fields in the output, so on the ConcatenationMode property, select the union option.
- Run the CSV/Delimited Input node to import and merge your data from the two CSV files.
The output of the node is as follows:
| Planet | Distance from the sun (AU) | Diameter (KM) | Type | FileName |
|---|---|---|---|---|
| Mercury | 0.39 | 4879 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Venus | 0.73 | 12104 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Earth | 1 | 12742 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Mars | 1.38 | 6779 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Jupiter | 5.2 | 139822 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Saturn | 9.58 | 116464 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Uranus | 19.22 | 50724 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Neptune | 30.1 | 49244 | NULL | C:/Users/<username>/Documents/csv-example.csv |
| Mercury | NULL | NULL | Terrestrial | C:/Users/<username>/Documents/csv-example2.csv |
| Venus | NULL | NULL | Terrestrial | C:/Users/<username>/Documents/csv-example2.csv |
| Earth | NULL | NULL | Terrestrial | C:/Users/<username>/Documents/csv-example2.csv |
| Mars | NULL | NULL | Terrestrial | C:/Users/<username>/Documents/csv-example2.csv |
| Jupiter | NULL | NULL | Gas | C:/Users/<username>/Documents/csv-example2.csv |
| Saturn | NULL | NULL | Gas | C:/Users/<username>/Documents/csv-example2.csv |
| Uranus | NULL | NULL | Gas | C:/Users/<username>/Documents/csv-example2.csv |
| Neptune | NULL | NULL | Gas | C:/Users/<username>/Documents/csv-example2.csv |
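The union merge behavior described above can be sketched in Python. The two in-memory strings below are stand-ins for the files found by the Directory List node; the sketch shows the concept, not the node's implementation.

```python
import csv
import io

# Stand-ins for the two CSV files (subset of the example data for brevity).
file1 = ("Planet,Distance from the sun (AU),Diameter (KM)\n"
         "Mercury,0.39,4879\nVenus,0.73,12104\n")
file2 = "Planet,Type\nMercury,Terrestrial\nJupiter,Gas\n"

def union_concat(sources):
    """Merge records from several CSV sources, filling missing fields with None."""
    all_fields, all_records = [], []
    for text in sources:
        for record in csv.DictReader(io.StringIO(text)):
            for name in record:
                if name not in all_fields:
                    all_fields.append(name)
            all_records.append(record)
    # Normalise every record to the union of all field names seen.
    return [{name: rec.get(name) for name in all_fields} for rec in all_records]

merged = union_concat([file1, file2])
print(merged[2]["Type"])  # Terrestrial
```

Records from the second file report None (NULL) for the Distance and Diameter fields they never contained, mirroring the output table above.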
Importing delimited data from a Create Data node
Example 1 - All input data is contained in a single field
You have the following data in a Create Data node:
Planet data:string"Planet:string,Distance from the sun (AU):double,Diameter (KM):int\nMercury,0.39,4879\nVenus,0.73,12104\nEarth,1,12742\nMars,1.38,6779\nJupiter,5.2,139822\nSaturn,9.58,116464\nUranus,19.22,50724\nNeptune,30.1,49244"
- Drag a CSV/Delimited Input node onto the canvas and connect it to the output of the Create Data node.
- On the CSV/Delimited Input node, select the (from Data Field) variant of the Data property.
- Click the menu button in the Data property and, from the Input Fields menu, select the field that contains the input data, in this case Planet data.
- On the PassThroughFields property, select None, as we do not want to include Planet data as a field in the output.
- On the HeaderMode property, select Typed. The first record is used to form the output field names, and setting this property to Typed means the data type of each field is also read from the header.
- Run the CSV/Delimited Input node.
The node outputs eight records:

| Planet (string) | Distance from the sun (AU) (double) | Diameter (KM) (int) |
|---|---|---|
| Mercury | 0.39 | 4879 |
| Venus | 0.73 | 12104 |
| Earth | 1 | 12742 |
| Mars | 1.38 | 6779 |
| Jupiter | 5.2 | 139822 |
| Saturn | 9.58 | 116464 |
| Uranus | 19.22 | 50724 |
| Neptune | 30.1 | 49244 |
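The Typed header handling can be sketched in Python. The name:type split and the cast table below are an illustration of the concept only, not the node's actual implementation.

```python
import csv
import io

# The typed-header data from the Create Data node example (truncated).
data = ("Planet:string,Distance from the sun (AU):double,Diameter (KM):int\n"
        "Mercury,0.39,4879\nVenus,0.73,12104\n")

def parse_typed(text):
    """Split a name:type header, then cast each value to its declared type."""
    casts = {"string": str, "double": float, "int": int}
    rows = list(csv.reader(io.StringIO(text)))
    names, types = zip(*(col.rsplit(":", 1) for col in rows[0]))
    return [{n: casts[t](v) for n, t, v in zip(names, types, row)}
            for row in rows[1:]]

records = parse_typed(data)
print(records[0]["Diameter (KM)"])  # 4879
```

Because the header declared Diameter (KM) as int, the value arrives as a number rather than a string.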
Example 2 - Extracting headers from an input field
You have the following data in a Create Data node:
| Header | Data |
|---|---|
| "Product,Unit_Price,Quantity" | "Coffee,2.50,5" |
| "Product,Unit_Price" | "Tea,2.20" |
| "Product,Unit_Price" | "Chocolate,2.80" |
| "Product,Unit_Price" | "Orange Juice,3.10" |
The "Header" field contains the values that you want to convert to field names in the output, and the "Data" field contains the data.
- Drag a CSV/Delimited Input node onto the canvas and connect it to the output of the Create Data node.
- On the CSV/Delimited Input node, select the (from Data Field) variant of the Data property.
- Click the menu button in the Data property and, from the Input Fields menu, select the field that contains the input data, in this case Data.
- On the PassThroughFields property, select None, as we do not want to include Header and Data as fields in the output.
- Select the (from Field) variant of the FieldNames property, then select the field that contains the header names, in this case Header.
- As the first row of data contains an additional "Quantity" value that is not found in the other records, set the ConcatenationMode property to union so that the Quantity field is output and NULL values are shown for all records that do not contain a value.
- Run the CSV/Delimited Input node.
The node outputs four records with the values from the "Header" field used as the field names:

| Product (string) | Unit_Price (string) | Quantity (string) |
|---|---|---|
| Coffee | 2.50 | 5 |
| Tea | 2.20 | NULL |
| Chocolate | 2.80 | NULL |
| Orange Juice | 3.10 | NULL |
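The headers-from-field behavior can be sketched in Python: each record carries its own header line, and the results are merged union-style. This is a conceptual illustration, not the node's implementation.

```python
import csv
import io

# Stand-ins for the Create Data node's Header and Data fields (truncated).
pairs = [
    ("Product,Unit_Price,Quantity", "Coffee,2.50,5"),
    ("Product,Unit_Price", "Tea,2.20"),
]

def split_csv_line(line):
    """Parse a single CSV line into a list of values."""
    return next(csv.reader(io.StringIO(line)))

def records_from_field(pairs):
    """Build one record per row, using the Header field as the field names."""
    records, all_fields = [], []
    for header, data in pairs:
        names, values = split_csv_line(header), split_csv_line(data)
        for name in names:
            if name not in all_fields:
                all_fields.append(name)
        records.append(dict(zip(names, values)))
    # Union-style merge: fields missing from a record become None.
    return [{name: rec.get(name) for name in all_fields} for rec in records]

out = records_from_field(pairs)
print(out[1]["Quantity"])  # None
```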
Properties
Data
Specify the data that you want to import.
Choose the (from Filename) variant of this property to specify the name of the file containing the data.
Choose the (from Filename Field) variant of this property to specify the name of an input field containing the file names.
Choose the (from Data Field) variant of this property to specify the name of the input field containing the data.
A value is required for this property.
FileCharacterSet
Optionally specify the appropriate character set of the CSV data to process.
Select one of the provided character sets, or type the name of a character set if the one that you require is not in the dropdown list.
Choose the (from Field) variant of this property to specify the name of an input field containing the character set.
The default value is Unicode BOM.
All of the options in the dropdown are defined character sets except for Unicode BOM and AutoDetect.
Unicode BOM is a special case which checks for a Byte-Order Mark (BOM) and if present will use the character set indicated by the BOM, otherwise will default to use UTF-8.
AutoDetect is a special case used for automatically detecting the character set of the input data.
This property is ignored if the (from Data Field) option is selected for the Data property.
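The Unicode BOM rule can be sketched in Python: inspect the leading bytes for a Byte-Order Mark and fall back to UTF-8 when none is present. The BOM byte sequences are standard Unicode values; the function itself is an illustration, not the node's implementation.

```python
import codecs

# UTF-32 BOMs must be checked before UTF-16, because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_charset(raw: bytes) -> str:
    """Return the character set implied by a BOM, defaulting to UTF-8."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return "utf-8"

print(detect_charset(codecs.BOM_UTF8 + b"Planet,Type"))  # utf-8-sig
print(detect_charset(b"Planet,Type"))                    # utf-8
```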
Format
Optionally specify the format of the input. You need to specify a format when the input data is not in standard CSV format as defined by RFC 4180; for example, set this property if the data you are working with is tab-delimited. By default, the node supports RFC 4180, but also allows comment lines and empty lines, and ignores unquoted spaces surrounding fields.
Choose from:
- AutoDetect - The node will detect the CSV format of the input data. You can add a third output pin to output the detected format information. If this option is selected, the DetectOnly property can be set to True to only detect the format and not process the records.
- Excel - CSV file format when a file is saved as CSV from Microsoft Excel.
- InformixUnload - Default Informix CSV UNLOAD format used by the UNLOAD TO file_name operation.
- InformixUnloadCsv - Default Informix CSV UNLOAD format used by the UNLOAD TO file_name operation (escaping is disabled).
- RFC 4180 - Comma separated format as defined by RFC 4180.
- TDF - Tab-delimited format.
Choose the (from Field) variant of this property to specify the name of an input field containing the CSV format.
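The Excel and TDF options correspond closely to Python's built-in "excel" and "excel-tab" csv dialects, which can illustrate the difference; the exact rules the node applies may differ in detail.

```python
import csv
import io

# The same record in Excel-style CSV and in tab-delimited (TDF) form.
excel_row = 'Mercury,"0.39",4879\r\n'
tdf_row = "Mercury\t0.39\t4879\r\n"

excel = next(csv.reader(io.StringIO(excel_row), dialect="excel"))
tdf = next(csv.reader(io.StringIO(tdf_row), dialect="excel-tab"))
print(excel)  # ['Mercury', '0.39', '4879']
print(tdf)    # ['Mercury', '0.39', '4879']
```

Both dialects yield the same parsed values; only the delimiter (and quoting conventions) differ.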
FieldDelimiter
Optionally specify the field delimiter. The field delimiter is the sequence of character(s) used to separate fields in the CSV data to be imported. For example, use \t as the field delimiter for a file with fields that are separated by tabs. If the fields are separated by commas, enter the comma character "," (enter the character only, with no quotes).
The default value is dependent on the value set in the Format property.
Any number of characters can be used to identify the field delimiter.
Choose the (from Field) variant of this property to specify the name of an input field containing the field delimiter.
RecordDelimiter
Optionally specify the record delimiter. The record delimiter is the character(s) used to separate records in the CSV data to be imported. For example, use \n as the record delimiter for a file with records that are separated by newlines. If the records are separated by question marks, enter the question mark character "?" (just enter the character, with no quotes).
The default value is dependent on the value set in the Format property.
Up to 2 characters can be used to identify the record delimiter.
Choose the (from Field) variant of this property to specify the name of an input field containing the record delimiter.
EscapeCharacter
Optionally specify an escape character (for example backslash \).
The default value is dependent on the value set in the Format property.
Choose the (from Field) variant of this property to specify the name of an input field containing the escape character.
QuoteCharacter
Optionally specify a quote character.
The default value is dependent on the value set in the Format property.
Choose the (from Field) variant of this property to specify the name of an input field containing the quote character.
HeaderMode
Optionally specify the Header mode to be used when reading the input file.
The options are:
- None: choose this when there is no header on the input file.
- Typed: choose this when there is a header on the input file and it declares the type of each column using the format name:type, for example, Planet:string.
- UnTyped: choose this when there is a header on the input file but there is no type information for each column.
The default value is UnTyped.
Choose the (from Field) variant of this property to specify the name of an input field containing the header mode.
DateFormat
Optionally specify the format of the incoming Date fields.
The default format is based on the system locale.
Choose the (from Field) variant of this property to specify the name of an input field containing the date format of date typed input fields.
TimeFormat
Optionally specify the format of the incoming Time fields.
Choose the (from Field) variant of this property to specify the name of an input field containing the time format of time typed input fields.
DateTimeFormat
Optionally specify the format of the incoming Datetime fields.
Choose the (from Field) variant of this property to specify the name of an input field containing the datetime format of datetime typed input fields.
NullMode
Optionally specify what values on the incoming dataset will be read in as Null values.
Choose from the following options:
- Empty Value Is Null: Null will be used where two delimiters in a row with no quoting are found.
- Quoted Empty Value Is Null: Null will be used where two identical quoting characters in a row are found.
- Custom Value is Null: Null will be used where the custom value, as specified in the "NullString" property is found.
- Custom Quoted Value is Null: Null will be used where the custom value, as specified in the "NullString" property is found within quotes.
- None: There is no processing of Nulls; nothing will be treated as a Null value.
Choose the (from Field) variant of this property to specify the Null Mode from an input field.
NullString
Specify the string that will be used to identify NULL values when processing the file.
Used when "NullMode" is set to either:
- Custom Value is Null
- Custom Quoted Value is Null
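The "Custom Value is Null" mode can be sketched in Python: after parsing, any field value equal to the NullString becomes None. The value "N/A" below is a hypothetical choice of NullString, and the sketch illustrates the concept only.

```python
import csv
import io

# Sample data where "N/A" (hypothetical NullString) marks missing values.
data = "Planet,Type\nMercury,Terrestrial\nJupiter,N/A\n"

def apply_null_string(text, null_string):
    """Replace any field value equal to null_string with None after parsing."""
    rows = csv.DictReader(io.StringIO(text))
    return [{k: (None if v == null_string else v) for k, v in row.items()}
            for row in rows]

records = apply_null_string(data, "N/A")
print(records[1]["Type"])  # None
```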
ConcatenationMode
Optionally specify how to merge data. Choose from:
- Union - Outputs all of the fields across the CSV data sources, filling in nulls for fields not present in a given input.
- Exact - The fields for all processed CSV data sources must be the same.
- Intersection - Outputs all the fields common to the CSV data sources.
The default value is Exact.
PassThroughFields
Optionally specify which input fields will "pass through" the node unchanged from the input to the output, assuming that the input exists. The input fields specified will appear on those output records which were produced as a result of the input fields. Choose from:
- All - Passes through all input data fields to the output.
- None - No input data fields are passed through to the output; as such, only the fields created by the node appear on the output.
- Used - Passes through all the fields that the node used to create the output. Used in this case will be the field referenced in the Data property when using the Filename Field or Data Field option.
- Unused - Passes through all fields that the node did not use to create the output.
The default value is Used.
UnescapedQuoteHandling
Optionally specify how to handle unescaped quote values found in an input record. Choose from:
- Error - When an unescaped quote is found, the current record will be written to the error output.
- Skip value - When an unescaped quote is found, the current value will be skipped and a Null value will be output.
- Stop at Closing Quote - When an unescaped quote is found, the value is treated as a quoted value. The field value ends when a closing quote is found.
- Stop at Delimiter - When an unescaped quote is found, the value is treated as an unquoted value. All characters will then be accumulated, until the delimiter or a line ending is found.
- Back to Delimiter - When an unescaped quote is found, the value is treated as an unquoted value. All characters will then be accumulated until the delimiter is found. Where no delimiter is found in the value, then characters will continue to accumulate until a delimiter or line ending is found.
The default is set to Error.
IgnoreSurroundingSpaces
Optionally specify whether spaces around values are ignored when parsing input.
The default value is dependent on the value set in the Format property.
Choose the (from Field) variant of this property to specify the name of an input field containing a True or False value to determine whether surrounding spaces should be ignored.
FieldNames
Optionally specify a comma-separated list of names to be used as the output field names.
Choose the (from Field) variant of this property to specify the name of an input field containing the field names.
By default, the node assumes that the first line of the input CSV data is the header record and its contents are used to derive the output field names.
When using the (from Field) variant of this property, the Data property must use either the (from Filename Field) or (from Data Field) variant.
UseFieldNames
Optionally specify when to apply the field names specified in the FieldNames property.
Choose the (from Field) variant of this property to specify the name of an input field containing the specification.
Choose from:
- If Set - The value in the FieldNames property is always used if there is any value in that property.
- If No Output Fields - The value in the FieldNames property is only used if it is set, and there was no data in any of the files to process - resulting in no output fields on the node.
The default value is If Set.
SkipHeaderRows
Optionally specify the number of lines in the CSV data to skip prior to attempting to parse the CSV data.
Choose the (from Field) variant of this property to specify the name of an input field containing the number of header rows that should be skipped for the corresponding CSV source.
The default value is 0.
DetectOnly
Optionally specify whether or not the node should only detect the CSV format and character set of the input data.
The default value is False.
If set to True, the node must have three output pins. The detected format information will be written to the third output.
If set to True, the following must also be set:
- Format = AutoDetect.
- ConcatenationMode = union. If the CSVs are not all in the same format, the system displays an error without the union setting.
- NumSampleRows = 1.
After the node has read in the file, you can limit the number of records that are used later in your analysis, by adding a Head node to the output of the CSV/Delimited Input node.
NumSampleRows
Optionally specify the maximum number of rows in the input data to use when auto-detecting the CSV format.
This property has no effect unless the Format property is set to AutoDetect.
The default value is 100.
MaxCharsPerColumn
Optionally specify the maximum number of characters that can be read in a single field value.
This property is useful when processing CSV files which have a lot of data in one or more fields.
The default value is 102400.
MaxColumns
Optionally specify the maximum number of columns that can be read in from a single CSV file.
The default value is 5000.
CommentCharacter
Set the character which indicates that a line in a CSV file is a comment, and will therefore be ignored.
To disable the handling of comment lines, set this property to \0.
The default is #.
BadDataBehavior
Optionally specify the behavior of the node when it encounters a malformed record. Choose from:
- Error - Fail if any errors are detected.
- Log - Generate a warning if any errors are detected.
- Ignore - Ignore any errors that are detected.
The default value is Log.
NoDataBehavior
Optionally specify the behavior of the node when it encounters a file to process which contains no data (empty or missing file). Choose from:
- Error - Fail if any file to process contains no data.
- Log - Generate a warning if any file to process contains no data.
- Ignore - Ignore any files containing no data.
The default value is Error.
DuplicateFieldNameBehavior
Optionally specify what to do if an input file contains multiple fields with the same field names in the header row. You can select from the following options:
- Error - The node fails.
- AutoName - Each of the duplicate header fields (excluding the first) is output with a field name of the form <name>_<n>, where <n> indicates the number of times that field name has been seen in the header row.
DuplicateFieldNameBehavior can be set on a per record input basis, making it possible to process each input file differently before the files are combined as per the "ConcatenationMode" property.
The default value is Error.
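The AutoName rule can be sketched in Python; the underscore-and-count suffix below follows the description above, though the node's exact naming format is an assumption.

```python
def autoname(headers):
    """Give later duplicates of a header name an _<n> suffix (assumed format)."""
    seen, out = {}, []
    for name in headers:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return out

print(autoname(["Planet", "Type", "Type", "Type"]))
# ['Planet', 'Type', 'Type_1', 'Type_2']
```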
MissingFieldNameBehavior
Optionally specify what to do if a value within the header row is missing. You can select from the following options:
- Error - The node fails.
- Ignore - The field is not included in the metadata, and all data in this column is ignored.
- AutoName - An autogenerated field name is used based on the column number. The data within this column is then not ignored.
MissingFieldNameBehavior can be set on a per record input basis, making it possible to process each input file differently before the files are combined as per the "ConcatenationMode" property.
The default value is Error.
MissingFieldBehavior
Optionally specify what to do if a record within an input file contains missing fields. You can select from the following options:
- Error - The record is reported on the Errors pin and the node fails with an error.
- Log - The record is reported on the Errors pin, and the error counts towards the ErrorThreshold property.
- Ignore - The record is written to the output pin with empty values for the missing fields.
MissingFieldBehavior can be set on a per record input basis, making it possible to process each input file differently before the files are combined as per the "ConcatenationMode" property.
The default value is Error.
AdditionalFieldBehavior
Optionally specify what to do if a record within an input file contains additional fields. You can select from the following options:
- Error - The record is reported on the Errors pin and the node fails with an error.
- Log - The record is reported on the Errors pin, and the error counts towards the ErrorThreshold property.
- Ignore - The record is written to the output pin, ignoring any additional fields found.
AdditionalFieldBehavior can be set on a per record input basis, making it possible to process each input file differently before the files are combined as per the "ConcatenationMode" property.
The default value is Error.
SubstituteInvalidCharacters
There are some reserved characters which cannot appear in metadata. This optional property defines what to do if invalid characters appear in the input data and are to be used in the record metadata.
- True - Invalid characters are substituted with acceptable BRD metadata characters.
- False - The node errors and stops processing.
The default value is False.
BadEncodingBehavior
Optionally specify the behavior of the node when it encounters characters in the CSV data which cannot be read using the provided character set. Choose from:
- Report - Fail the node and report the problems that were encountered.
- Ignore - Ignore any errors that are detected.
- Replace - Auto-replace any such characters with a replacement character ('?'). No warnings will be reported.
The default value is Report.
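The three behaviors map closely onto Python's bytes.decode error handlers, which can illustrate the distinction: "strict" raises an error (Report), "ignore" drops the offending bytes, and "replace" substitutes a replacement character. Note that Python's replacement character is U+FFFD, whereas the node defaults to "?".

```python
# A byte sequence containing \xff, which is not valid UTF-8.
raw = b"Caf\xff,2.50"

try:
    raw.decode("utf-8", errors="strict")
    outcome = "ok"
except UnicodeDecodeError:
    outcome = "reported"

ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")
print(outcome)  # reported
print(ignored)  # Caf,2.50
```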
BadEncodingReplacementChar
Optionally specify an alternative replacement character to use when the node encounters characters in the CSV data which cannot be read using the provided character set.
A single character must be specified; the default value is "?".
ErrorThresholdCount
Optionally specify the number of records that may contain bad record data prior to the node failing.
By default, there is no limit to the number of invalid records. Typically you would only set either the ErrorThresholdCount or the ErrorThresholdPercentage.
The value must be a valid, non-negative integer. When set to 0, the node will terminate on the first CSV record with errors.
ErrorThresholdPercentage
Optionally specify which percentage of records are allowed to have errors before failing the node.
By default, there is no limit to the number of invalid records. Typically you would only set either the ErrorThresholdCount or the ErrorThresholdPercentage.
The value must be a valid integer between 0 and 100 (inclusive).
If set, the node will abort processing as soon as it has processed at least 100 records, and the percentage of records in error reaches the specified percentage threshold.
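The percentage-threshold rule as described can be sketched as a small predicate: abort only once at least 100 records have been processed and the running error rate has reached the threshold.

```python
def should_abort(processed, errors, threshold_pct):
    """True once >= 100 records processed and error rate meets the threshold."""
    return processed >= 100 and (errors / processed) * 100 >= threshold_pct

print(should_abort(50, 50, 10))   # False (fewer than 100 records so far)
print(should_abort(200, 10, 10))  # False (5% is below the 10% threshold)
print(should_abort(200, 20, 10))  # True
```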
Example data flows
A number of sample Data Flows are available from the Samples workspace, found in the Analyze Directory page.
In the Directory, under the /Data360 Samples/Node Examples/ folder, you will find "Reading and Writing CSV Data", which shows examples of how to use this node.
Inputs and outputs
Inputs: in1.
Outputs: output, errors, 1 optional.