Excel File - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
Latest
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016
Last updated
2024-11-28
Published on
2024-11-28T15:26:57.181000

Imports data from one or more Microsoft Excel files (.xls, .xlsx, .xlsm), each of which may contain one or more worksheets.

To import a single Excel file, in the File property, enter the path of the Excel file that you want to import.

To import multiple Excel files, you first need to add an input to the Excel File node. For example, the Directory List node is often used as an input to import multiple Excel files. Then in the FilenameExpr property, enter the name of the field in the input data which contains the Excel filenames. For example, if the Directory List node lists the Excel files in a column called "FileName", in the FilenameExpr property, type FileName.

You can use either absolute or relative file paths, but note that if the file path is relative, then the temporary directory of the data flow for the current run is used as the base to which the file path is relative.

Note: You should only use either the File property OR the FilenameExpr property, not both.

Importing data when the data does not begin in cell A1

If the data in the Excel file that you are importing from does not begin in cell A1, you must point to the data in the WorkbookSpec property.

Tip: The WorkbookSpec property must always begin with <workbook> and end with </workbook>.

Example 1 - Header row immediately followed by data

In this example, you have data that starts in cell A2, with the column headers on row 2 and the data starting on row 3.

A B C
1
2 Rank Country Code
3 1 Denmark DNK
4 2 Sweden SWE
5 3 Finland FIN
6 4 Singapore SGP
7 5 France FRA
8 6 Spain ESP

To successfully import this data, add the following text to the WorkbookSpec property:

<workbook><sheet headerRow="2" /></workbook>

This indicates that the column headers should be taken from row 2 of the worksheet. There is only one worksheet, so you do not need to specify the worksheet number. Additionally, you do not need to specify that the data starts from row 3, as the default action is to take data from the row following the header row.

Example 2 - Gap between header row and data

In this example, you have column headers that start on row 6 and data that starts on row 8. Additionally, the data starts in the 2nd column. The last row of data is on row 113, and there are table notes below that row.

Importing a specific segment of data

To import a specific segment of the Excel file data, for example to import the column headers and the data on rows 8 through to 113, add the following text to the WorkbookSpec property:

<workbook><sheet headerRow="6" dataStartRow="8" dataEndRow="113" dataStartColumn="2" /></workbook>

This indicates that the column headers should be taken from row 6 of the worksheet, and the data should be taken from rows 8 to 113. The dataStartColumn text specifies that the data starts in column 2.

Tip: Row and column values need to be written inside double quotes.

Example 3 - Importing data from multiple worksheets

In this example, you have 14 different tables across 14 different worksheets. To import the data for Table 1 and Table 2 and ensure that we keep the data separate in the output:

  1. Add an additional output pin to the Excel File node, see Defining input and output pins.
  2. Add the following text to the WorkbookSpec property.

<workbook><sheet name="Table 1" headerRow="4" dataStartRow="6" dataEndRow="16" outputIndex="1"/><sheet name="Table 2" headerRow="4" dataStartRow="6" dataEndRow="16" outputIndex="2"/></workbook>

This indicates that from Table 1, the column headers from row 4 and data from rows 6 to 16 will be imported and will be output to the first output pin. From Table 2, the column headers from row 4 and data from rows 6 to 16 will be imported and will be output to the second output pin.

Tip: If you are importing data from two different worksheets that do not have the same name, and then outputting the data to a single output pin, you must define the outputIndex in the WorkbookSpec property e.g. <sheet index="1" outputIndex="1" headerRow="2" dataStartColumn="2"></sheet>

Properties

File

Click the folder icon and browse to the Excel file that you want to import.

If you use this property, do not also use the FilenameExpr property.

Note: If you are using the server edition and wish to use a file on your local machine, you must first upload it to the Data360 Analyze server. See Upload data files for more details.

FilenameExpr

To import multiple Excel files, add an input to the Excel File node, then in the FilenameExpr property, enter the name of the field in the input data which contains the Excel filenames. For example, if the Excel File node is connected to the output of a Directory List node, the FilenameExpr property would be set to FileName.

If you use this property, do not also use the File property.

IgnoreCellFormatting

Optionally specify whether to ignore cell formatting for numerical values. If set to true, the actual numerical values are output instead of the formatted values. For example, if a numeric value is 0.467857 with a formatting of 2 decimal places, when this property is set to true, the output will contain the value 0.467857.

The default value is False.

ConcatenationType

Optionally specify how the output metadata should be constructed based on the input data:

  • Intersection - The record metadata will be taken as the intersection of all fields going to the same output
  • Union - The record metadata will be taken as the union of all fields going to the same output
  • Exact Error - If the metadata does not match exactly (ignoring field order), then an error will be thrown
  • Exact New Output - If the metadata does not match exactly (ignoring field order), then a new output will be constructed for all of the worksheets which have metadata that did not match.

The default value is Exact New Output.

The Excel File node will first process the metadata for all of the worksheets in all of the input Excel files before determining which outputs the worksheets should be written to and the record metadata on each of these outputs.

Prior to determining which node outputs are to be used for each worksheet in the file, the worksheets are first placed into tentative output buckets.

This is done based first on the outputName or outputIndex properties specified in the WorkbookSpec if any are set. If no WorkbookSpec property is set or if the outputName or outputIndex is not set for a given worksheet in an input file, then the worksheet is bucketed according to the name of the worksheet in the input file.

Once these preliminary buckets have been established, this property determines how the record metadata will be set for the outputs.

Note that in the Exact New Output case, consider the case where there are four worksheets mapped to the same output bucket, A, B, C & D. Assume that the metadata for A is used first. If B, C & D contain metadata that does not match that of A, then these will be placed into a new tentative output bucket. If the metadata for D does not match that of B and C, then B & C will be placed in one output, and D into another.

DetectFieldTypes

Optionally specify whether or not the node should attempt to derive the output field types based on the cells in the workbook(s).

Since Excel files are cell-based rather than table-based, when this property is set to True, the node will perform a full pass over all of the data cells in the workbook(s) to determine the appropriate field type.

In cases where the field types for cells within the same column are different, the type on the output may be a numeric widening (e.g. int -> double) and if this is not possible - or if this property is set to False - the output type will be based on the DefaultType property.

The default value is True.

DefaultType

Optionally specify whether output character-based fields are to be in String or Unicode format.

The default value is Unicode.

If the DetectFieldTypes property is set to False then this property will be applied to all fields read from the Excel workbook(s).

This property is used for all fields except for the generated field "Row" which is a long.

TrimFieldNames

Optionally specify whether or not the node should trim any whitespace from the field names located in the Excel worksheet.

The default value is True.

WorkbookSpecExample

The WorkbookSpecExample property gives an example of the XML code that you can use in the WorkbookSpec property.

WorkbookSpec

To import an Excel file containing data that does not begin in cell A1, or to import a specific segment of an Excel file, add XML code to the WorkbookSpec property to point to the data that you want to import.

All indexes are 1-based (first is 1, second is 2). All attributes must have values surrounded by double quotes " even if they are numeric. Use the following key words when adding the XML code:

<workbook> Top level container. Contains elements <sheet> and <generatedFields>.

<sheet> Use this tag to refer to individual worksheets within an Excel file. There are various optional attributes that can be used within the <sheet> tag to describe the worksheet, and the <field> sub-elements to describe the fields, if not given in a headerRow:

  • name - Use this tag to lookup the sheet by name (optional), for example: <sheet name="SomeSheetName" headerRow="3"/>
  • index - Use this tag to refer to the sheet by position in the workbook (optional), for example: <sheet index="3" /> Note: Only one of the name and index attributes is allowed. If neither is given, the worksheet number is auto-incremented from the previous index (or 1 if initial). Auto-numbering cannot be used if the preceding sheet definition uses the name attribute.
  • outputIndex - Specifies the node output index to which the sheet should be sent (optional).
  • outputName - Specifies the node output name to which the sheet should be sent (optional). Only one of outputIndex and outputName attributes is allowed.
  • headerRow - The row of the sheet to be read for all field names (optional). Defaults to the first row. For example: <sheet name="Another Sheet Name" dataStartRow="5" headerRow="3" />
  • dataStartRow - The first row to be imported (optional). If not provided, it will default to the first row after the headerRow, or if the <fields> element is used, the first row.
  • dataEndRow - The last row to be imported (optional). If not provided, it will default to the last populated row in the worksheet.
  • dataStartColumn - The first column for each row to be processed. Applies to both header rows and data rows. If not specified, defaults to the first column in the sheet.
  • dataEndColumn - The last column for each row to be processed. Applies to both header rows and data rows. If not specified, defaults to the last column in the first row.
  • noDataAtIndexAction - Overrides the NoDataAtIndexAction property for the specified worksheet.

<generatedFields> - Contains additional <field> sub-elements. Use this tag to add generated fields that can be added to each record on all outputs (optional).

<field> - Optional elements inside <sheet> and <generatedFields> which are the ordered list to manually specify the field names. The <field> elements should only be used inside a <sheet> if the <sheet> attribute headerRow is not specified. Attributes:

  • name - The name of the field. Mandatory when inside <sheet>, optional inside <generatedFields> (defaults to type).
  • type - The type of generated field. Mandatory when inside <generatedField>, not allowed inside <sheet>. The type can be one of "filename", "sheet", and "row". For example: <generatedFields><field type="filename" />

The following example illustrates data being imported from the second and third worksheet within an Excel file. The first sheet has the first row as the header row, and the last sheet has data starting on the third row, and only the first column with the field "phone" is imported:

<workbook>

<sheet index="2" /><sheet dataStartRow="3" >

<field name="phone"/>

</sheet><generatedFields>

<field type="row"/>

<field type="sheet" name="source"/>

</generatedFields>

</workbook>

MetadataOutputPin

You can create a "Metadata" output pin which details where the worksheets in the input Excel files are output; in this optional property, you can specify the name of the "Metadata" output pin.

When this property is set, and the property UnmatchedOutputNameAction is set to Auto Output, then for each of the worksheets that cannot be mapped to a node output, a new output file is created. The "Metadata" output contains a reference to the name of the generated file, along with information about the input Excel file and worksheet containing the source data.

Alternatively, if there are multiple worksheets in your Excel files, you can choose to add outputs to the node for each of the different worksheets and reference them by using the outputIndex or outputName attributes in the WorkbookSpec property. Note that you cannot output worksheets to the metadata pin.

TrustedSource

Specify whether or not to trust the Excel file source.

XLSX and XLSM files are stored in a zip format. Malicious zip files can contain "zip bombs".

To prevent such attacks, the Excel File node is configured to error if it detects that the file is decompressing to a much larger size than the compressed version. This can lead to false positives where non-malicious files are detected to contain zip bombs simply due to the way that they are decompressed.

If you receive such errors while attempting to process an Excel File that you know is from a trusted source and cannot contain a zip bomb, set this property to True which will inform the node that the file is safe to process.

The default value is False.

UseSystemLocale

Optionally specify whether or not the output data is formatted using the system locale. The default is false.

The Excel File node imports an Excel File then outputs each cell as a formatted string value. If this property is set to false, then that string value is formatted in such a manner that subsequent nodes will be able to easily transform the string data to other formats, such as numeric or date. If this property is set to true, then you may not be able to easily format the data using subsequent Script expressions. However, you will be able to use the system Locale in other downstream nodes (for example the Java node) to transform the value.

UnmatchedOutputNameAction

If a WorkbookSpec property is set and a worksheet has a specified outputName or outputIndex, the worksheet will be output to the corresponding node output if it exists. If no outputName or outputIndex is specified on the WorkbookSpec property, or if the WorkbookSpec property is not defined, the worksheets in the input Excel files are attempted to be matched to the outputs based on the name of the output and the name of the corresponding worksheet in the input file. The UnmatchedOutputNameAction property defines what to do when there is no node output with the same name as that of the worksheet in an input Excel file. You can select from the following options:

  • Error - The node errors.
  • Ignore - The worksheet is ignored and is not output.
  • Log - The worksheet is not output, but a message is logged stating the reason.
  • Auto Output - The worksheet is output to a node output which does not have any corresponding worksheets in the input Excel files.

The default value is Auto Output.

Note: If you are using a "Metadata" output, leave this property to its default or set it to Auto Output. When the MetadataOutputPin property is set, for each of the worksheets that cannot be mapped to a node output, a BRD file is created with the output and referenced in the "Metadata" output pin.

NullHeaderCellAction

Optionally specify what to do if a cell within the header row contains null data or whitespace only. You can select from the following options:

  • Error - The node fails.
  • Ignore - The field is not included in the metadata, and all data in this column is ignored.
  • AutoName - An autogenerated field name is used based on the column number. The data within this column is then not ignored.

The default value is AutoName.

NullRowAction

Optionally specify what to do if an entire row of data (not the header) contains null data. This is regardless of whether they are at the end of a file or in the middle. You can select from the following options:

  • Error - If a null row is encountered, the node will error.
  • Log - If a null row is encountered, information about the row is logged, but no record is output.
  • Ignore - If a null row is encountered, no information is logged, and the record is not output.
  • Output - If a null row is encountered, the record is output with NULL entries for all empty fields.

The default value is Output.

NoDataAtIndexAction

Optionally specify what to do if a worksheet specified in the Excel file points to a sheet index, sheet name, or headerRow where there is no data. You can select from the following options:

  • Error - The node fails.
  • Log - The worksheet is ignored. An explanatory message is written to the log.
  • Ignore - The worksheet is ignored. Nothing is written to the log.

The default value is Error.

This property can be overridden within the WorkbookSpec property by using the <sheet> element.

DuplicateFieldNameAction

Optionally specify what to do if a worksheet contains multiple fields with the same field names in the header row. You can select from the following options:

  • Error - The node fails.
  • AutoName - Each of the duplicate header fields (excluding the first) will be output with the field name <FieldName>_<OccurenceCount> where <OccurrenceCount> indicates the number of times that field has been seen in the header row.

The default value is Error.

CellErrorAction

Optionally specify what to do if a cell contains an error, for example an invalid formula. You can select from the following options:

  • Error - The node fails.
  • Log - Information about the error is written to the log. The cell is output as NULL.
  • Ignore - The cell is output as NULL. No information is written to the log.

The default value is Error.

SubstituteInvalidChars

Data within the Excel File is used to construct the metadata for the record outputs (unless the WorkbookSpec defines the metadata explicitly). Some reserved characters are not allowed to appear in the metadata. This optional property defines what the node should do if such characters are found in Excel cells which are to be used in the record metadata. Choose from:

  • True - Ensures that characters such as ":", and newlines are substituted in the output metadata to form a legal field name.
  • False - The node errors if illegal characters are encountered.

The default value is False.

NoOutputPinErrorBehaviour

If the input data contains multiple worksheets, or different metadata, multiple output pins are required; this optional property allows you to specify what to do if multiple output pins do not exist. You can select from the following options, where Log is the default (if you select Log or Ignore, it is recommended that you set the MetadataOutputPin property to true to view the metadata and observe the differences):

  • Error - The node fails. Add additional output pins.
  • Log - A warning messages is logged. Data is acquired for the number of output pins that are available.
  • Ignore - Data is acquired for the number of output pins that are available. There is no indication of loss of data.

Example data flows

A number of sample Data Flows are available from the Samples workspace, found in the Analyze Directory page.

In the Directory under the /Data360 Samples/Node Examples/ folder, you will find "Reading and Writing to Excel", which shows examples of how to use this node.

Note: Upgrades will overwrite all data flows in the workspace. If you want to make changes to one of the sample data flows, we recommend you create a copy and save it elsewhere, using Save as...

Inputs and outputs

Inputs: 1 optional (filenames).

Outputs: out1, multiple optional (metadata and workbook data).