This example illustrates the simple use case of connecting to an external database that uses a driver which is pre-installed with Data360 Analyze.
See also Connecting to a Cloudera Hadoop cluster for an example of how to connect to a database that uses a driver which is not shipped with Data360 Analyze.
Your installation includes a pre-built data flow called "Acquiring data from MS Access", located in the Samples workspace, found in the Analyze Directory page.
This example imports data from an MS Access database using the JDBC Query node, and then combines that data with additional records contained in three different CSV files.
The following steps walk you through the process of building this data flow from scratch:
-
From the Directory, select Create > Data Flow.
-
In the Nodes panel, search for the JDBC Query node and drag it onto the canvas.
-
Select the JDBC Query node, then enter a SQL statement in the SqlQuery property to query your database. For example:
select * from orders
-
Configure the following Connection properties:
- From the list of database types in the DbType property, select MS Access.
- In the DbName property, enter the full path to your MS Access file, for example:
/opt/Data360Analyze/docs/samples/data/GettingStarted.accdb
Tip: When connecting to most other databases, you only need to enter a database name in this property. - DbHost - In the case of MS Access, this property can be left blank. If you were connecting to a different database, you would enter the database host name in this property e.g.
localhost
.
-
Run the JDBC Query node to import the MS Access data. The record count shows 1000 records:
-
Drag a Calculate Fields node onto the canvas and connect it to the JDBC Query node.
-
To prepare for merging the MS Access data with the data contained in the CSV files, convert the "order_id" field to string data by configuring the node properties as follows:
- In the ConfigureFields property, type:
#Configure all fields from input 'in1' to be mapped
#to the corresponding fields on the output 'out1'
out1 += in1
#Specify the data type of the new output field
out1.order_id = str
In the ProcessRecords property, type:
#Copy all fields from input 'in1' to the corresponding output fields
#in output 'out1'. Copies the fields who have been setup
#in the mapping defined in the ConfigureFields property
out1 += in1
#Convert each 'order_id' record to a str data type
out1.order_id = str(int(in1.order_id))
#Copy all fields from input 'in1' to the corresponding output fields
#in output 'out1'. Copies the fields who have been setup
#in the mapping defined in the ConfigureFields property
out1 += in1
#Convert each 'order_id' record to a str data type
out1.order_id = str(int(in1.order_id))
- Run the Calculate Fields node.
Importing additional data
The next step is to import the CSV data that is contained within three separate CSV files. The best way to quickly import data from multiple CSV files is to use the Directory List node.
- In the Nodes panel, search for the Directory List node and drag it onto the canvas. Configure the Directory List node properties, as follows:
- In the DirectoryName property, specify the path to where the CSV files are located. In the example, the CSV files are located in the
docs/samples/data
folder of your installation directory, so the DirectoryName property is configured to point to this location, as follows:{{%ls.home%}}/docs/samples/data
- In the Pattern property, specify that you want to import all CSV files that contain "order_detail_" in their filename by typing:
order_detail_*
- In the DirectoryName property, specify the path to where the CSV files are located. In the example, the CSV files are located in the
- Drag a CSV/Delimited Input node onto the canvas and connect it to the Directory List node. Run the Directory List node and click the record count to open the data viewer.
The field that contains the paths to the three CSV files is called "FileName". In the next step, you need to reference this field name in the CSV/Delimited Input node.
- Select the CSV/Delimited Input node and choose the (from Filename Field) variant of the Data property, then type
FileName
, or click the menu button and select Input Fields > in1 > FileName.Run the CSV/Delimited Input node to import the CSV data. Then, click on the record count of the CSV/Delimited node to view the data in the data viewer.
- Drag a Trim Fields node onto the canvas and connect it to the CSV/Delimited Input node. Run the Trim Fields node to remove any leading or trailing white space characters.
Merging data
Now that you have successfully imported the MS Access and CSV data, the next step is to merge the data.
-
Drag a Merge node onto the canvas, and connect the Transform node to the left input and the Trim Fields node to the right input. Configure the node properties, as follows:
- In the Match Keys Grid, select
order_id
for the Left Field. This is the name of the field in the MS Access input data that you want to merge with the CSV data. - For the Right Field, select
order_ref
. This is the name of the field in the CSV input data that contains the data that you want to merge with the Access data.
- In the Match Keys Grid, select
-
Run the Merge node.You have now merged the Access data with the CSV data.
-
To publish the merged data, drag an Output Excel node onto the canvas and configure the File property by clicking the folder icon to navigate to where you want to output the data.