Before attempting to analyze any data you will commonly find you need to make sure the data is prepared for analysis. This can include ensuring that data is of the right data type, that it is correctly formatted, and that the data set contains only the fields that are relevant for your analysis. Data360 Analyze includes a range of nodes that will help you prepare your imported data for analysis.
Modify fields
When data is imported from CSV or Excel files, all fields are of the type String. When performing analysis on data it is often important to know that a field should be treated as a specific data type, such as integer (whole number), double (decimal number) or date.
For this reason, it is very common to follow an import from file with a Modify Fields node, which can be used to convert string data into typed data, inferring the type to convert to from the type of data it finds in the records.
- Following on from the previous steps in importing data, locate the Modify Fields node in the Nodes panel and drag and drop two instances of the node into the data flow.
- Drag a connection wire from the output of the "Transactions" node onto the input pin of the first Modify Fields node.
- Select this Modify Fields node. In the Properties panel you'll see the OutputFields options for the node. You can use these options to include or exclude fields, rename fields, and change the data type of fields. You can also let the node automatically choose the correct data type for each field, which is what we'll do. In the Type column, make sure the Auto box is ticked.
- Now connect the "OrderInfo" node to the second Modify Fields node, and again make sure the Auto box is ticked in the OutputFields options.
- Change the Name property of both Modify Fields nodes to "Convert data types".
- Select both Modify Fields nodes (either by holding down Ctrl when clicking to select the nodes, or by holding down Shift and dragging a selection box over them) and click Run.
- Hover over the output pins of one of the Modify Fields node. The pins are called data, and errors.
- Click the output pin of the "Transactions" node and the data output pin of the "Convert data types" node that is connected to it. In the data viewer, compare the data types for the columns in each table. You will see that the "AccountID" column has been converted from unicode to int, and the "Date" column has been converted from unicode to date. Compare the "OrderInfo" node with its modified output in the same way.
Clean data
Before we start using our transactions data, let's briefly check that it's in good shape. We'll use the data quality indicators on the data viewer.
- Further inspection of the data viewer shows Data quality indicators for each column.
- You can see that the indicator for the Product column is not complete and lighter green than the others.
- Click on the data quality indicator for the Product column.
- Here Field Statistics are shown for the column, of interest is the percentage data quality, which is currently at 96%, there is also an indication that there are 4 values with leading or trailing white space
- In the Nodes panel, change from Favorites to All Nodes if necessary to show all of the available nodes. Trim Fields has a pretty specific use case and is not included in the most commonly used set of nodes.
- Start typing "trim" in the Search nodes field. The nodes that are displayed in the nodes panel are filtered as you type.
- Locate the Trim Fields node, drag it onto the canvas and connect it up to the data output pin of the "Transactions" Modify Fields node. Change the Name property of this node to "Trim leading and trailing white spaces".
- This node does not need any extra configuration, so you can Run the node and click the output pin record count to view the resulting data set.
- To confirm that this node has given us what we want, open the Trimmed Data pin on the Trim Fields node.
- You will now see the data quality indicator for the Product column is full and bold green, clicking on this data quality indicator will show you the data quality is now 100% and we have cleaned up our data.
Sort data
It is often useful to sort data by a particular field. At the moment, the "Order Info" data set is not sorted, but ifwe want it sorted it by product name, we can do this by using a Sort node.
- In the Search Nodes field, type "Sort".
- Select the Sort node and drag it onto the canvas.Change the name of the node to "Sort by Product field".
Make sure you select the Sort node, and not the Sort (Deprecated) node, which is an older version of the node.
- Connect data output pin of the Modify Fields node that is connected to the "OrderInfo" node to the first input pin of the Sort node.
- In the node options, you can use the "SortBy" property to select one or more fields from the data set. Click to add a field, and select "Product" from the list.
- Run the node, and click the output pin to view the sorted data set in the data viewer.
Exclude fields
The "ProductInfo" node contains a field called "LastOrderDate" that we want to exclude from our data. We can use another Modify Fields node for this.
- Place a new Modify Fields node on the canvas.
- Change the Name property to "Exclude LastOrderDate field", and connect the node to the "ProductInfo"node.
- In the Output Fields table, under Patterns and Fields, find 'LastOrderDate' and untick the box next to it. This will exclude the column from the output when you run the node.
- Select the "Exclude LastOrderDate field" node, and click Run.
- Click the "data" output pin for the "Exclude LastOrderDate field" node. In the data viewer, verify that the "LastOrderDate" column does not appear.
Filter data
Our Product data set contains data for two shipping cycles, which is detailed in the "ShippingCycle" field. For now we are only interested in shipping cycle 14, so we will need to filter out the orders by using a Filter node.
- In the Search Nodes field, type "Filter".
- Select the Filter node and drag it onto the canvas.
- Connect the data output pin of the "Exclude LastOrderDate field" node to the input pin of the Filter node.
- Select the Filter node and rename it to "Shipping Cycle 14".
- In the Criteria table, under "Match all of the following", click in the Field column to select a field to filter on.
- Select "ShippingCycle".
- Make sure the Operator is "Equals".
- In the Value column, enter "14".
- Run the node. When you view the output in the data viewer, you'll see that it now contains only the records for Shipping Cycle 14.
Next steps
Now that you have trimmed, cleaned and filtered your data sets , they are ready to be merged. Save the latest changes to your data flow by clicking the Save icon in the top left corner of the toolbar, and move on to the next stage, see Merging data.