Quick Start
Analyzing taxi traffic in New York
This example is based on the public dataset that can be found here.
It contains thirty million rows of detailed taxi trip data covering one year.
Tutorial Overview
In this tutorial, we will build a workflow to clean and prepare raw data, followed by an analysis of traffic patterns. The focus will be on identifying trends by day of the week and categorizing trips based on their duration.
First step: preview and load data
When you select the file to load, an instant preview of the data becomes available.

Data Preview and Type Inference
The preview displays the first 10,000 lines of the file. During this stage, heuristic rules are applied to infer the most appropriate data type and format for each column. For date fields, the system attempts to detect the format and distinguish between day/month and month/day ordering; in this case, it correctly identifies the American format (month before day). For numeric fields, it determines whether values are integers or floating-point numbers, and identifies the presence and style of thousands and decimal separators.
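To make the idea of heuristic inference concrete, the following is a minimal sketch in Python of how such rules could work on a sample of text values. It is not the product's actual implementation: the function names, the sample values, and the assumption that dates are slash-separated and that "," is the thousands separator are all illustrative.

```python
def infer_date_order(samples):
    """Guess whether slash-separated dates are month-first or day-first.

    A first field above 12 can only be a day, and a second field above 12
    can only be a day, so either one settles the ordering.
    """
    month_first_ok = True
    day_first_ok = True
    for text in samples:
        date_part = text.split()[0]  # drop the time of day, if present
        first, second, _year = (int(p) for p in date_part.split("/"))
        if first > 12:
            month_first_ok = False
        if second > 12:
            day_first_ok = False
    if month_first_ok and not day_first_ok:
        return "month/day"
    if day_first_ok and not month_first_ok:
        return "day/month"
    return "ambiguous"


def infer_numeric_type(samples):
    """Return 'integer', 'float', or 'string' for a list of raw text values."""
    kind = "integer"
    for text in samples:
        cleaned = text.replace(",", "")  # assumption: ',' separates thousands
        try:
            int(cleaned)
            continue
        except ValueError:
            pass
        try:
            float(cleaned)
            kind = "float"
        except ValueError:
            return "string"
    return kind


# Hypothetical sample values, as they might appear in a preview
pickup_times = ["01/13/2015 08:15:00", "01/02/2015 17:40:12"]
print(infer_date_order(pickup_times))            # month/day
print(infer_numeric_type(["1", "2", "1,250"]))   # integer
print(infer_numeric_type(["1", "2.5", "3"]))     # float
```

In this sketch, a single value such as 01/13/2015 is enough to rule out day-first ordering. If every sampled date has both fields at or below 12, the result stays ambiguous.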
Manual Adjustment of Column Types
While data types and formats are generally inferred accurately during preview, manual configuration may be necessary in specific cases. For instance, a column containing only integers in the first 10,000 lines may later include floating-point values in the full dataset. Some patterns may also be misinterpreted due to limited preview data. A common example is the representation of monthly periods as values like 2022.01, which may be inferred as floating-point numbers. To enable operations such as extracting year and month components, it is recommended to explicitly set the column type to “string”.
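The short Python sketch below illustrates why a period such as 2022.10 is safer as a string. Parsed as a float, it becomes indistinguishable from 2022.1; kept as text, the year and month can be split apart reliably. The helper name split_period is hypothetical and used only for illustration.

```python
period_text = "2022.10"

# Parsed as a float, the trailing zero is lost and October looks like January.
print(float(period_text))  # 2022.1


def split_period(period):
    """Split a 'YYYY.MM' string into integer year and month."""
    year_text, month_text = period.split(".")
    return int(year_text), int(month_text)


# Kept as a string, both components survive intact.
print(split_period("2022.10"))  # (2022, 10)
print(split_period("2022.01"))  # (2022, 1)
```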
Column Selection and Renaming
It may be necessary to exclude certain columns from processing and assign more descriptive names to others. In this example, the column “VendorId” is ignored, and the remaining columns are renamed for clarity. Additionally, the float precision is set to two decimal places.
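For readers who prefer to see this step as code, here is a rough Python equivalent using only the standard library. Apart from "VendorId", every column name, the new names, and the sample rows are hypothetical. The sketch also assumes that "float precision" means rounding values to two decimal places, although the tool may apply it only when displaying values.

```python
import csv
import io

# Hypothetical sample; only "VendorId" is named in the tutorial text.
RAW = """VendorId,tpep_pickup_datetime,trip_distance,fare_amount
1,01/13/2015 08:15:00,2.456,11.5
2,01/14/2015 09:30:00,0.8,5.0049
"""

IGNORED = {"VendorId"}
RENAMES = {
    "tpep_pickup_datetime": "pickup_time",
    "trip_distance": "distance_miles",
    "fare_amount": "fare",
}
FLOAT_COLUMNS = {"distance_miles", "fare"}


def load_rows(text):
    """Read CSV text, dropping ignored columns and renaming the rest."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in raw.items():
            if name in IGNORED:
                continue
            new_name = RENAMES.get(name, name)
            if new_name in FLOAT_COLUMNS:
                value = round(float(value), 2)  # two decimal places
            row[new_name] = value
        rows.append(row)
    return rows


rows = load_rows(RAW)
print(rows[0])
```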

Once the layout is finalized, selecting the Load All button initiates loading of the entire dataset into memory. For a file containing 30 million rows and 18 columns, approximately 5 GB of memory is required.
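As a back-of-the-envelope check, the figures above imply an average storage cost per row and per cell. The calculation below assumes 1 GB = 10^9 bytes and treats the 5 GB figure as exact, although the text gives it only as an approximation.

```python
ROWS = 30_000_000
COLUMNS = 18
TOTAL_BYTES = 5 * 10**9  # assumption: 1 GB = 10^9 bytes

bytes_per_row = TOTAL_BYTES / ROWS
bytes_per_cell = TOTAL_BYTES / (ROWS * COLUMNS)

print(f"{bytes_per_row:.1f} bytes per row")    # 166.7 bytes per row
print(f"{bytes_per_cell:.2f} bytes per cell")  # 9.26 bytes per cell
```

An average of roughly 9 bytes per cell can serve as a starting point for estimating memory needs for other files, provided their column types are similar; actual usage will depend on the data.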