In our previous article “How to Use ChatGPT for Data Analysis (Part I)” from February 27, we explored some of the capabilities of ChatGPT in data analysis, such as code generation and the creation of synthetic datasets.
In today’s article, we go further, delving into additional useful applications, starting from exploratory data analysis (EDA) with graphs and statistical models, to concluding with the necessary documentation for support.
Task 4: applications for data analysis in the strict sense
- Data Cleaning and Preparation: ask for methods to handle missing data, remove duplicates, or convert data types, or simply describe the desired outcome and ask for it to be translated into Python code.
- Exploratory Data Analysis (EDA): generation of descriptive statistics, graphs, and correlations.
- Statistical Modeling and Machine Learning: request code to build, evaluate, and optimize models.
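The cleaning-and-preparation tasks in the first bullet can be sketched in a few lines of pandas. This is a minimal, self-contained example with a hypothetical toy frame (the column names are illustrative, not from the article's dataset):

```python
import pandas as pd

# Hypothetical example frame; column names are illustrative only.
df = pd.DataFrame({
    "price": ["1.5", "2.0", None, "2.0"],
    "category": ["A", "B", "B", "B"],
})

# Convert data types: price arrives as text, cast it to numeric
# (unparseable values become NaN instead of raising).
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Handle missing data: fill numeric gaps with the column median.
df["price"] = df["price"].fillna(df["price"].median())

# Remove duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()

print(df)
```

Describing exactly this outcome in natural language is typically enough for ChatGPT to produce equivalent code.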
Prompt Composition Strategies
- Describe the specific problem and the data available.
- Indicate the programming language and libraries to use (e.g., Python with pandas, scikit-learn):
- you can express preferences, e.g. asking for Seaborn instead of Matplotlib where possible
- you can specify how values should be ordered in the charts and which color palettes to prefer
- you can indicate which statistical models to apply
- Ask for best practices and specific advice for your use case.
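The chart preferences above (library, value ordering, palette) translate directly into Seaborn parameters. A minimal sketch, with hypothetical ingredient counts:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical counts; the ordering and palette mirror the
# preferences one would state in the prompt.
counts = pd.DataFrame({"ingredient": ["caramel", "chocolate", "nougat"],
                       "n": [14, 37, 7]})

# Order bars by descending count, as requested in the prompt.
order = counts.sort_values("n", ascending=False)["ingredient"].tolist()

ax = sns.barplot(data=counts, x="ingredient", y="n",
                 order=order, palette="viridis")
ax.set_title("Ingredient frequency (descending)")
plt.savefig("ingredients.png")
```

Stating these preferences explicitly in the prompt spares a round of "please reorder the bars" follow-ups.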
Practical Use Cases with ChatGPT
Now let's imagine a different, and much more common, scenario: we have a dataset about which only general information is known. An initial analysis phase is useful to understand, for example, cardinality, data types, and data distributions, but also to surface possible errors or gaps, as well as correlations or patterns worth investigating in later steps.
The first step begins with EDA. For this example, we used a dataset with the Halloween candy rating index and proceeded to:
- Load the dataset, specifying how to interpret its textual content using the description available on Kaggle, and request a descriptive statistical analysis.
- Evaluate the results and ask for further levels of analysis or insights.
- Export the obtained code and continue the analysis in a comparative and/or in-depth manner.
The following gif shows the workflow outlined in the previous points:
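The first step of that workflow can be sketched in pandas. Here a small inline CSV stands in for the Kaggle candy file (in practice one would call pd.read_csv on the downloaded file; the column names follow the dataset's documented schema, with 1/0 ingredient flags):

```python
import io
import pandas as pd

# Stand-in for the Kaggle candy file; in practice you would use
# pd.read_csv("candy-data.csv").
csv_text = """competitorname,chocolate,fruity,winpercent
Twix,1,0,81.6
Skittles,0,1,63.1
Candy Corn,0,0,38.0
"""
candy_data = pd.read_csv(io.StringIO(csv_text))

# Step 1: descriptive statistical analysis.
print(candy_data.describe())

# Step 2: deeper levels of analysis, e.g. cardinality and dtypes.
print(candy_data.nunique())
print(candy_data.dtypes)
```

ChatGPT generates essentially this kind of code behind the scenes, which is what makes step 3 (exporting it) possible.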
When it comes to exporting the work, a new use case for ChatGPT is introduced: writing the documentation with details about the dataset, the steps and calculations required, and the summary of insights derived from the analysis conducted.
Although ChatGPT is a powerful and versatile tool, it cannot replace the analyst, who must use it critically to maximize its benefits and minimize sources of error. It is therefore crucial to be able to inspect the generated Python code and, if necessary, step in to correct the instructions it produces.
Below is the initial interaction for point 1.
The following image shows an overview of the dataset with the main characteristics of the imported file. Here too, the request is translated into Python code that is executed, and the results are interpreted to return a textual answer in natural language.
By accessing the Code Interpreter’s inspector, it’s clear how the request to check for the presence or absence of null values is handled by ChatGPT by executing the following code.
# Dataset overview
dataset_overview = {
    'Number of Rows': candy_data.shape[0],
    'Number of Columns': candy_data.shape[1],
    'Column Names': candy_data.columns.tolist(),
    'Missing Values': candy_data.isnull().sum().sum()  # Total missing values across the dataset
}
For example, the instruction candy_data.isnull().sum().sum() tells us whether there are missing values in the dataset or not. In this way, ChatGPT utilizes the functionalities of the Pandas library to return an interpretation of the results obtained. After being briefed on the context of analysis, the tool correctly handled the binary encoding (1/0) used to indicate the presence or absence of a certain ingredient, evaluating the distribution of popularity and ingredients over the number of candies examined and graphically representing the results, thanks to the Matplotlib and Seaborn libraries.
From the bar charts, we can infer that chocolate and fruit flavors are present in about half of the 85 candies tested, while other ingredients are less common.
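Because the ingredient columns are 1/0 flags, the share of candies containing each ingredient (the "about half" observation above) is simply the column mean. A minimal sketch on a hypothetical miniature of the candy table:

```python
import pandas as pd

# Hypothetical miniature of the candy table; the real dataset has
# 85 rows with 1/0 ingredient flags.
candy_data = pd.DataFrame({
    "chocolate": [1, 0, 1, 1],
    "fruity":    [0, 1, 1, 0],
    "caramel":   [0, 0, 0, 1],
})

# Mean of a 1/0 column = share of candies containing that ingredient.
shares = candy_data.mean().sort_values(ascending=False)
print(shares)
```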
What follows is what is obtained by requesting a statistical-descriptive analysis.
These visualizations emerge without writing a line of code: a request formulated in natural language has allowed ChatGPT to extrapolate and represent this information in an intuitive and immediate manner.
Access to the code is quick and intuitive, as shown in the gif, and allows for directly copying the instructions to paste into a .py file.
With some more specifics, it’s also possible to obtain particular visualizations, not strictly standard, like the matrix with the profile of ingredients for the 5 most appreciated candies.
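A matrix like the one described, i.e. the ingredient profile of the top 5 candies, can be obtained by ranking on the liking score and rendering the flags as a heatmap. A sketch with hypothetical data (candy names and scores are invented; the real call would rank the full Kaggle table):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical subset standing in for the full candy dataset.
candy_data = pd.DataFrame({
    "competitorname": ["A", "B", "C", "D", "E", "F"],
    "chocolate": [1, 1, 0, 1, 0, 1],
    "fruity":    [0, 0, 1, 0, 1, 0],
    "winpercent": [84.2, 81.9, 66.0, 73.4, 55.1, 71.0],
})

# Rank by liking score and keep the top 5.
top5 = (candy_data.sort_values("winpercent", ascending=False)
        .head(5)
        .set_index("competitorname"))

# Ingredient-profile matrix: one row per candy, one column per flag.
profile = top5[["chocolate", "fruity"]]
sns.heatmap(profile, annot=True, cbar=False, cmap="Blues")
plt.savefig("top5_profile.png")
```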
An additional analysis could use multivariate regression, in which the independent variables (our ingredients, in this case categorical with values 0 and 1) are evaluated jointly, for example estimating the liking percentage from the average winpercent of all candies containing those ingredients. This could provide a deeper understanding of a candy's compositional profile. ChatGPT, invoking models like scikit-learn's LinearRegression, helps us explore these more complex relationships.
With a coefficient of determination R² of 0.776, the way combinations of ingredients influence candy popularity can be considered significant. Chocolate, for instance, has a high coefficient, meaning it plays a significant role in the liking percentage. So if we wanted to launch a new snack on the market, this could be a promising direction of analysis, rather than limiting ourselves to the compositional profile that emerges from the ingredient matrix of the top 5 candies. Indeed, the presence of a puffed-rice wafer appears to be the second most important factor.
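The regression described above can be sketched as follows. The ingredient matrix and liking scores here are synthetic (generated with an invented effect for chocolate and caramel), so the coefficients and R² will not match the article's 0.776; the structure of the fit is the point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in: 85 candies, three 0/1 ingredient flags,
# with an invented positive effect for chocolate and caramel.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(85, 3))  # chocolate, fruity, caramel
y = 40 + 25 * X[:, 0] + 10 * X[:, 2] + rng.normal(0, 5, 85)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))

# Each fitted coefficient estimates an ingredient's contribution
# to the liking percentage.
print(dict(zip(["chocolate", "fruity", "caramel"], model.coef_.round(2))))
print(f"R^2 = {r2:.3f}")
```

Reading the coefficients this way is exactly the kind of interpretation ChatGPT returns in natural language after running the fit.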
Final Considerations
In conclusion, ChatGPT can be a valuable aid for general analysis of a relatively simple dataset (a denormalized data model with a single table), deriving insights by formulating in natural language the task to be performed (prompting) and the context (a description of the application domain), without needing to know technically how to achieve that result. In this spirit, one could feed in a business's transaction log (Excel or CSV) and ask for total sales, the average selling price per product category, or the month-over-month (MoM) percentage change.

What follows is an example of such an interaction, but it is important to stress that ChatGPT is a useful yet fallible tool, best seen as an assistant rather than a specialist. For this reason, it is crucial to formulate tasks unambiguously, defining the context relevant to the analysis in as much detail as possible. The questions posed by ChatGPT constitute, as you might guess, the heart of this prompting strategy, capable of guiding even less experienced users toward a correct and complete definition of the request before the model takes it over.
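The three business questions just mentioned map onto a handful of pandas aggregations. A sketch with a hypothetical four-row transaction log (column names are invented; in practice the frame would come from pd.read_csv or pd.read_excel on the exported file):

```python
import pandas as pd

# Hypothetical transaction log; column names are illustrative.
tx = pd.DataFrame({
    "month":    ["2024-01", "2024-01", "2024-02", "2024-02"],
    "category": ["snacks", "drinks", "snacks", "drinks"],
    "units":    [10, 5, 8, 10],
    "revenue":  [100.0, 40.0, 96.0, 90.0],
})

# Total sales.
total_sales = tx["revenue"].sum()

# Average selling price per product category
# (total revenue divided by total units sold).
avg_price = tx.groupby("category").agg(revenue=("revenue", "sum"),
                                       units=("units", "sum"))
avg_price["avg_price"] = avg_price["revenue"] / avg_price["units"]

# Month-over-month percentage change in revenue.
monthly = tx.groupby("month")["revenue"].sum()
mom = monthly.pct_change() * 100

print(total_sales)
print(avg_price["avg_price"].to_dict())
print(mom.to_dict())
```

A clear prompt that names the desired aggregations and the grain (per category, per month) is usually enough for ChatGPT to generate the equivalent of this code.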
Output generated by ChatGPT on the right and output calculated with a BI tool on the left:
This is why, in the example that concludes this article, the code generated to respond to the user’s requests is exported.
Visualitics Team
This article was written and edited by one of our consultants.