Launched in November 2022 by OpenAI, ChatGPT quickly made its mark in the “human” world, turning terms like “AI”[1], “Machine Learning”[2], and “Language Model”[3] into almost common parlance. In a nutshell, ChatGPT is a complex artificial intelligence model (trained on a very large corpus of texts and fine-tuned with a combination of supervised and reinforcement learning), developed to generate textual content through conversational, chatbot-like interactions. That is, the user makes requests in natural language and evaluates the responses received.
But how can ChatGPT be used in the field of Data Analysis to speed up processes and suggest approaches? In this column, we will introduce some possible uses of ChatGPT in data analysis: from the generation of synthetic datasets to enriched exploratory analysis (with graphs and statistical models), and finally, to the necessary accompanying documentation.
ChatGPT Prompting Strategies
At the heart of interacting with ChatGPT lies the concept of the prompt, the user’s message that provides context and instructions for the interaction.
Generally, prompting strategies revolve around two phases, with more or less defined outlines:
1. Context Setting: in this phase, information is provided to the model. One can be more or less specific, but in general, the more accurate and complete the information provided, the more precise the generated responses tend to be.
2. Task Definition: here, the task that the model must perform is clearly and unambiguously established. A complex task can be divided into sequential sub-tasks, which allows precise control over what ChatGPT’s internal Code Interpreter executes and, if necessary, course corrections along the way.
These phases are crucial for effectively guiding ChatGPT in generating relevant and accurate responses, thus improving the overall interaction experience. In the field of data science, this ranges from requesting the writing of specific code for certain functionalities, such as managing data import and export, to analysis, including graphs and statistical models, and to the creation of documentation. But let’s get to the crucial point, exploring some possible uses of ChatGPT in data analysis, along with guidance for composing optimal prompts for each task.
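As a minimal sketch, the two phases described above can be kept separate while a prompt is being assembled. The helper name `build_prompt`, the "Context:" and "Task:" labels, and the sample sales scenario are illustrative assumptions, not a format that ChatGPT requires.

```python
from typing import Sequence


def build_prompt(context: str, task: str, subtasks: Sequence[str] = ()) -> str:
    """Assemble a prompt from the two phases: context setting and task definition."""
    parts = [f"Context: {context.strip()}", f"Task: {task.strip()}"]
    if subtasks:
        # A complex task is split into numbered, sequential sub-tasks.
        parts.append("Proceed step by step:")
        parts.extend(f"{i}. {step}" for i, step in enumerate(subtasks, start=1))
    return "\n".join(parts)


# Hypothetical scenario, used only for illustration.
prompt = build_prompt(
    context="I have a CSV file of monthly sales with the columns date, region, amount.",
    task="Write Python code that summarizes sales by region.",
    subtasks=["Load the file", "Group the rows by region", "Print the totals"],
)
print(prompt)
```

Keeping context and task as separate arguments makes it easy to reuse the same context across several tasks in one conversation.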
Starting with ChatGPT 4 (for Plus accounts), every conversation gives the user access to a virtual environment in which a Python interpreter (including the Advanced Data Analysis tool) is active, capable of executing code written in this language. But written by whom? By ChatGPT, of course!
Task 1: Code Generation
Example: Data Import/Export
Prompt Composition Strategies:
- Specify the programming language (e.g., Python).
- Indicate the file format (CSV/JSON…) and the desired data structure.
- Provide data examples or describe the data layout [optional but recommended if the structure is complex].
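For illustration, a prompt that follows the points above (Python, CSV input, JSON output, a two-column layout) might lead to code along these lines. This is a sketch using only the standard library, not actual ChatGPT output; the function name `csv_to_json`, the file names, and the sample columns are assumptions.

```python
import csv
import json
import tempfile
from pathlib import Path


def csv_to_json(csv_path, json_path):
    """Read a CSV file with a header row and export its rows as a JSON array."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # every value is read as a string
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    return rows


# Hypothetical sample data, written to a temporary folder for the demo.
work_dir = Path(tempfile.mkdtemp())
csv_file = work_dir / "sample_sales.csv"
json_file = work_dir / "sample_sales.json"
csv_file.write_text("region,amount\nNorth,120\nSouth,95\n", encoding="utf-8")

rows = csv_to_json(csv_file, json_file)
print(rows)
```

Note that `csv.DictReader` returns every field as text, so a real prompt should also state which columns must be converted to numbers or dates.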
Task 2: Code Generation
Example: Request to Implement a Feature
Prompt Composition Strategies:
- Specify the programming language (e.g., Python).
- Indicate the file format (CSV/JSON…) and the data structure, if you have a dataset.
- Clearly detail the feature to be implemented.
- Provide examples of the properties and/or the desired outcome’s format.
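The sketch below shows how these points might translate into a concrete request, together with one implementation that would satisfy it. The prompt text, the function name `total_by_region`, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical prompt covering language, data structure, feature, and output format.
PROMPT = (
    "In Python, given a list of dictionaries with the keys 'region' and 'amount', "
    "implement a function total_by_region that returns a dictionary mapping each "
    "region to the sum of its amounts, for example {'North': 150.0}."
)


def total_by_region(rows):
    """Sum the 'amount' values of the rows for each distinct 'region'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += float(row["amount"])
    return dict(totals)


sample_rows = [
    {"region": "North", "amount": "120"},
    {"region": "South", "amount": "95"},
    {"region": "North", "amount": "30"},
]
result = total_by_region(sample_rows)
print(result)  # {'North': 150.0, 'South': 95.0}
```

Stating the desired output format in the prompt, as done here with the example dictionary, makes it straightforward to check the generated code against a known case.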
Task 3: Synthetic Data Creation
Example: Creating a sample dataset with specific characteristics
Prompt Composition Strategies:
- Specify the application domain.
- Indicate the desired format and structure of the data (for tabular structure, specify: number of columns, headers, data types, value ranges… and define what they represent).
- Ask ChatGPT to provide a series of questions to better characterize the context.
- Provide data examples or describe the data layout [optional but recommended if the structure is complex].
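The points above can be collected into a small, reusable specification that is rendered as a prompt. The following is only a sketch: the `Column` class, the `dataset_prompt` helper, the wording of the request, and the customer-survey example are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    name: str         # column header
    dtype: str        # data type, e.g. "categorical" or "integer"
    description: str  # what the column represents
    values: str       # allowed categories or value range


def dataset_prompt(domain: str, n_rows: int, columns: List[Column]) -> str:
    """Render a synthetic-dataset request, ending with a request for questions."""
    lines = [
        f"Application domain: {domain}.",
        f"Generate a synthetic dataset in CSV format with {n_rows} rows "
        f"and {len(columns)} columns:",
    ]
    for col in columns:
        lines.append(f"- {col.name} ({col.dtype}): {col.description}; values: {col.values}")
    lines.append(
        "Before generating the data, ask me the questions you need "
        "to better characterize the context."
    )
    return "\n".join(lines)


# Hypothetical specification for a customer-survey use case.
spec = [
    Column("customer_segment", "categorical", "type of customer", "Retail, Business"),
    Column("satisfaction", "categorical", "survey answer", "Low, Medium, High"),
]
request = dataset_prompt("customer satisfaction surveys", 20, spec)
print(request)
```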
Suppose you are working on a demonstrative project, a use case to propose to potential clients, but you do not have an appropriate dataset. If you need to generate a modest number of values, mostly categorical (non-numerical) and with a fair amount of variability, you can certainly rely on ChatGPT to create a synthetic dataset. In the example below, the application domain is described first, then ChatGPT is asked to formulate a series of questions it deems useful to better understand the request and the task, and finally to synthesize the dataset.
The context and task are defined using the Ask Before Answer technique. The interaction begins by outlining the application context and continues with the indication of the task to be completed, followed by an explicit request to provide appropriate questions to increase the accuracy of the response, and thus guide the user in specifying all the relevant details. The following are some of the questions posed by ChatGPT, which, as you can guess, are at the heart of this prompting strategy, capable of guiding even less experienced users in the complete formulation of the request before it is taken over by the model.
Below is a possible response.
At this point, the generated text can be copied and pasted into a text editor and saved as a CSV file, ready to be used as a sample dataset, perhaps enriching it with a column for the date.
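The same step can also be scripted. The sketch below parses the pasted text, adds a date column, and saves the result as a CSV file. The sample text, column names, start date, and file name are hypothetical placeholders, not the dataset produced in the conversation.

```python
import csv
import io
import tempfile
from datetime import date, timedelta
from pathlib import Path

# Placeholder for the text copied from the chat (hypothetical values).
generated_text = """customer_segment,product_category,satisfaction
Retail,Electronics,High
Business,Furniture,Medium
Retail,Clothing,Low
"""


def add_date_column(csv_text, start, step_days=1):
    """Parse CSV text and add a 'date' column with evenly spaced ISO dates."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    rows = []
    for i, row in enumerate(reader):
        row["date"] = (start + timedelta(days=i * step_days)).isoformat()
        rows.append(row)
    return rows


def write_csv(rows, path):
    """Write a list of row dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


enriched = add_date_column(generated_text, start=date(2024, 1, 1))
out_path = Path(tempfile.mkdtemp()) / "sample_dataset.csv"
write_csv(enriched, out_path)
print(enriched[0])
```

Sequential dates are only one possible choice; random dates within a range would work equally well for a demonstration dataset.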
The dataset can then be explored further by adding some visualizations.
Conclusions
We’ve discovered how ChatGPT can revolutionize Data Analysis, from creating synthetic datasets to exploratory analysis and documentation. We’ve highlighted the importance of prompting strategies to achieve accurate results and how ChatGPT facilitates and enriches the work of analysts.
Don’t miss the next episode, where we will explore further useful applications of ChatGPT in Data Analysis.
Visualitics Team
This article was written and edited by one of our consultants.
Sources:
[1] www.trends.google.com
[2] www.trends.google.com
[3] www.trends.google.com