Stress testing AI in user research

Or how I interviewed 17 people in three days and wrote a report in one, with the help of 3 AIs.

Yulya Besplemennova
Bootcamp

--

Last week I had a very particular research experience, completely different from how I normally work, as I accepted a challenge to help our partner company in a state of emergency. I had to conduct 17 interviews on my own in 3 days and write the report in 1.

(This is absolutely not how we are used to working at oblo, and when I heard of the plan I thought it was crazy and that my brain would collapse, unable to deliver any quality outcome. Normally at oblo we have 2 people on interviews, alternating between conducting them and note taking, for 3 or at most 4 interviews per day, and then we synthesise everything together in continuous discussion. I know that lots of people are used to doing interviews on their own with various transcription tools, but I had never done that, and my biggest fear was doing the synthesis alone.)

So I decided that I really needed some additional brains to help, even if artificial ones, and used it as a chance to review the opportunities they bring to our workflows, which I now want to share.

My work setup for this task: Miro with the research wall on the left, report slides in the middle and the Synthesis helper GPT on the right :)

1. Note-taking

As mentioned, at oblo we normally do manual note-taking. Even though we have tried different transcription tools, they never fully satisfied us, as the outcomes are much longer than manual notes and full of noise. This makes them really difficult to process later, because you need to read a lot of very raw text. With human note takers, instead, we add a first level of filtering and synthesis already during the interview.

In this instance, I was fortunate that the platform used for the work included a built-in transcription tool with several extra features. In addition to providing a comprehensive transcript with notes synchronized to it, the platform offered summaries and insights generated by ChatGPT for each interview. Although these insights were not particularly in-depth, they proved immensely useful for subsequent review. When constructing the research wall with post-its and highlighting key details from all participants, I didn't need to sift through the entire transcript: I could quickly refer to the summaries and insights instead. While some crucial details were occasionally omitted, browsing through this information efficiently jogged my memory about specific elements mentioned by each user.

Key takeaways:

  • Summaries of interviews serve as an effective tool to refresh information for researchers. However, I would advise against using them for external audiences, as they lack the depth and human touch of the original content.
  • Manual note-taking remains superior to automatic transcripts. The latter often result in excessively lengthy texts that are challenging to process later, both for the human brain and, as we will see in the next step, for current AI technologies.

2. AI synthesis helper setup

Skip this part if not interested in the AI tools overview :)

Fortunately, the weekend fell between the three days of interviews and the report writing. I spent it in a state of anxious anticipation, bracing for the intense mental effort required on Monday, and dedicated it to exploring every tool that could potentially assist with the core task: feeding 200 pages of transcripts into an AI so that I could ask questions about them.

(I ended up very angry with OpenAI, which never properly explained the features of custom GPTs, so I tried every other tool before getting to the simplest way of solving the problem.)

  1. ChatGPT plugins reading PDFs — it turned out they couldn't go beyond a few pages of a PDF and were not capable of supporting me with an overview of all the users' info.
  2. Splitting transcripts into many messages to feed into ChatGPT manually — it remembered only the last messages and once again couldn't really help, lacking an overview of all the users.
  3. Eesel.ai — theoretically built exactly for this type of task, but it didn't seem to process everything either.
  4. Using Bing as a copilot for PDF reading — same problem of not getting beyond a certain number of pages, and Bing is also really hard to communicate with.
  5. Local indexing with the Simple index GitHub project — it is supposed to read docs locally on your computer, but I spent hours with ChatGPT trying to debug it without success (though my dad later managed to run it, so I'm going to explore it in the future).
  6. Personalised assistant with the OpenAI API — at some point, while googling something about the OpenAI API for the previous step, I stumbled upon assistants and realised I could create a personalised one (see the sketch after this list). It still behaved weirdly with 200 pages of text, but it was probably the closest to being able to process them. The downside of this method: you pay for API use on every query, which gets expensive with a large amount of data to process.
  7. Personalised ChatGPT — finally, after the assistants, I realised I could do the same directly from the My GPTs interface, where you can add documents as “knowledge” for that specific GPT. It still had issues with the length of the transcript, sometimes responding that it couldn’t look through all of the document in the given time, but it was getting closer to what I needed and was included in my monthly Plus membership.
  8. Miro AI — surprisingly, it turned out to be quite handy when asked questions about the research board contents.
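
For those curious about step 6, this is roughly what the assistant setup looked like in code. A minimal sketch, assuming the OpenAI Python SDK and the beta Assistants endpoints as they existed at the time; the file name, model and question are illustrative, not my exact ones.

```python
# Minimal sketch of a "synthesis helper" built on the beta Assistants API.
# Assumptions: OpenAI Python SDK v1.x as of late 2023; file name, model
# and question are placeholders, not the ones from the actual project.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the transcripts so the assistant can search them via retrieval
transcripts = client.files.create(
    file=open("interview_transcripts.pdf", "rb"),
    purpose="assistants",
)

assistant = client.beta.assistants.create(
    name="Synthesis helper",
    instructions=(
        "You are a user research assistant. Answer questions strictly "
        "based on the attached interview transcripts and quote users verbatim."
    ),
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[transcripts.id],
)

# Each conversation lives in a thread; a run executes the assistant on it
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Which pain points did users mention most often?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then read the newest message (the reply)
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

reply = client.beta.threads.messages.list(thread_id=thread.id)
print(reply.data[0].content[0].text.value)
```

Every such run is billed as regular API usage, which is what made this option pricier than the flat Plus subscription of option 7.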

Eventually I ended up using 3 of those AIs:

  • 2 versions of ChatGPT — one with the full transcript containing all the user quotes, and one with only the summaries and insights created by the interview platform, so that it could process the whole document and have an overview of all the contents.
  • Eesel with just the summaries, to double-check whether it would give me different answers than GPT (and sometimes it did).
  • Miro AI to add elements that might be missing.

*Note — the transcripts document didn’t contain any sensitive user information, so I felt it was OK to feed it to the chat. Still, I appreciated Bing’s claim that it wouldn’t use users’ docs for training, while ChatGPT stated nothing of the sort.

3. Writing the report

Luckily I had slide templates provided to fill in, so I didn’t need to think much about storytelling. With 3 AIs I felt a bit more confident about the process, and they helped me a lot, but their performance varied based on the tasks I gave them.

Warming up

  • Context immersion — before starting, I asked for a brief overview of trends and general knowledge connected to the topic of the research (prompts of the kind shown after this list). This proved beneficial, sparing me the need to comb through multiple pages of Google results for relevant developments.
  • Behavioural archetypes — my attempt to have GPT categorize the interviewees into main behavioural clusters, a standard practice in our work, yielded unsatisfactory results. The AI produced very generic classifications and struggled to assign individuals to these clusters: some archetypes included only one of the users, while others did not encompass any interviewees at all.
  • General insights — same issue: technically correct observations, but very generic and lacking depth.
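
To give an idea, the warm-up requests were along these lines (illustrative wording rather than my exact prompts):

```
Give me a brief overview of the current trends and general knowledge
connected to [topic of the research].

Based on the transcripts in your knowledge, group the interviewees into
main behavioural clusters and list which users belong to each cluster.

What are the main insights that emerge across all the interviews?
```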

Takeaway:

Despite these challenges, I believe GPT’s initial input was valuable in providing starting points for my own synthesis, helping me determine which aspects were applicable and which were not. Consequently, I adopted a strategy of formulating my thoughts first and then asking GPT to expand on them. Here’s how it went.

General Insights

Once some of them started emerging in my head, I just gave their titles to GPT and asked it to write them out in detail, with supporting quotes, based on the transcripts it had. It turned out to be quite good at writing the descriptions, requiring only minimal adjustments from me. However, the relevance of the quotes varied significantly: at times I was impressed by how well they matched the context, while at other times they lacked the strength to effectively support the argument.
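
In practice that meant prompts of this shape (with a placeholder where the actual insight title went):

```
One of the insights from the research is: "[insight title]". Based on
the transcripts in your knowledge, write a detailed description of this
insight and add 2-3 verbatim supporting quotes from different users.
```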

User profiles

Eventually, I identified four distinct groups of users myself. I asked GPT to describe them, adhering to a slide layout that I had screenshotted and uploaded. Impressively, GPT recognized all the text fields and the appropriate amount of text for each, accurately depicting the archetypes. But again, selecting appropriate quotes proved challenging: even when I specified the key points for the quotes to highlight, the results were not always useful. Once again, though, they helped me recall other things users had said.

The most challenging aspect was naming these groups in a way that accurately reflected their behaviors. Despite multiple revisions, I remained somewhat dissatisfied with the final names.

User journeys

This turned out to be the most difficult part, as the outputs from GPT were overly generic and failed to distinguish between the different user profiles. I had to write many things myself. I also asked it to generate a general list of all the pain points people mentioned at certain moments of the journey, and then evaluated myself which of those would fit which archetype.

Opportunities summary

This was probably the most impressive moment of all: I just pasted a screenshot of my post-its, each holding a couple of keywords, into GPT, and it was capable of expanding the points with descriptions that perfectly fit the context, while also suggesting additional points I might have missed.

Practical conclusions

Without a doubt, I couldn’t have completed this work so swiftly on my own. While GPT had its limitations in certain aspects, its capabilities in others, such as developing opportunity statements from a single keyword, were truly impressive. The final report doesn’t delve as deeply as our typical work at Oblo, but it serves its purpose in providing a rapid overview of the situation.

The primary benefit of using AI, I believe, is its role in simulating a discussion. This simulated dialogue offers valuable input, helping to sustain momentum in the thought process and prevent cognitive blocks. The quality of this interaction can be likened to working with an intern new to this type of task — they might struggle with independently creating user profiles or distinguishing deep insights from generic ones, but they can be productive given clear guidance.

Another significant advantage is the AI’s ability to process and query large volumes of text (though this still needs a lot of technical improvement). It can answer specific questions about the entire body of research, a task that would otherwise require manually reviewing all the notes.
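
The context window was the main bottleneck here. One known workaround, which I didn’t use myself but which would automate the manual splitting I attempted in step 2 of the tools list, is a simple map-reduce pattern: ask the question of each chunk separately, then merge the partial answers. A minimal sketch (model name and chunk size are arbitrary):

```python
# Map-reduce querying of a transcript longer than the context window.
# Not what I actually used; a sketch of one possible workaround.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-1106-preview"  # placeholder model name

def ask_chunked(transcript: str, question: str, chunk_chars: int = 12000) -> str:
    chunks = [transcript[i:i + chunk_chars]
              for i in range(0, len(transcript), chunk_chars)]
    # Map: answer the question against each chunk independently
    partials = []
    for chunk in chunks:
        reply = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": "Answer only from the excerpt given."},
                {"role": "user", "content": f"Excerpt:\n{chunk}\n\nQuestion: {question}"},
            ],
        )
        partials.append(reply.choices[0].message.content)
    # Reduce: merge the per-chunk answers into one coherent answer
    merged = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Combine these partial answers into one coherent answer:\n\n"
                       + "\n---\n".join(partials),
        }],
    )
    return merged.choices[0].message.content
```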

However, I would caution against leaving less experienced people to work solely with AI. I believe I could manage it well thanks to years of experience and strong critical thinking, as well as specific experience in the context of human-technology interaction, which was the topic of the research. This background enabled me to discern what was relevant and what was not. For those with less experience, there’s a risk of relying too heavily on initial AI-generated results without engaging in further critical thinking or exploration.

Philosophical conclusion

A word of caution: don’t attempt to replicate this experience at home. Conducting back-to-back interviews with only 15-minute breaks is taxing on the mind, even with AI assistance. And this raises a significant point for consideration. When I was recently at SDGC in Berlin, among the multiple talks about AI there was one particular slide from IBM that made me very worried:

It highlighted a concerning prospect: the speeding up of all activities along the typical design process. This makes me really worried about the state of the human mind and wellbeing once such conditions become the standard across the sector and normalize clients’ expectations of accelerated delivery.

And I’m really afraid of a world in which we think that, once we have AI, we can avoid having 2 people dedicated to the research (and I think that not even AGI would help here, as it will never have human experience). Of course, for some level of broad exploratory work it could be an option, but it won’t provide the depth of human dialogue and synthesis.

As the saying goes, nine pregnant people can’t deliver a baby in one month; we need to discern where AI can reasonably accelerate processes, and where we simply need all the human time to think and process in order to reach the depth of the insights.
