Memory Squared

Answering the Call to Help Telenor Launch Their “Missed Calls” Campaign – Part 2

Rafał Rybarczyk

Memory Shared is a new series of articles where we explore some of the most interesting case studies from our client projects. Like actual memories, each project memory we share is multilayered, so we've decided to make each memory a two-parter. Part 1 explores the business context and product strategy behind each project, while Part 2 explores the technical solutions we came up with, featuring deep dives into the frameworks, workarounds, and coding heroics of our team. Enjoy the memories and don't hesitate to share!


In Part 2 of Memory Shared: Telenor, I'm excited to dive a bit deeper into the technical side of the solution we came up with for the Missed Calls campaign. As you may remember from Part 1, Nord DDB approached us with a deceptively simple request on behalf of their client, the mobile carrier Telenor: build a webpage that lets Telenor's users enter their phone number and voicemail PIN, then retrieve their voicemail messages as downloadable audio files. On paper it seemed simple, but if you've ever tried to use a mobile voicemail system, you'll understand what happened next.

“Missed Calls” Project Overview  

Client: Telenor & Nord DDB
Year: 2024
Project Description: A frontend web app and backend voicemail API to retrieve voicemail messages for Telenor Sweden users
Team Involved: Tech Lead, Frontend Developer, Backend Developer, Project Manager
Technologies: Node.js, MySQL, custom voice interface integration (Telnyx), React.js

Answering the Call: Understanding the Problem

Once we were fully onboarded to the project, we dug into the technical documentation and discovered that Telenor exposed no API, no documentation, and no programmatic access whatsoever. The only interface available was a phone number that users could dial, navigating a Swedish-language Interactive Voice Response (IVR) system and pressing the digits corresponding to each function. That was the entire integration surface.

Before we could even think about authentication or message retrieval, we had to answer a more basic question: does every phone number have a voicemail inbox? There was no endpoint to check. The only way to find out was to behave like a human caller. So we started building what we ended up calling a "Probe Call." To get the probe call system off the ground, we needed to figure out how to place and control phone calls from a backend service. We quickly realized we needed more than just outbound calling functionality, so we gathered the specs and requirements our probe call system would have to meet.

Requirements & Functionality of our Probe Call System:  

    • Programmatically dial numbers and send DTMF tones
    • Receive real-time webhooks for call state changes
    • Stream live transcription to infer IVR state
    • Record calls for later audio processing
    • Operate within EU data residency requirements

We knew it would be too challenging to develop such a sophisticated telephony solution in-house, so we tested a few third-party solutions and settled on Telnyx as our telephony provider. Telnyx offered a combination of features that mapped directly to our needs: programmable voice with fine-grained call control, reliable webhook delivery, real-time transcription, and built-in call recording, all accessible through a straightforward Node.js SDK. Just as importantly, it provided EU-based infrastructure, which simplified compliance concerns around handling user data.

A typical voicemail call flow is driven almost entirely by webhooks — the API initiates the call, and everything that follows (answer detection, transcription events, recording availability) arrives asynchronously. That model shaped much of our system’s architecture. From there, the system reacts to events like call.answered, call.machine.detection.ended, call.transcription, and call.hangup — effectively turning every phone call into a distributed, event-driven workflow.

```ts
interface CreateCallParams {
  callId: string;
  from: string;
}

// Service method that places the outbound probe call via Telnyx.
// Webhooks for this call are routed to a per-call URL using callId.
async createVoicemailProbeCall({ callId, from }: CreateCallParams): Promise<string> {
  const request = {
    connection_id: this.configService.get<string>('telnyx.connectionId'),
    to: this.configService.get<string>('telnyx.voiceMailPhoneNumber'),
    from,
    answering_machine_detection: 'detect_words',
    webhook_url: `${this.configService.get<string>('telnyx.webhookUrl')}/webhook/${callId}`,
  };

  try {
    const response = await this.client.calls.create(request);
    const callSessionId = response.data.call_session_id;

    this.logger.log({
      event: 'call.created',
      provider: 'telnyx',
      callId,
      callSessionId,
      from: this.maskPhoneNumber(from),
    });

    return callSessionId;
  } catch (error) {
    this.logger.error({
      event: 'call.failed',
      provider: 'telnyx',
      callId,
      from: this.maskPhoneNumber(from),
      error: error?.message,
    });

    throw new Error('Failed to initiate voicemail probe call');
  }
}
```

In effect, our system dialed the carrier’s voicemail number via Telnyx, entered a target phone number, and listened to the response. If the system responded with a prompt asking for a PIN, we knew the voicemail box existed. If it hung up or rerouted elsewhere, it didn’t. In practice, this required wiring together call control, real-time transcription, and a small state machine driven entirely by asynchronous webhooks. 

Probe Call System Webhooks:

    • The call is initiated without a PIN
    • Once the call is answered, we start real-time transcription
    • We send only the phone number via DTMF
    • A transcription listener watches for phrases equivalent to “enter your PIN”
    • If detected, the call is marked as COMPLETED — voicemail exists
    • If the call ends before that, we treat it as NO_VOICEMAIL

This webhook system may look and sound deterministic. But it wasn't. The webhook events arrived out of order. Transcription lagged behind the audio. Sometimes the hangup event arrived before the phrase we were waiting for. To handle this, the hangup handler deliberately waited before making a final decision – a small delay that gave the transcription pipeline time to catch up and resolve the race condition.

Even detecting a simple phrase like “enter your PIN” turned out to be unreliable.

Robocall of Duty: Pinpointing Our Solution 

The Telenor voicemail system operated exclusively in the Swedish language, and real-time speech-to-text output was inconsistent enough that exact matching wasn’t a viable solution. The same phrase could appear in dozens of slightly different forms depending on timing, audio quality, and model interpretation.

So we ended up building a two-stage matching layer:

    • A dictionary of known phrase variations (dozens of key prompts)
    • A sliding-window scan that prioritized the most recent transcript fragments, where state transitions were most likely to occur

In other words, we weren't just listening for words – we were inferring system state from the imperfect signals of a live phone call. Only after confirming that a voicemail inbox existed were we able to proceed with user authentication.

At that point, the flow became more conventional. The user verified ownership of the phone number via a one-time password sent over SMS using Telnyx. This established the first trust boundary: proving that the caller controls the number they’re attempting to access.

By this point, two things were true: the phone number had an active voicemail box, and the user had verified ownership of that number via SMS. What remained was the part that actually unlocked access – the voicemail PIN.

Without the PIN, the system was able to dial into the carrier's IVR, but it couldn't progress past the first prompt. With it, we gained full access to a user's private voice messages. That made the PIN one of the most sensitive pieces of data in the entire system. Unlike an OTP, which expires in minutes, a voicemail PIN is a long-lived secret tied directly to personal communication. Mishandling it would effectively expose a user's private messages. Because of that, we intentionally constrained how we handled the PIN.

Rather than transmitting the PIN in plaintext or relying on naive encryption, we used a hybrid encryption scheme. The client encrypted the PIN using a one-time symmetric key (AES-256-GCM), which was then encrypted with the server’s public key. The server stored only the encrypted payload.
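A minimal sketch of that envelope-encryption flow using Node's crypto module (the function names and payload shape here are our own for illustration, not the production code):

```typescript
import {
  generateKeyPairSync,
  randomBytes,
  createCipheriv,
  createDecipheriv,
  publicEncrypt,
  privateDecrypt,
  constants,
} from 'crypto';

// Client side: encrypt the PIN with a one-time AES-256-GCM key,
// then wrap that key with the server's RSA public key.
function encryptPin(pin: string, serverPublicKeyPem: string) {
  const key = randomBytes(32); // one-time symmetric key
  const iv = randomBytes(12);  // GCM nonce
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(pin, 'utf8'), cipher.final()]);
  const authTag = cipher.getAuthTag();
  const wrappedKey = publicEncrypt(
    { key: serverPublicKeyPem, padding: constants.RSA_PKCS1_OAEP_PADDING },
    key,
  );
  // The server stores only this payload; the plaintext PIN never leaves the client.
  return { ciphertext, iv, authTag, wrappedKey };
}

// Server side, at call time: unwrap the key, decrypt the PIN in memory, use it, discard it.
function decryptPin(
  payload: ReturnType<typeof encryptPin>,
  serverPrivateKeyPem: string,
): string {
  const key = privateDecrypt(
    { key: serverPrivateKeyPem, padding: constants.RSA_PKCS1_OAEP_PADDING },
    payload.wrappedKey,
  );
  const decipher = createDecipheriv('aes-256-gcm', key, payload.iv);
  decipher.setAuthTag(payload.authTag);
  return Buffer.concat([
    decipher.update(payload.ciphertext),
    decipher.final(),
  ]).toString('utf8');
}
```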

At call time, the PIN was decrypted in memory, sent as DTMF tones, and immediately discarded. It was never logged, never persisted in plaintext, and existed only for the brief moment it was needed. The guiding principle was simple: minimize the lifetime and exposure of sensitive data. Encrypt at the boundary, decrypt only at the moment of use, and never store what you don’t need. With voicemail existence verified, identity confirmed, and credentials handled safely, the real challenge began.

After the Beep: Distributing Outbound Voicemail Calls

Database-backed Call Distribution 

One practical problem appeared before any transcription or audio processing: outbound calls could not all originate from a single number. At low volume, using one caller ID works. At higher volume, repeatedly dialing the same carrier voicemail system from the same originating number increases the risk of rate limiting, call filtering, or degraded reliability. We needed a way to spread active calls across a pool of outbound numbers.

We solved this problem at the application layer. Each new call queried the database for the numbers currently in use, counted the active calls assigned to each, and selected the least-loaded candidate. In practice, this was enough: simple, predictable, and requiring no additional infrastructure.
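The selection step itself reduces to a few lines. In sketch form (the table and column names in the comment are assumptions):

```typescript
// The per-number active-call counts would come from a query along the lines of:
//   SELECT from_number, COUNT(*) AS active_calls
//   FROM calls WHERE status = 'ACTIVE' GROUP BY from_number;
function pickLeastLoadedNumber(
  pool: string[],
  activeCounts: Map<string, number>,
): string {
  let best = pool[0];
  let bestLoad = activeCounts.get(best) ?? 0;
  for (const candidate of pool.slice(1)) {
    const load = activeCounts.get(candidate) ?? 0; // numbers with no rows are idle
    if (load < bestLoad) {
      best = candidate;
      bestLoad = load;
    }
  }
  return best;
}
```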

This was not sophisticated scheduling, but it didn’t need to be. The goal was to avoid concentrating traffic on a single outbound number. For our workload, a small amount of database-driven balancing was sufficient.

Turning a Phone Call into a State Machine

Distributing calls across outbound numbers solved one operational problem. It did not solve the harder one: once the call connected, the entire system had to infer state from a stream of asynchronous webhook events and imperfect live transcription.

At that point, a phone call stopped being “just a call” and became an event-driven workflow.

That workflow had to answer questions like:

    • Has the carrier answered?
    • Are we speaking to a machine or a human-operated IVR?
    • Has the system asked for the PIN yet?
    • Was the PIN rejected?
    • Are there new messages?
    • Are there no more messages?
    • Has the call ended normally, or did it fail before we understood why?

None of the answers to these questions came from a single reliable source. They had to be reconstructed from webhook timing, DTMF actions, and noisy Swedish speech-to-text output.

A Webhook-driven Call State Machine 

Telnyx exposes call lifecycle events as webhooks: call.answered, call.machine.detection.ended, call.transcription, call.recording.saved, and call.hangup.

That sounds straightforward until you build on top of it.

As I previously mentioned, these events are asynchronous, can arrive close together, and do not always line up neatly with application expectations. The system therefore treats the call as a state machine, with each webhook advancing or resolving a part of the flow.

At a high level, the service does four things:

    • Verifies webhook authenticity
    • Loads the current persisted call state
    • Dispatches handling based on event type
    • Updates call status and triggers follow-up actions

A simplified version looks like this:
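In sketch form, with status names and branch bodies as illustrative assumptions rather than our production code:

```typescript
type CallStatus = 'DIALING' | 'IN_PROGRESS' | 'COMPLETED' | 'FAILED' | 'NO_VOICEMAIL';

interface CallState {
  callId: string;
  status: CallStatus;
  transcript: string[];
}

// Dispatch a verified webhook against the persisted call state.
// Each branch must tolerate events arriving late or out of order.
function handleWebhook(
  state: CallState,
  eventType: string,
  payload: Record<string, unknown>,
): CallState {
  switch (eventType) {
    case 'call.answered':
      return { ...state, status: 'IN_PROGRESS' };
    case 'call.machine.detection.ended':
      // Decide probe vs. retrieval flow, start transcription, send DTMF…
      return state;
    case 'call.transcription':
      // Append the fragment; phrase matching runs over the updated transcript.
      return { ...state, transcript: [...state.transcript, String(payload.text ?? '')] };
    case 'call.hangup':
      // Don't decide immediately: a late transcription event may still flip the outcome.
      return state.status === 'COMPLETED' ? state : { ...state, status: 'NO_VOICEMAIL' };
    default:
      return state; // Unknown events are ignored, never fatal.
  }
}
```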

The important part is not the switch statement itself; it's that each branch operates under uncertainty. A call can already be ending while transcription is still catching up. A recording webhook can arrive after the call state has already been marked as failed. A phrase indicating success can appear seconds before or after the hangup event that contradicts it.

That is why almost every transition has to be defensive.

Call Control through DTMF & Live Transcription 

Once machine detection completed, the system had to decide what type of call it was handling. For a voicemail existence check, it started transcription and sent only the phone number. For a full retrieval call, it started transcription, began recording, decrypted the user's PIN, and then sent both the phone number and the PIN as DTMF tones.

The DTMF sequence itself had to be tuned empirically. The carrier IVR was sensitive to timing, so sending digits too early or too quickly would fail unpredictably. In practice, the most reliable pattern was to add explicit wait intervals before entering the phone number and then once again before entering the PIN.
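In sketch form, the timed sequence looked something like this. The wait values below are placeholders, not the empirically tuned production values, and sendDtmf stands in for the provider's send-DTMF call command:

```typescript
interface DtmfStep {
  waitMs: number; // explicit pause before sending
  digits: string; // DTMF digits to send
}

// Build the plan: wait for the IVR greeting, send the number,
// then wait again before the PIN prompt.
function buildDtmfPlan(phoneNumber: string, pin?: string): DtmfStep[] {
  const plan: DtmfStep[] = [{ waitMs: 3000, digits: phoneNumber }];
  if (pin !== undefined) {
    plan.push({ waitMs: 2000, digits: pin });
  }
  return plan;
}

// Execution is just sequential sends with explicit delays between them.
async function executeDtmfPlan(
  plan: DtmfStep[],
  sendDtmf: (digits: string) => Promise<void>,
): Promise<void> {
  for (const step of plan) {
    await new Promise((resolve) => setTimeout(resolve, step.waitMs));
    await sendDtmf(step.digits);
  }
}
```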

This kind of logic looks fragile because it is fragile. It is automation against a human-oriented IVR, so timing becomes part of the protocol whether you like it or not.

Swedish Speech Recognition Required a Fuzzy Matching Layer

Live transcription was essential because the entire call flow depended on understanding what the carrier’s voicemail system was saying in real time. In theory, that sounds like a straightforward speech-to-text problem. In practice, it absolutely wasn’t.

The prompts were all in Swedish, delivered over phone audio, and transcribed live while the call was still in progress. The result was noisy enough that exact string matching failed almost immediately. The same phrase could be rendered in numerous different ways depending on timing, line quality, speaker cadence, and model behavior. Instead of treating transcription output as truth, we had to treat it as a noisy signal.

To make it usable, we built a matching layer with three components:

    • A dictionary of known variants for key prompts
    • Exact matching for the obvious cases
    • Levenshtein similarity as a fallback for imperfect matches 

We also biased matching toward the end of the transcript first, since the most recent words were usually the most relevant to the current state of the call.
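A condensed sketch of that matcher follows. The Swedish variants shown are illustrative, not the full production dictionary, and the similarity threshold is an assumed value:

```typescript
// Dictionary of known phrase variations, keyed by intent.
const PROMPT_VARIANTS: Record<string, string[]> = {
  typePin: ['slå din kod', 'slå in din kod', 'ange din kod'],
};

// Classic dynamic-programming edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) => [
    i,
    ...Array(b.length).fill(0),
  ]);
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
    }
  }
  return dp[a.length][b.length];
}

// Stage 1: exact substring hit. Stage 2: Levenshtein similarity as fallback.
function matchIntent(fragment: string, threshold = 0.8): string | null {
  const text = fragment.toLowerCase().trim();
  for (const [intent, variants] of Object.entries(PROMPT_VARIANTS)) {
    for (const variant of variants) {
      if (text.includes(variant)) return intent; // exact hit
      const similarity =
        1 - levenshtein(text, variant) / Math.max(text.length, variant.length);
      if (similarity >= threshold) return intent; // fuzzy hit
    }
  }
  return null;
}
```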

This gave us a reliable way to infer intents such as:

    • Enter PIN
    • Wrong PIN
    • No voicemail
    • No new messages
    • No more new messages
    • New message 

But this is where things get counterintuitive. We didn’t just use transcription for observability or debugging. We used it to control the call itself.

Transcription Wasn't Just Metadata: It Drove Call Control

The transcription stream became the decision engine for the entire system.

Every detected phrase could immediately trigger a state transition or a side effect:

    • typePin during a probe call → voicemail exists
    • wrongPin → mark call as failed and hang up
    • youHaveNoNewMessage → stop processing early
    • youHaveNoMoreNewMessage → finish and optionally trigger follow-up logic
    • newMessage → emit a marker used later in audio processing

This creates an unusual dynamic: the system is making real-time decisions based on data it does not fully trust.

That’s why the fuzzy matching layer isn’t just a convenience — it’s a safety mechanism. Without it, small transcription inconsistencies would translate directly into incorrect behavior. At this point, the system stops looking like a conventional backend service and starts behaving more like a control loop:

listen → interpret → act → repeat 

Each step depends on imperfect input, and each decision affects what happens next in a live phone call.

When a Hangup isn’t the End of the Story 

One subtle problem appeared during voicemail detection. If the probe call ended before the system heard a “type your PIN” prompt, the natural conclusion was that the number had no voicemail box. But in real calls, the final transcription event and the hangup webhook could arrive almost simultaneously.

That created a race: the call might actually be valid, but the hangup handler could mark it as NO_VOICEMAIL before the transcription handler had time to mark it as successful.

The fix was simple and very intentional: do not decide immediately.
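In sketch form, the grace period looks like this (the delay value shown is an assumption; in practice it was tuned by hand):

```typescript
type ProbeOutcome = 'COMPLETED' | 'NO_VOICEMAIL';

// On hangup, wait briefly before concluding, then re-check whether the
// transcription handler marked the call as successful in the meantime.
async function decideAfterHangup(
  readStatus: () => ProbeOutcome | 'PENDING',
  graceMs = 1500, // assumed value
): Promise<ProbeOutcome> {
  await new Promise((resolve) => setTimeout(resolve, graceMs));
  return readStatus() === 'COMPLETED' ? 'COMPLETED' : 'NO_VOICEMAIL';
}
```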

Please Leave a Message: Turning Call Recordings into Individual Voicemail Files

Recording the voicemail session was only the first step. The real goal was to turn that single call recording into something the user could actually use: one clean audio file per message, with the carrier’s IVR prompts removed. At first, transcription looked like the obvious way to do this.

In theory, the live transcription stream contained all the markers we needed - phrases like “next message”, “received at”, and other prompts that separate one voicemail from the next. But in practice, transcription was the wrong source of truth for audio segmentation. There were two reasons.

First, recording delivery and transcription events were asynchronous. By the time a transcription fragment arrived, we knew what was said, but not with enough precision to map it back to an exact timestamp in the final recording. The only reliable timestamp we had was when the webhook reached our system, not when the phrase occurred in the call audio.

Second, we deliberately chose not to persist transcription data. The recordings could contain highly sensitive voicemail content, and we wanted to minimize the amount of derived user speech data we stored. That decision improved privacy, but it also meant transcription could not serve as a durable indexing layer for later processing.

So instead of treating transcription as the source for message boundaries, we moved the problem into the audio domain.

Audio Fingerprinting Instead of Transcript Timing 

The breakthrough was a small .NET service built on top of the SoundFingerprinting library.

Rather than trying to reconstruct message boundaries from webhook timing, we created a set of short audio samples for the carrier phrases that reliably appear around message transitions, such as "next message" and "received at".

These samples were fingerprinted in advance and stored in memory. When a full voicemail recording arrived, the service scanned it for matching audio signatures and returned approximate positions for those known prompts.

This gave us something transcription could not: markers anchored directly to the recording itself.

At startup, the fingerprinting service loads a library of short carrier prompt samples, hashes them once, and stores the resulting fingerprints in memory for later matching.

Fingerprints found the prompts. Silence found the cuts. 

Fingerprinting alone was not enough to extract usable voicemail files.

A prompt like “next message” tells you roughly where a message begins, but not where it should be cleanly cut. To solve that, we combined fingerprint matches with silence detection from FFmpeg.

FFmpeg scans the waveform for silence intervals:
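A typical silencedetect pass, wrapped in Node. The noise threshold and minimum silence duration below are assumed values, not our production settings:

```typescript
import { execFile } from 'child_process';

// Run FFmpeg's silencedetect filter over a recording. The filter writes its
// findings to stderr; nothing is transcoded (-f null).
function detectSilence(recordingPath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(
      'ffmpeg',
      ['-i', recordingPath, '-af', 'silencedetect=noise=-30dB:d=0.5', '-f', 'null', '-'],
      (err, _stdout, stderr) => (err ? reject(err) : resolve(stderr)),
    );
  });
}

interface SilenceInterval {
  start: number;
  end: number;
}

// Pull "silence_start: 12.34" / "silence_end: 15.67" pairs out of the stderr log.
function parseSilenceIntervals(log: string): SilenceInterval[] {
  const extract = (key: string): number[] => {
    const re = new RegExp(key + ': ([\\d.]+)', 'g');
    const out: number[] = [];
    let m: RegExpExecArray | null;
    while ((m = re.exec(log)) !== null) out.push(Number(m[1]));
    return out;
  };
  const starts = extract('silence_start');
  const ends = extract('silence_end');
  return starts.map((start, i) => ({ start, end: ends[i] ?? start }));
}
```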

We parse the resulting silence_start and silence_end timestamps and use them as natural cut points.

The pipeline worked like this: 

  1. Download the call recording
  2. Fingerprint it against known IVR prompt samples
  3. Run silence detection across the full waveform
  4. Align each detected prompt with the nearest silence boundaries
  5. Trim the audio into one file per voicemail
  6. Upload the final clips to S3
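Step 4 hinges on a small alignment routine. A sketch, with function names and data shapes of our own choosing:

```typescript
// Snap each detected prompt position to the nearest silence boundary
// so every cut lands in quiet audio.
function snapToNearestBoundary(promptAt: number, boundaries: number[]): number {
  if (boundaries.length === 0) return promptAt;
  let nearest = boundaries[0];
  for (const b of boundaries) {
    if (Math.abs(b - promptAt) < Math.abs(nearest - promptAt)) nearest = b;
  }
  return nearest;
}

// Turn snapped prompt positions into [start, end] spans, one per voicemail,
// ready to hand to FFmpeg for trimming.
function messageSpans(
  promptPositions: number[],
  boundaries: number[],
): [number, number][] {
  const cuts = promptPositions
    .map((p) => snapToNearestBoundary(p, boundaries))
    .sort((a, b) => a - b);
  const spans: [number, number][] = [];
  for (let i = 0; i < cuts.length - 1; i++) {
    spans.push([cuts[i], cuts[i + 1]]);
  }
  return spans;
}
```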

The key idea is simple but important:

    • Fingerprinting tells us what happened and roughly when
    • Silence detection tells us where to cut cleanly

Together, they turn one continuous recording into structured, per-message audio - without relying on fragile transcription timing.

Handling Noisy Matches & Overlapping Samples

The real recordings weren’t clean. Some prompts overlapped. Some short variants matched close to longer ones. Others appeared within seconds of each other and would create duplicate or incorrect boundaries if left untreated.

So we added a normalization layer:

    • sort matches by timestamp
    • deduplicate within a tolerance window
    • prefer longer, more informative samples over short variants
    • ignore anything that isn’t a message boundary
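A sketch of that normalization pass (field names and the tolerance value are assumptions):

```typescript
interface FingerprintMatch {
  sample: string;       // which prompt sample matched
  at: number;           // position in the recording, seconds
  sampleLength: number; // duration of the matched sample, seconds
}

// Sort matches by timestamp, deduplicate within a tolerance window,
// and prefer the longer (more informative) sample when two collide.
function normalizeMatches(
  matches: FingerprintMatch[],
  toleranceSec = 2,
): FingerprintMatch[] {
  const sorted = [...matches].sort((a, b) => a.at - b.at);
  const result: FingerprintMatch[] = [];
  for (const m of sorted) {
    const last = result[result.length - 1];
    if (last && Math.abs(m.at - last.at) <= toleranceSec) {
      if (m.sampleLength > last.sampleLength) {
        result[result.length - 1] = m; // keep the longer variant
      }
    } else {
      result.push(m);
    }
  }
  return result;
}
```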

This logic isn’t particularly elegant, but it’s what made the output usable in production.

Why this approach worked 

This ended up being a much better fit than transcript-based segmentation.

The transcription was still useful for real-time control during the call, but it was too asynchronous and too imprecise to cut the final audio reliably. Fingerprinting and silence detection, by contrast, worked directly on the recording we were actually delivering to the user.

That separation turned out to be important:

    • transcription was for live decision-making
    • audio fingerprinting + silence detection was for post-call media extraction 

Once we split the system that way, the pipeline became much more deterministic.

Rafał Rybarczyk

Chief Technology Officer
