How to deal with long multi-person audio recordings, for example a meeting recording?


I want to build software that deals with long multi-person audio recordings, for example a meeting recording. By "deal with" I mean:
do speaker separation.
do ASR.
do summarization of each speaker's content.
do summarization of the whole meeting content.

Is it possible to build such software now? Any suggestions?


It is possible, but not out of the box. This is an extremely challenging setting, and it will most likely work quite poorly (but this would be the case with any existing toolkit or model). You already stated the different steps; you just need to combine them in your own recipe now. We can help you along the way if you encounter difficulties in implementing one of these steps. But every single step is challenging on its own. For instance, speaker separation with overlapping speech is hard …
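As a rough illustration of what such a recipe could look like, here is a minimal sketch of how the four steps might be chained. Everything in it is an assumption made for illustration: `Segment`, `process_meeting`, and the three callables `diarize`, `transcribe`, and `summarize` are hypothetical names, not the API of any toolkit. Each callable stands for whatever model you end up using for that step.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Segment:
    speaker: str   # speaker label assigned by the separation/diarization step
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds


def process_meeting(
    audio_path: str,
    diarize: Callable[[str], List[Segment]],     # hypothetical: who spoke when
    transcribe: Callable[[str, Segment], str],   # hypothetical: ASR on one segment
    summarize: Callable[[str], str],             # hypothetical: text summarizer
) -> Tuple[Dict[str, str], str]:
    """Return (per-speaker summaries, whole-meeting summary)."""
    # Step 1: split the recording into speaker-labelled segments, in time order.
    segments = sorted(diarize(audio_path), key=lambda s: s.start)

    # Step 2: run ASR on every segment and group the text by speaker.
    per_speaker = defaultdict(list)
    full_transcript = []
    for seg in segments:
        text = transcribe(audio_path, seg)
        per_speaker[seg.speaker].append(text)
        full_transcript.append(f"{seg.speaker}: {text}")

    # Step 3: summarize what each speaker said.
    speaker_summaries = {
        spk: summarize(" ".join(texts)) for spk, texts in per_speaker.items()
    }

    # Step 4: summarize the whole meeting from the time-ordered transcript.
    meeting_summary = summarize("\n".join(full_transcript))
    return speaker_summaries, meeting_summary
```

Keeping each step behind a plain function makes it possible to swap in a better model for one step later without touching the rest. Note that this sketch assumes non-overlapping segments, which is exactly the part that breaks down with overlapping speech.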

Thanks for the reply!

I realize this is a challenging problem, and I think there will be more and more software of this kind in the future. Maybe we can call it "deep learning based software".

Deep learning based software can be made of several modules. Each module is a function (or an algorithm), and it can be either a neural network or hand-written code. If it is a NN, I think it can be "living", meaning it can improve itself on the fly.

Please refer to this link:

I borrowed some words from it:
Continual learning systems adapt models to the evolving world by repeatedly retraining them on newly curated production data.

As I understand it, for this kind of software we can build the software structure even if some technique is not mature enough yet, for example speaker separation. Because it is a living module, it can improve itself on the fly. For example, if we customize the software for a small group of people, we can use the meeting recordings of these people to retrain the network continuously, so that the network overfits a little to that group.
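To make the idea a bit more concrete, here is a minimal sketch of such a living module. It is only an illustration under my own assumptions: `LivingModule`, `add_recording`, and `retrain_every` are hypothetical names, and `fine_tune` stands for whatever retraining routine the real network would need.

```python
from typing import Any, Callable, List


class LivingModule:
    """Hypothetical wrapper that retrains a model as new recordings arrive."""

    def __init__(
        self,
        model: Any,
        fine_tune: Callable[[Any, List[str]], Any],  # assumed: returns the updated model
        retrain_every: int = 10,
    ):
        self.model = model
        self.fine_tune = fine_tune
        self.retrain_every = retrain_every  # recordings to collect before retraining
        self.buffer: List[str] = []         # paths of recordings not yet trained on
        self.retrain_count = 0

    def add_recording(self, path: str) -> bool:
        """Store a new recording; return True if it triggered a retraining."""
        self.buffer.append(path)
        if len(self.buffer) < self.retrain_every:
            return False
        # Enough new data from this group: retrain on it, then start a new batch.
        self.model = self.fine_tune(self.model, list(self.buffer))
        self.buffer.clear()
        self.retrain_count += 1
        return True
```

Calling `add_recording` after every meeting would then retrain the module once per `retrain_every` meetings, using only the recordings of that small group.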

I think NN based software is different from ordinary software, which has the same function for everyone. NN based software is personalized. In the future, if you buy a robot from a shop, the robot will continually improve its skills by communicating with you.

Just some thoughts, and I think this direction is interesting :grinning:

