Creating/adding timestamps for each spoken word and for when each person starts/stops speaking

I wanted to know whether SpeechBrain can read an audio file and provide the exact time at which each word is spoken, as well as the times at which each person starts and stops speaking, producing a JSON file similar to this:

{
  "results": [
        {
          "timestamps": [
            [
              "hello",
              0.68,
              1.19
            ],
            [
              "yeah",
              1.47,
              1.91
            ],
            [
              "yeah",
              1.96,
              2.12
            ],
            [
              "how's",
              2.12,
              2.59
            ],
            [
              "Billy",
              2.59,
              3.17
            ],
            [
              "good",
              4.01,
              4.30
            ]
          ],
          "transcript": "hello yeah yeah how's Billy good"
        }
  ],
  "speaker_labels": [
    {
      "from": 0.68,
      "to": 1.19,
      "speaker": 2
    },
    {
      "from": 1.47,
      "to": 1.93,
      "speaker": 1
    },
    {
      "from": 1.96,
      "to": 2.12,
      "speaker": 2
    },
    {
      "from": 2.12,
      "to": 2.59,
      "speaker": 2
    },
    {
      "from": 2.59,
      "to": 3.17,
      "speaker": 2
    },
    {
      "from": 4.01,
      "to": 4.30,
      "speaker": 1
    }
  ]
}

Thanks!
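For context, the target JSON above pairs word-level timestamps with speaker turns. Assuming both are already available (e.g., from forced alignment and a separate diarization model), one way to combine them is to label each word with the speaker whose segment overlaps it most. A minimal sketch, where all function names and the sample data are illustrative assumptions, not SpeechBrain APIs:

```python
import json

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_words(timestamps, segments):
    """Assign each [word, start, end] the speaker of the most-overlapping
    diarization segment, in the "speaker_labels" shape shown above."""
    labels = []
    for word, start, end in timestamps:
        best = max(segments,
                   key=lambda seg: overlap(start, end, seg["from"], seg["to"]))
        labels.append({"from": start, "to": end, "speaker": best["speaker"]})
    return labels

# Illustrative inputs: word timestamps plus raw diarization turns.
timestamps = [["hello", 0.68, 1.19], ["yeah", 1.47, 1.91]]
segments = [{"from": 0.5, "to": 1.3, "speaker": 2},
            {"from": 1.4, "to": 2.0, "speaker": 1}]

doc = {
    "results": [{"timestamps": timestamps,
                 "transcript": " ".join(w for w, _, _ in timestamps)}],
    "speaker_labels": label_words(timestamps, segments),
}
print(json.dumps(doc, indent=2))
```

Using maximum overlap (rather than, say, the word's midpoint) makes the labeling robust to small disagreements between the aligner's and the diarizer's boundaries.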

  1. Yes, if you already have the transcription (see CTC segmentation).
  2. Yes, but you'll have to write your own recipe, as this is a complex task (speaker diarization).
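To illustrate what CTC segmentation does for point 1: given the per-frame log-probabilities a CTC acoustic model emits and the known transcript, a Viterbi pass finds the most likely frame-level alignment, from which word start/end times follow. The toy sketch below is a self-contained illustration of that idea, not a SpeechBrain API; the blank id, frame duration, three-symbol vocabulary, and hand-built emissions are all assumptions:

```python
import json
import math

BLANK = 0
FRAME_SEC = 0.02  # assumed acoustic-model hop size: 20 ms per frame

def forced_align(log_probs, tokens, blank=BLANK):
    """Viterbi-align `tokens` (no blanks) to T frames of log_probs[t][v].
    Returns the CTC state index occupied at each frame."""
    # Standard CTC state sequence: blank, tok1, blank, tok2, ..., blank.
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    S, T = len(states), len(log_probs)
    NEG = float("-inf")
    dp = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][states[0]]
    dp[0][1] = log_probs[0][states[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip the blank
            # between two different non-blank tokens.
            cands = [(dp[t - 1][s], s)]
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, back[t][s] = max(cands)
            dp[t][s] = best + log_probs[t][states[s]]
    # Backtrack from the better of the two legal final states.
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]
    return path

def word_timestamps(path, words, frame_sec=FRAME_SEC):
    """Collapse the frame-level state path into [word, start, end] triples."""
    spans = {}
    for t, s in enumerate(path):
        if s % 2 == 1:  # odd states are the non-blank tokens
            w = (s - 1) // 2
            start, _ = spans.get(w, (t, t))
            spans[w] = (start, t)
    return [[words[w], round(s0 * frame_sec, 2), round((e + 1) * frame_sec, 2)]
            for w, (s0, e) in sorted(spans.items())]

# Hand-built emissions: symbol ids 0=blank, 1="hello", 2="yeah".
hit, miss = math.log(0.9), math.log(0.05)
intended = [0, 1, 1, 1, 0, 2, 2, 0]  # true symbol per frame
log_probs = [[hit if v == sym else miss for v in range(3)]
             for sym in intended]

path = forced_align(log_probs, tokens=[1, 2])
ts = word_timestamps(path, ["hello", "yeah"])
print(json.dumps({"results": [{"timestamps": ts,
                               "transcript": "hello yeah"}]}, indent=2))
```

In practice the emissions come from a trained CTC model and the words are mapped to its subword units, but the alignment-then-collapse structure is the same.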