TellWhisper: Tell Whisper Who Speaks When
Anonymous Authors
1. ABSTRACT
Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, and often degrades performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and timing within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to attend to "when" and "who" simultaneously. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach. The project webpage is available at https://tellwhisper.github.io.
2. METHOD
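As summarized in the abstract, TS-RoPE gives each frame a time coordinate and a speaker coordinate and rotates different regions of the feature vector by angles tied to each coordinate. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the `time_dims` split between a time region and a speaker region, the frequency base of 10000, the scalar `speaker_pos` value, and the function names `rotate_pairs` and `ts_rope` are all hypothetical. How the actual speaker coordinate is computed from speaker activity and pause cues is not specified on this page.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Apply a standard rotary rotation to consecutive (even, odd) feature pairs."""
    dim = len(vec)  # must be even
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency, as in ordinary RoPE.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def ts_rope(vec, time_pos, speaker_pos, time_dims):
    """Rotate the first `time_dims` features by the time coordinate and the
    remaining features by the speaker coordinate (assumed region layout)."""
    time_part = rotate_pairs(vec[:time_dims], time_pos)
    speaker_part = rotate_pairs(vec[time_dims:], speaker_pos)
    return time_part + speaker_part


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key features for two frames.
q = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
k = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
score = dot(ts_rope(q, 3, 1, 4), ts_rope(k, 7, 2, 4))
```

Because each region is an ordinary rotary rotation, the score between a query and a key in this sketch depends only on the differences between their time coordinates and between their speaker coordinates, so a single attention score can be sensitive to both "when" and "who".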
3. EXPERIMENTS
3.1 Examples of TellWhisper on Different Datasets
Sample #1:
Speech (from Libri2Mix)



Ground Truth
spk_1 <|0.00|> <|5.32|> i gave him a little brandy and left him collapsed in a chair while i made a most careful examination of the room
spk_2 <|0.00|> <|4.72|> not at all you are on the contrary most agreeable to me
TellWhisper
spk_1 <|0.00|> <|5.32|> i gave him a little andy and left him collapsed in a chair where i made a most careful examination of the room
spk_2 <|0.00|> <|4.82|> not at all you are on the contrary most agreeable to me
Sample #2:
Speech (from Libri2Mix)



Ground Truth
spk_1 <|0.00|> <|16.12|> at that epoch of pristine simplicity however matters of even slighter public interest and of far less intrinsic weight than the welfare of hester and her child were strangely mixed up with the deliberations of legislators and acts of state
spk_2 <|0.00|> <|8.05|> they must have some characteristic which makes us regard them as referring to more or less remote portions of the past
TellWhisper
spk_1 <|0.00|> <|16.12|> at that epoch of pristine simplicity however matters of even slighter public interest and of far less intrinsic weight than the welfare of hester and her child were strangely mixed up with the deliberations of legislators and acts of state
spk_2 <|0.00|> <|8.14|> they must have some characteristic which makes us regard them as referring to a more or less remote portion of the past
Sample #3:
Speech (from AMI)



Ground Truth
spk_1 <|0.00|> <|1.37|> funky sh stuff like that
spk_2 <|0.59|> <|6.48|> wonder how much of the meetings is talking about the stuff at the meetings yeah
spk_3 <|3.21|> <|5.01|> yeah exactly yeah yeah yeah
spk_1 <|3.38|> <|11.76|> yeah yeah look at all this stuff man okay right
spk_4 <|8.23|> <|8.58|> okay
spk_3 <|12.83|> <|13.29|> should we
spk_4 <|11.95|> <|14.85|> so what do we need to talk about
spk_1 <|14.61|> <|15.11|> has anybody
spk_3 <|14.84|> <|24.85|> well should we just go around and everyone says what they d what they've been doing how far they've well i hope so
spk_1 <|18.71|> <|23.32|> has anybody done anything got
spk_2 <|21.10|> <|22.92|> not a lot no
spk_4 <|25.39|> <|25.78|> yeah
spk_2 <|25.13|> <|29.40|> hmm okay sounds like you've done some stuff so
spk_1 <|28.06|> <|29.53|> you better start
TellWhisper
spk_1 <|0.00|> <|1.32|> funky stuff like that
spk_2 <|0.58|> <|6.39|> i wonder how much of the meetings is talking about the stuff at the meetings
spk_3 <|3.22|> <|5.06|> yeah exactly yeah yeah yeah
spk_1 <|4.58|> <|11.76|> yeah look at all this stuff man okay right
spk_3 <|12.81|> <|13.53|> should we
spk_4 <|11.93|> <|14.88|> so we need to talk about it
spk_1 <|14.61|> <|15.11|> has anybody
spk_3 <|14.84|> <|24.81|> well should we just go around and everyone says what they what they've been doing how far they've well i hope so
spk_1 <|18.21|> <|23.12|> has anybody done anything got
spk_2 <|21.16|> <|29.55|> not a lot no yeah okay sounds like you've done some stuff so
Sample #4:
Speech (from AMI)



Ground Truth
spk_1 <|0.00|> <|3.66|> the double click will do that and so save them the trouble of right clicking and choosing the item on the menu
spk_2 <|0.02|> <|0.86|> yeah
spk_2 <|3.74|> <|7.89|> i don't see there's anything obvious that was that would be able you know that would be
spk_3 <|3.85|> <|4.21|> yeah
spk_1 <|7.00|> <|10.28|> well it might come to us as we start playing with it
spk_3 <|10.36|> <|10.66|> yeah
spk_4 <|10.47|> <|12.54|> what was that i didn't quite understand do you mean
spk_1 <|12.39|> <|23.70|> um like we have the the right click menu you'll have the menu right click it but instead of like have a default double click so if there's a de so it'll be a choice from the menu but yeah exactly yeah
spk_3 <|12.81|> <|13.52|> uh just let th
spk_4 <|14.63|> <|15.07|> yeah
spk_4 <|19.16|> <|25.01|> oh so and it figures out what's the most common use to double click oh okay
spk_3 <|19.61|> <|19.89|> yep
spk_3 <|22.57|> <|22.81|> yeah
spk_3 <|24.65|> <|28.47|> like i don't know show the speaker characterisation for instance or just
spk_1 <|27.03|> <|27.33|> yeah
spk_2 <|27.30|> <|29.99|> yeah but then then there's the problem with a lot of windows
spk_4 <|27.52|> <|27.88|> yeah
spk_3 <|28.47|> <|28.79|> i don't know
TellWhisper
spk_1 <|0.00|> <|3.89|> the double click won't do that and so save them the trouble of right clicking and choosing the item on the menu
spk_2 <|0.04|> <|0.86|> yeah
spk_2 <|3.76|> <|7.72|> i don't see there's anything obvious that wou that would be able you know that would be
spk_3 <|3.76|> <|4.20|> yeah
spk_1 <|7.06|> <|10.34|> well you might come to us and start playing with it
spk_3 <|10.30|> <|10.66|> yeah
spk_4 <|10.47|> <|12.54|> what was that other coins then do you mean is it yeah
spk_1 <|12.42|> <|24.82|> um like we have the the right click menu if you have the menu right clicking but instead of like have a default double click so there's still be a choice from the menu but yeah exactly yeah
spk_3 <|12.86|> <|13.62|> uh just let the
spk_4 <|19.12|> <|25.00|> oh sorry and it figures out what's the most common used double click okay okay
spk_3 <|22.55|> <|22.78|> yeah
spk_3 <|24.63|> <|28.48|> like i don't know show the speaker characterisation for instance or just
spk_2 <|27.31|> <|29.98|> yeah but then then there's the problem with a lot of windows
spk_4 <|27.53|> <|27.98|> yeah
spk_3 <|28.27|> <|28.56|> i don't know
Sample #5:
Speech (from NotSoFar)



Ground Truth
spk_1 <|0.00|> <|2.96|> know we forget about them so what do you say uh linda
spk_2 <|2.43|> <|3.38|> forget
spk_2 <|3.65|> <|4.27|> about whom
spk_1 <|4.00|> <|4.88|> the older people
spk_2 <|4.87|> <|6.85|> oh the older people yes yes of course
spk_1 <|6.61|> <|9.59|> they need uh we need we need it's not only children and uh
spk_2 <|7.18|> <|7.81|> of course
spk_2 <|9.59|> <|10.06|> right
spk_1 <|9.61|> <|11.87|> and greenery it's also uh their place
spk_3 <|11.09|> <|13.06|> so what would you suggest rachel
spk_1 <|12.54|> <|13.67|> what do you say ron
spk_4 <|13.96|> <|23.50|> well i think yeah i think about the children first you know that uh that there needs to be playgrounds there needs to be different areas where they can run around
spk_1 <|22.54|> <|22.89|> we
spk_1 <|23.70|> <|24.06|> this
spk_4 <|23.90|> <|24.32|> um
spk_1 <|24.08|> <|29.81|> but this is automatic this is what we do no matter of what what i was trying to to bring in here is
TellWhisper
spk_1 <|0.00|> <|2.90|> know we forget about that so what do you say uh inter
spk_2 <|2.50|> <|4.24|> forget about who
spk_1 <|3.98|> <|5.00|> the older people
spk_2 <|4.88|> <|7.66|> oh the older people yes yes of course of course
spk_1 <|6.62|> <|13.66|> they need uh we need we need it's not only children and uh and greenery it's also uh their place what do you say wrong
spk_2 <|9.56|> <|9.88|> right
spk_3 <|11.18|> <|13.04|> so what would you suggest rachel
spk_4 <|13.96|> <|24.32|> uh i think yeah i think about the children first you know that uh that there needs to be playgrounds there needs to be uh different areas where they can run around um
spk_1 <|22.66|> <|29.80|> we this but this is automatic this is what we do no matter of what what i was trying to to bring in here is
Sample #6:
Speech (from NotSoFar)



Ground Truth
spk_1 <|0.00|> <|9.83|> say ron set a list and uh and uh submit it on on this meeting to the uh to the people that really do uh put it together
spk_2 <|9.81|> <|15.49|> yeah i mean let's think about the environment that we're in because you know it's uh it's a hot area
spk_3 <|9.97|> <|10.44|> mmm
spk_1 <|15.54|> <|15.97|> yes
spk_2 <|15.84|> <|21.31|> you know having during you know uh an area during the summer that can be used let's say with
spk_4 <|21.27|> <|22.43|> a shaded
spk_2 <|21.37|> <|21.40|> uh
spk_1 <|21.73|> <|23.35|> absolutely the shade
spk_2 <|21.76|> <|23.93|> with with shade and but also
spk_3 <|21.77|> <|22.45|> shade
spk_3 <|22.62|> <|24.66|> shade is very important yeah
spk_2 <|24.37|> <|26.87|> but also with with water so
spk_4 <|24.96|> <|26.12|> also water
spk_3 <|25.02|> <|25.31|> yeah
spk_3 <|26.11|> <|26.50|> mmm
spk_1 <|26.39|> <|28.99|> and and pets we didn't talk about the pets
spk_4 <|26.45|> <|27.07|> yeah
spk_3 <|28.65|> <|29.67|> of course there
spk_1 <|29.34|> <|29.94|> this is
TellWhisper
spk_1 <|0.00|> <|9.86|> say ron uh center list and uh and uh submit it on this meeting to the uh to the people that really do uh put it together
spk_2 <|9.74|> <|15.50|> yeah and let's think about the environment that we're in because you know it's uh it's a hot area
spk_3 <|10.14|> <|10.60|> mmm
spk_1 <|15.52|> <|16.04|> yes
spk_2 <|15.90|> <|21.32|> you know having during you know uh an area during the summer that can be used let's say with
spk_4 <|21.32|> <|22.20|> the shaded
spk_2 <|21.37|> <|23.94|> uh with shade and but also
spk_1 <|21.72|> <|23.35|> absolutely the shade
spk_3 <|21.77|> <|24.68|> shade shade is very important yeah
spk_2 <|24.33|> <|26.78|> but also with with water so
spk_4 <|24.94|> <|26.11|> also water
spk_3 <|25.12|> <|25.36|> yeah
spk_1 <|26.41|> <|29.94|> and the pets we didn't talk about the pets this is
spk_3 <|28.65|> <|29.67|> of course there
Sample #7:
Speech (from LibriCSS)



Ground Truth
spk_1 <|0.00|> <|8.30|> again i thank you this incident i suppose will be renewed no more if i live to be an old woman i shall remember it thirty years hence as a bright dream
spk_2 <|11.20|> <|16.15|> she is under sail but she is count timascheff's yacht he was right
spk_3 <|19.06|> <|22.29|> i'll try if i know all the things i used to know
TellWhisper
spk_1 <|0.00|> <|8.35|> again i thank you this incident i suppose will be renewed no more if i live to be an old woman i shall remember it thirty years hence as a bright dream
spk_2 <|11.13|> <|16.20|> she is under sail but she is count timascheff's yacht he was right
spk_3 <|19.03|> <|22.11|> i'll try if i know all the things i used to know
Sample #8:
Speech (from LibriCSS)



Ground Truth
spk_1 <|0.00|> <|6.38|> come come said holmes kindly it is human to err and at least no one can accuse you of being a callous criminal
spk_2 <|6.26|> <|11.03|> what is your country olaf have you always been a thrall the thrall's eyes flashed
spk_2 <|11.10|> <|13.09|> i painted the eyes red for anger
spk_3 <|12.73|> <|26.39|> before the settlement of terms the administration must be possessed entirely by the parliaments of both kingdoms and how incompatible that scheme with the liberty of the king is easily imagined
spk_4 <|24.51|> <|29.96|> i wish i hadn't cried so much said alice as she swam about trying to find her way out
TellWhisper
spk_1 <|0.00|> <|6.19|> come come said holmes kindly it is human to err and at least no one can accuse you of being a callous criminal
spk_2 <|6.25|> <|11.13|> what is your country olaf have you always been a thrall the thrall's eyes flashed
spk_2 <|11.10|> <|13.12|> i painted the eyes red for anger
spk_3 <|12.78|> <|26.28|> before the settlement of terms the administration must be possessed entirely by the parliaments of both kingdoms and how incompatible that scheme with the liberty of the king is easily imagined
spk_4 <|24.53|> <|29.94|> i wish hadn't cried so much said alice as she swam about trying to find her way out
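The transcript lines above share one layout: a speaker label, a start time, an end time, and the text. For readers who want to compare ground-truth and TellWhisper segments programmatically, here is a small sketch that parses that layout. The `Segment` type and the `parse_segment` and `overlaps` helpers are hypothetical names introduced for illustration and are not part of any released TellWhisper tooling; timestamps are assumed to be in seconds.

```python
import re
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    speaker: str
    start: float  # assumed to be seconds
    end: float    # assumed to be seconds
    text: str


# Matches lines such as: spk_2 <|0.00|> <|4.72|> not at all ...
_LINE = re.compile(r"^(spk_\d+)\s+<\|(\d+\.\d+)\|>\s+<\|(\d+\.\d+)\|>\s+(.*)$")


def parse_segment(line: str) -> Optional[Segment]:
    """Parse one transcript line; return None for non-segment lines."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    speaker, start, end, text = match.groups()
    return Segment(speaker, float(start), float(end), text)


def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two segments share some span of time."""
    return a.start < b.end and b.start < a.end


ref = parse_segment("spk_2 <|0.00|> <|4.72|> not at all you are on the contrary most agreeable to me")
hyp = parse_segment("spk_2 <|0.00|> <|4.82|> not at all you are on the contrary most agreeable to me")
end_shift = hyp.end - ref.end  # boundary difference between output and reference
```

For Sample #1, this gives an end-boundary difference of roughly 0.10 s for `spk_2`, with identical text.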
3.2 Examples of Hyper-SD on Different Datasets
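Hyper-SD, as summarized in the abstract, casts speaker classification in hyperbolic space to enhance inter-class separation. To make the geometric idea concrete, the sketch below assumes the Poincaré ball model and a nearest-prototype decision rule. Neither choice is confirmed by this page, and the names `poincare_distance` and `classify`, as well as the prototype coordinates, are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def classify(embedding, prototypes):
    """Return the label whose prototype is nearest in hyperbolic distance
    (an assumed decision rule, used here only for illustration)."""
    return min(prototypes, key=lambda label: poincare_distance(embedding, prototypes[label]))


# Hypothetical 2-D class prototypes; points near the boundary are far apart.
prototypes = {
    "silence": [0.0, 0.0],
    "spk_1": [0.8, 0.0],
    "spk_2": [-0.8, 0.0],
}
label = classify([0.7, 0.1], prototypes)
```

In this model, distances grow rapidly toward the boundary of the ball, so prototypes placed near the boundary are separated by much more than their Euclidean gap suggests. That is one plausible reading of how hyperbolic geometry could help inter-class separation.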
Sample #1:
Speech (from AISHELL4)



Ground Truth
Hyper-SD
Sample #2:
Speech (from AISHELL4)



Ground Truth
Hyper-SD
Sample #3:
Speech (from AliMeeting)



Ground Truth
Hyper-SD
Sample #4:
Speech (from AliMeeting)



Ground Truth
Hyper-SD
Sample #5:
Speech (from AMI)



Ground Truth
Hyper-SD
Sample #6:
Speech (from AMI)



Ground Truth
Hyper-SD
Sample #7:
Speech (from MSDWild)



Ground Truth
Hyper-SD
Sample #8:
Speech (from MSDWild)



Ground Truth
Hyper-SD
Sample #9:
Speech (from RAMC)



Ground Truth
Hyper-SD
Sample #10:
Speech (from RAMC)



Ground Truth
Hyper-SD
Sample #11:
Speech (from VoxConverse)



Ground Truth
Hyper-SD
Sample #12:
Speech (from VoxConverse)



Ground Truth
Hyper-SD