LIST
1. ABSTRACT
Multi-speaker automatic speech recognition (MASR) aims to predict ''who spoke when and what '' from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing ''when '' and ''who '': some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to simultaneously attend to ''when '' and ''who ''. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach. The project webpage is available at https://tellwhisper.github.io.
Sample #1:
| Speech (from Libri2Mix) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|5.32|> i gave him a little brandy and left him collapsed in a chair while i made a most careful examination of the room
spk_2 <|0.00|> <|4.72|> not at all you are on the contrary most agreeable to me |
| TellWhisper | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|5.32|> i gave him a little andy and left him collapsed in a chair where i made a most careful examination of the room
spk_2 <|0.00|> <|4.82|> not at all you are on the contrary most agreeable to me |
Sample #2:
| Speech (from Libri2Mix) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
| spk_1 <|0.00|> <|16.12|> at that epoch of pristine simplicity however matters of even slighter public interest and of far less intrinsic weight than the welfare of hester and her child were strangely mixed up with the deliberations of legislators and acts of state spk_2 <|0.00|> <|8.05|> they must have some characteristic which makes us regard them as referring to more or less remote portions of the past |
| TellWhisper | ||||
|---|---|---|---|---|
| spk_1 <|0.00|> <|16.12|> at that epoch of pristine simplicity however matters of even slighter public interest and of far less intrinsic weight than the welfare of hester and her child were strangely mixed up with the deliberations of legislators and acts of state spk_2 <|0.00|> <|8.14|> they must have some characteristic which makes us regard them as referring to a more or less remote portion of the past |
Sample #3:
| Speech (from AMI) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|1.37|> funky sh stuff like that spk_2 <|0.59|> <|6.48|> wonder how much of the meetings is talking about the stuff at the meetings yeah spk_3 <|3.21|> <|5.01|> yeah exactly yeah yeah yeah spk_1 <|3.38|> <|11.76|> yeah yeah look at all this stuff man okay right spk_4 <|8.23|> <|8.58|> okay spk_3 <|12.83|> <|13.29|> should we spk_4 <|11.95|> <|14.85|> so what do we need to talk about spk_1 <|14.61|> <|15.11|> has anybody spk_3 <|14.84|> <|24.85|> well should we just go around and everyone says what they d what they've been doing how far they've well i hope so spk_1 <|18.71|> <|23.32|> has anybody done anything got spk_2 <|21.10|> <|22.92|> not a lot no spk_4 <|25.39|> <|25.78|> yeah spk_2 <|25.13|> <|29.40|> hmm okay sounds like you've done some stuff so spk_1 <|28.06|> <|29.53|> you better start |
| TellWhisper |
spk_1 <|0.00|> <|1.32|> funky stuff like that spk_2 <|0.58|> <|6.39|> i wonder how much of the meetings is talking about the stuff at the meetings spk_3 <|3.22|> <|5.06|> yeah exactly yeah yeah yeah spk_1 <|4.58|> <|11.76|> yeah look at all this stuff man okay right spk_3 <|12.81|> <|13.53|> should we spk_4 <|11.93|> <|14.88|> so we need to talk about it spk_1 <|14.61|> <|15.11|> has anybody spk_3 <|14.84|> <|24.81|> well should we just go around and everyone says what they what they've been doing how far they've well i hope so spk_1 <|18.21|> <|23.12|> has anybody done anything got spk_2 <|21.16|> <|29.55|> not a lot no yeah okay sounds like you've done some stuff so |
|---|
Sample #4:
| Speech (from AMI) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|3.66|> the double click will do that and so save them the trouble of right clicking and choosing the item on the menu spk_2 <|0.02|> <|0.86|> yeah spk_2 <|3.74|> <|7.89|> i don't see there's anything obvious that was that would be able you know that would be spk_3 <|3.85|> <|4.21|> yeah spk_1 <|7.00|> <|10.28|> well it might come to us as we start playing with it spk_3 <|10.36|> <|10.66|> yeah spk_4 <|10.47|> <|12.54|> what was that i didn't quite understand do you mean spk_1 <|12.39|> <|23.70|> um like we have the the right click menu you'll have the menu right click it but instead of like have a default double click so if there's a de so it'll be a choice from the menu but yeah exactly yeah spk_3 <|12.81|> <|13.52|> uh just let th spk_4 <|14.63|> <|15.07|> yeah spk_4 <|19.16|> <|25.01|> oh so and it figures out what's the most common use to double click oh okay spk_3 <|19.61|> <|19.89|> yep spk_3 <|22.57|> <|22.81|> yeah spk_3 <|24.65|> <|28.47|> like i don't know show the speaker characterisation for instance or just spk_1 <|27.03|> <|27.33|> yeah spk_2 <|27.30|> <|29.99|> yeah but then then there's the problem with a lot of windows spk_4 <|27.52|> <|27.88|> yeah spk_3 <|28.47|> <|28.79|> i don't know |
| TellWhisper |
spk_1 <|0.00|> <|3.89|> the double click won't do that and so save them the trouble of right clicking and choosing the item on the menu spk_2 <|0.04|> <|0.86|> yeah spk_2 <|3.76|> <|7.72|> i don't see there's anything obvious that wou that would be able you know that would be spk_3 <|3.76|> <|4.20|> yeah spk_1 <|7.06|> <|10.34|> well you might come to us and start playing with it spk_3 <|10.30|> <|10.66|> yeah spk_4 <|10.47|> <|12.54|> what was that other coins then do you mean is it yeah spk_1 <|12.42|> <|24.82|> um like we have the the right click menu if you have the menu right clicking but instead of like have a default double click so there's still be a choice from the menu but yeah exactly yeah spk_3 <|12.86|> <|13.62|> uh just let the spk_4 <|19.12|> <|25.00|> oh sorry and it figures out what's the most common used double click okay okay spk_3 <|22.55|> <|22.78|> yeah spk_3 <|24.63|> <|28.48|> like i don't know show the speaker characterisation for instance or just spk_2 <|27.31|> <|29.98|> yeah but then then there's the problem with a lot of windows spk_4 <|27.53|> <|27.98|> yeah spk_3 <|28.27|> <|28.56|> i don't know |
|---|
Sample #5:
| Speech (from NotSoFar) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|2.96|> know we forget about them so what do you say uh linda spk_2 <|2.43|> <|3.38|> forget spk_2 <|3.65|> <|4.27|> about whom spk_1 <|4.00|> <|4.88|> the older people spk_2 <|4.87|> <|6.85|> oh the older people yes yes of course spk_1 <|6.61|> <|9.59|> they need uh we need we need it's not only children and uh spk_2 <|7.18|> <|7.81|> of course spk_2 <|9.59|> <|10.06|> right spk_1 <|9.61|> <|11.87|> and greenery it's also uh their place spk_3 <|11.09|> <|13.06|> so what would you suggest rachel spk_1 <|12.54|> <|13.67|> what do you say ron spk_4 <|13.96|> <|23.50|> well i think yeah i think about the children first you know that uh that there needs to be playgrounds there needs to be different areas where they can run around spk_1 <|22.54|> <|22.89|> we spk_1 <|23.70|> <|24.06|> this spk_4 <|23.90|> <|24.32|> um spk_1 <|24.08|> <|29.81|> but this is automatic this is what we do no matter of what what i was trying to to bring in here is |
| TellWhisper |
spk_1 <|0.00|> <|2.90|> know we forget about that so what do you say uh inter spk_2 <|2.50|> <|4.24|> forget about who spk_1 <|3.98|> <|5.00|> the older people spk_2 <|4.88|> <|7.66|> oh the older people yes yes of course of course spk_1 <|6.62|> <|13.66|> they need uh we need we need it's not only children and uh and greenery it's also uh their place what do you say wrong spk_2 <|9.56|> <|9.88|> right spk_3 <|11.18|> <|13.04|> so what would you suggest rachel spk_4 <|13.96|> <|24.32|> uh i think yeah i think about the children first you know that uh that there needs to be playgrounds there needs to be uh different areas where they can run around um spk_1 <|22.66|> <|29.80|> we this but this is automatic this is what we do no matter of what what i was trying to to bring in here is |
|---|
Sample #6:
| Speech (from NotSoFar) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|9.83|> say ron set a list and uh and uh submit it on on this meeting to the uh to the people that really do uh put it together spk_2 <|9.81|> <|15.49|> yeah i mean let's think about the environment that we're in because you know it's uh it's a hot area spk_3 <|9.97|> <|10.44|> mmm spk_1 <|15.54|> <|15.97|> yes spk_2 <|15.84|> <|21.31|> you know having during you know uh an area during the summer that can be used let's say with spk_4 <|21.27|> <|22.43|> a shaded spk_2 <|21.37|> <|21.40|> uh spk_1 <|21.73|> <|23.35|> absolutely the shade spk_2 <|21.76|> <|23.93|> with with shade and but also spk_3 <|21.77|> <|22.45|> shade spk_3 <|22.62|> <|24.66|> shade is very important yeah spk_2 <|24.37|> <|26.87|> but also with with water so spk_4 <|24.96|> <|26.12|> also water spk_3 <|25.02|> <|25.31|> yeah spk_3 <|26.11|> <|26.50|> mmm spk_1 <|26.39|> <|28.99|> and and pets we didn't talk about the pets spk_4 <|26.45|> <|27.07|> yeah spk_3 <|28.65|> <|29.67|> of course there spk_1 <|29.34|> <|29.94|> this is |
| TellWhisper |
spk_1 <|0.00|> <|9.86|> say ron uh center list and uh and uh submit it on this meeting to the uh to the people that really do uh put it together spk_2 <|9.74|> <|15.50|> yeah and let's think about the environment that we're in because you know it's uh it's a hot area spk_3 <|10.14|> <|10.60|> mmm spk_1 <|15.52|> <|16.04|> yes spk_2 <|15.90|> <|21.32|> you know having during you know uh an area during the summer that can be used let's say with spk_4 <|21.32|> <|22.20|> the shaded spk_2 <|21.37|> <|23.94|> uh with shade and but also spk_1 <|21.72|> <|23.35|> absolutely the shade spk_3 <|21.77|> <|24.68|> shade shade is very important yeah spk_2 <|24.33|> <|26.78|> but also with with water so spk_4 <|24.94|> <|26.11|> also water spk_3 <|25.12|> <|25.36|> yeah spk_1 <|26.41|> <|29.94|> and the pets we didn't talk about the pets this is spk_3 <|28.65|> <|29.67|> of course there |
|---|
Sample #7:
| Speech (from LibriCSS) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|8.30|> again i thank you this incident i suppose will be renewed no more if i live to be an old woman i shall remember it thirty years hence as a bright dream spk_2 <|11.20|> <|16.15|> she is under sail but she is count timascheff's yacht he was right spk_3 <|19.06|> <|22.29|> i'll try if i know all the things i used to know |
| TellWhisper |
spk_1 <|0.00|> <|8.35|> again i thank you this incident i suppose will be renewed no more if i live to be an old woman i shall remember it thirty years hence as a bright dream spk_2 <|11.13|> <|16.20|> she is under sail but she is count timascheff's yacht he was right spk_3 <|19.03|> <|22.11|> i'll try if i know all the things i used to know |
|---|
Sample #8:
| Speech (from LibriCSS) | |||
|---|---|---|---|
| Ground Truth | ||||
|---|---|---|---|---|
|
spk_1 <|0.00|> <|6.38|> come come said holmes kindly it is human to err and at least no one can accuse you of being a callous criminal spk_2 <|6.26|> <|11.03|> what is your country olaf have you always been a thrall the thrall's eyes flashed spk_2 <|11.10|> <|13.09|> i painted the eyes red for anger spk_3 <|12.73|> <|26.39|> before the settlement of terms the administration must be possessed entirely by the parliaments of both kingdoms and how incompatible that scheme with the liberty of the king is easily imagined spk_4 <|24.51|> <|29.96|> i wish i hadn't cried so much said alice as she swam about trying to find her way out |
| TellWhisper |
spk_1 <|0.00|> <|6.19|> come come said holmes kindly it is human to err and at least no one can accuse you of being a callous criminal spk_2 <|6.25|> <|11.13|> what is your country olaf have you always been a thrall the thrall's eyes flashed spk_2 <|11.10|> <|13.12|> i painted the eyes red for anger spk_3 <|12.78|> <|26.28|> before the settlement of terms the administration must be possessed entirely by the parliaments of both kingdoms and how incompatible that scheme with the liberty of the king is easily imagined spk_4 <|24.53|> <|29.94|> i wish hadn't cried so much said alice as she swam about trying to find her way out |
|---|
Sample #1:
| Speech (from AISHELL4) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #2:
| Speech (from AISHELL4) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #3:
| Speech (from AliMeeting) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #4:
| Speech (from AliMeeting) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #5:
| Speech (from AMI) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #6:
| Speech (from AMI) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #7:
| Speech (from MSDWild) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #8:
| Speech (from MSDWild) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #9:
| Speech (from RAMC) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #10:
| Speech (from RAMC) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #11:
| Speech (from VoxConverse) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|
Sample #12:
| Speech (from VoxConverse) | |||
|---|---|---|---|
| Ground Truth |
|
|---|
| Hyper-SD |
|
|---|