1st generation of SpeechLMM models, capable of ingesting video, audio and text and generate text as output. From the Meetween consortium (meetween.eu)