# Wayland input method project post-mortem
Input methods on Wayland have a bit of a chicken-and-egg problem.
That topic has been my main occupation for the past year, thanks to NLNet, and thanks to the fact that I was a cause of some of the problems in the first place.
But the time has come to move on, and take a less active role. So here's a summary of what I did, what I didn't do, and what I think about the whole topic.
## Chicken and egg
Long story short, the chicken is users and the egg is implementations.
An input method is a very user-facing thing. It's already in the name: it's used for input. It's for the user to interact with the computer. Let the smartphone user type, the Japanese speaker write Kanji. Propose autocompletion from a dictionary. Use the right layout or language. Stay out of the way when unneeded.
For such a user-focused feature, the work I did was exceptionally user-unfocused. Tech demos instead of user testing. There was no one to use what I did every day, or even show me an example of a daily task. What I created isn't enough even for myself. I couldn't empirically evaluate if what I did was an improvement.
Why?
It comes down to manpower and demand.
### Manpower
Most of the application software people use daily is built with the GTK or Qt toolkits. It most likely communicates with the input method through Mutter or KWin as the compositor. The third component relevant to me is an Input Method Engine (IME), like fcitx or ibus.
To test my work on someone's daily workflow, I'd have to modify all 3 components and find a user who uses that specific combination. Then take them through the trouble of setting up the patched versions on their system.
It's not impossible, but for a single person, it's infeasible. Right from the start of this project, taught by the lessons of implementing text-input in GTK and wlroots, I knew that I'm too stupid to write C/C++ code, and that I can't afford the time to debug what I would have written. So I selected a Rust-based stack of smithay, floem, and, uh, my own input method, because I couldn't find any (now I know about kime – a bit late, though). Sadly, pretty much no one is using that combo, but at least it made proofs of concept possible.
### Demand
But that's not the only way to get a working stack for users to test. What if someone else did the work? What if the relevant projects decided to implement my experiments themselves, to let their users test? I submitted all the protocols as official experiments in wayland-protocols, after all!
But that hasn't happened.
What gives? My suspicion is that most of the time, there isn't an expert in the area who could do the work. The rest of the time, the improvements aren't seen as worth the extra effort. Users aren't burning developers' effigies, demanding better Kanji. Linux mobile users aren't flooding bug trackers with complaints that the keyboard shoves their apps around for no good reason (it totally does). Linux Mobile is Android, after all, right? (No, it's not.) Android doesn't use Wayland.
A bit more puzzling is the relative silence of the Chinese users. But only a bit. Think about it: they are on the other side of The Great Firewall. The other side of the English-Chinese language barrier. The other side of the chasm between the Western and the Eastern cultures. How can we *not* expect to see a disconnect?
#### Good enough
I think the most important reason driving this lack of demand is that the existing solution is good enough. Input methods on Wayland work 90% of the time. They fall apart only 10% of the time, and everyone knows the last 10% is 90% of the work.
Unfortunately, that last 10% also contains the difference between "kinda sorta works" and "a pleasure to use". The latter has been my motivation for the past year. And common programmer's hubris. *Are you saying, I can't do that???*
So let's take a look at the 90% we already have, and then move on to guess at what the remaining 10% could look like.
## The 90%
After reading this, you'll have a pretty good idea what each feature is necessary for.
I'll mark features relevant to the mobile use case with an **M**, and those relevant for input methods with an **I**.
- **IM**: Send text
- **IM**: Select a language
- **IM**: Announce text purpose
- **IM**: Display pre-edit
- **IM**: Less ambiguity in corner cases
- **IM**: Compatibility
- **M**: Display a panel
- **M**: Cursor navigation
- **I**: Send key events
- **I**: Grab keyboard
- **I**: Style the pre-edit
- **I**: Display a popup
### Common features
**Purpose and language selection** inform the input method what's expected inside the text field. This is such a generic thing that, honestly, it could be useful even with just a keyboard and nothing else. Automatic layout switching between windows is already a thing; with the input method protocols, it could happen per text input field, determined by the application.
**Pre-edit** is a piece of text that you just typed, but which may change and is not yet considered final.
Chinese, Japanese, and Korean (CJK) input methods use it to help compose characters.
On my Maemo, it's used to display the most likely autocompletion.

### Features mostly useful for CJK input method
CJK IMEs often show a **popup window** next to the typed text, with several candidate outcomes. This is called a candidate window, and the protocol needs to allow for it.

When the IME is paired with a keyboard, it needs to prevent the application from getting some key events. It wouldn't be helpful if, when you typed "[kou](https://en.wikipedia.org/wiki/W%C4%81puro_r%C5%8Dmaji)", the text editor received "kou". You want to stop those events and give them to the IME, which can turn them into "こう". This is what the **keyboard grab** is for. At the same time, you still want to forward events like "Ctrl+C" to the application, and that's why we have **sending key events**.
### Features mostly useful for mobile text input
No matter if you're using a QWERTY input method or MessagEase, on your mobile, you need to display a special **panel** to be able to press the buttons. (If you thought about T9, you can feel smug now. You still need an input method, though.)

Hitting an exact position in the text with a finger that's 3× the size of an on-screen letter is still hard, though, and application toolkits are failing to make it usable. So **cursor navigation**, which makes it possible for an input method author to come up with something better, remains relevant.
### Complication
That would be about it. Except those features are a bit spread across different protocols, some of which aren't even sanctioned officially. So you can't have both popups and a panel on your system at the same time.
And what about the other 10%?
## The other 90%
If the results of my work as they are today were adopted and displaced all the existing protocols, we'd get the following improvements:
### CJK IME
Control over the **position of the popup** window. Maybe you are typing right-to-left. Maybe up-to-down. I don't know, but the experimental popup uses the same positioner mechanism as the popups you see after right clicking. In this protocol, you can even have **multiple popups**!
One thing that's actually not an improvement: a dedicated **panel** window is missing from the experimental protocols. That's an intentional oversight coming from lack of manpower. Sorry. Layer-shell works OK for me.
Another missing feature is **pre-edit styling**. That one is missing because no user ever came to me to chat about how they use it and why it's important. And less features is less complexity = more better. Lack of demand in action.
Then, there's no **language information** sent from the application to the IME. That's an oversight that should be fixed, although in such a way that doesn't favor the officially recognized languages, but instead lets the speakers define the language they use, no matter how obscure or disorganized, without having to beg an authority for recognition.
#### No more keyboard grab. No more generating keyboard events
This is another missing feature, but it's actually an improvement!
You see, behind keyboard event handling hides a realm where things – mostly key maps – wait for any misstep you make to turn your day into a hell of edge cases.
It's a source of real-life bugs that have been reported to me unofficially, as "the Wayland input method authority".
Consider the happy path: the user presses Ctrl+B. The IME forwards the key press to the application, and the application makes the text bold.
Let's make things more complicated. The user presses Ctrl+Б (same key as Ctrl+B, just on a Russian Phonetic keyboard). The input method forwards the Б keysym; the application is confused and can't figure it out, because it actually listens for key code 56.
Maybe we should send key codes then? Okay: the user presses Ctrl+Б (Russian Phonetic). The input method engine forwards the key code for Ctrl+Б, but - oops - in the regular Russian key map, which is key code 59. Not 56. The application is confused.
Those examples represent real bugs. The first application could translate the keysym back to a key code (but what if there are two valid solutions?). The second input method engine could use the correct key map. But why open the door to all those bugs when all you really want is to forward the exact events the user already pressed?
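The round-trip mismatch above can be sketched in plain Rust. This is a toy model with made-up keymap tables (real keymaps come from XKB, and the key code values here are for illustration only):

```rust
use std::collections::HashMap;

// Hypothetical keymap tables (key code -> keysym). Real keymaps come
// from XKB; these entries are made up for illustration only.
fn keymap(name: &str) -> HashMap<u32, &'static str> {
    match name {
        "us" => HashMap::from([(56, "b")]),
        "ru-phonetic" => HashMap::from([(56, "Cyrillic_be")]),
        "ru" => HashMap::from([(59, "Cyrillic_be")]),
        _ => HashMap::new(),
    }
}

// The IME translates the user's key code to a keysym under its own keymap,
// then the application translates it back under a *different* keymap.
fn roundtrip(keycode: u32, ime_map: &str, app_map: &str) -> Option<u32> {
    let sym = keymap(ime_map).get(&keycode).copied()?;
    keymap(app_map)
        .into_iter()
        .find(|&(_, s)| s == sym)
        .map(|(code, _)| code)
}

fn main() {
    // The user presses key code 56 on the Russian Phonetic layout, but the
    // application interprets keys with the regular Russian keymap: the
    // keysym round-trips to key code 59, not the 56 that was pressed.
    assert_eq!(roundtrip(56, "ru-phonetic", "ru"), Some(59));
    // And under a keymap that lacks the keysym, the event is lost entirely.
    assert_eq!(roundtrip(56, "ru-phonetic", "us"), None);
}
```

The point of the sketch: any translation step between key codes and keysyms is lossy whenever the two sides don't share a keymap, which is exactly what forwarding exact events avoids.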
So the experimental input method protocol provides a filtering mechanism instead. The IME can choose to intercept button presses that help typing your Kanji. Those will then not reach the application, and things work as expected.
As a bonus, this makes it easier to support improvements to the keyboard protocol like server-side **key repeat**, which solves more problems.
#### No more generating keyboard events, again
I hear you scream: "foul! Stop! Generating keys is still useful! Think of automating the desktop! And mobile keyboards need to send the »enter« key when submitting text!" To this, I reply: don't worry, there's a better way out.
I wasn't actually done complaining about keyboard event handling. In this Pandora's box, you can't know the consequences of your key presses in advance. There is no standard mandating that pressing Ctrl+X results in removing text. There isn't even a standard saying what to do when you press the humble "a"! Applications can do anything they want based on the current key map, locale, and the phase of the Moon, if their authors so please.
Trying to predict any of that in an input method engine is folly. You'd need to know everything about every application. Or cover 90% and hope for the best, but we're trying to be better than that.
So for the mobile use case, there are **actions**. Currently, there's only one: finish editing. Works like pressing "Enter" would: submit the field, or go to the next one.
There are more ideas outside of experimental, depending on the use case: adding more generic actions, defining a shortcuts protocol, and fixing virtual-keyboard. Keep reading, they are described later in this post.
### Across the board
**Consolidation**. While you still kinda have to choose between popups and panel, the experimental protocols are designed to be exactly compatible with text-input-v3 and its future developments, where the (precious little) social momentum seems to be.
**Atomic updates** group requests into units, applied all at once. Going from "blog pose|" to "blog post|" takes two logical steps: remove the last letter, add "t". If the IME submits them as two separate steps, the application doesn't know the intended output. When it receives the first step, it will draw and display "blog pos|". On receiving the second step, it discards that and draws "blog post|". The first step ends up being a waste of time and battery life. In extreme cases, it could lead to flickering as, for example, the candidate popup changes position to follow the cursor.
With atomic updates, the application can safely wait until the end of the command to start doing its work.
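The grouping logic can be sketched with a toy text-field model. The method names echo the text-input requests (delete surrounding text, commit a string, commit), but the struct itself is hypothetical; the real protocol speaks Wayland requests, not method calls:

```rust
// Individual requests only record intent in a pending update...
#[derive(Default)]
struct PendingUpdate {
    delete_before: usize, // bytes to delete before the cursor
    insert_text: String,  // text to insert at the cursor
}

impl PendingUpdate {
    fn delete_surrounding(&mut self, bytes: usize) {
        self.delete_before += bytes;
    }
    fn commit_string(&mut self, s: &str) {
        self.insert_text.push_str(s);
    }
}

// ...and only a commit applies the whole group, causing a single redraw.
struct TextField {
    text: String,
    redraws: u32,
}

impl TextField {
    fn commit(&mut self, pending: PendingUpdate) {
        let keep = self.text.len().saturating_sub(pending.delete_before);
        self.text.truncate(keep);
        self.text.push_str(&pending.insert_text);
        self.redraws += 1;
    }
}

fn main() {
    let mut field = TextField { text: "blog pose".into(), redraws: 0 };
    let mut pending = PendingUpdate::default();
    // Two logical steps: remove the trailing "e", add "t"...
    pending.delete_surrounding(1);
    pending.commit_string("t");
    // ...applied atomically: one redraw, no intermediate "blog pos".
    field.commit(pending);
    assert_eq!(field.text, "blog post");
    assert_eq!(field.redraws, 1);
}
```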
The protocols so far make a dangerous assumption: that requests get received immediately. What makes it dangerous is byte indexing.
Whenever there is a need to change cursor position, the new position is specified as a byte offset. This works well if the part of the system choosing the offset knows exactly and without delay what text is being indexed.
But Wayland is a distributed system. After the IME receives the contents of the text field, and before it chooses a byte offset, the text can change. With a bit of bad luck, the chosen offset may then fall inside a UTF-8 code point. The result could be trying to delete half a character like "я". Complete nonsense.
Existing protocols don't say what to do, but the experimental ones **remove such ambiguities**.
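The hazard is easy to demonstrate: a byte offset is only meaningful against the exact text it was computed for. Here's a small sketch using `str::is_char_boundary` from the Rust standard library; `validate_offset` is a hypothetical helper, not part of any protocol:

```rust
// Check whether a byte offset can be applied to the given text at all.
fn validate_offset(text: &str, byte_offset: usize) -> Result<usize, &'static str> {
    if byte_offset > text.len() {
        return Err("offset past the end of the text");
    }
    if !text.is_char_boundary(byte_offset) {
        // The offset points into the middle of a multi-byte UTF-8 code
        // point: the text must have changed since the offset was chosen.
        return Err("offset splits a code point");
    }
    Ok(byte_offset)
}

fn main() {
    // The IME saw "abc" and chose byte offset 1 (right after "a").
    assert!(validate_offset("abc", 1).is_ok());
    // But the text changed to "яc" before the request arrived: offset 1
    // now falls inside the two-byte "я", and applying it is nonsense.
    assert!(validate_offset("яc", 1).is_err());
}
```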
#### Compatibility
This is a bigger problem, so it deserves its own section.
With the protocols evolving, the need to stay **compatible** between the IME and the application arises.
In this model of input methods, the compositor acts as little more than a forwarder of events between the application and the input method. All is fine if they use matching protocol versions – ones that were literally made for each other, like text-input-v3.1 and input-method-v2.1. But what if they don't? Wayland allows clients to negotiate any protocol version they like, up to the highest one supported by the compositor, so there's no guarantee they match.
Imagine that (1) the application lands on a higher protocol version and the IME on a lower one. In the best case, the user won't be able to take advantage of the app's features. In the worst case, the protocol semantics differ slightly (which Wayland allows) – for example, strings are in UTF-8 in version .4, but version .5 switches to UTF-16 – the two sides can't communicate, the user can't type, and the useless laptop gets thrown at the developer.
What about the opposite situation, where (2) the application uses a lower protocol version, and the IME uses a higher one that the application has never heard of? Even if the above doesn't happen, the user may try to use some feature in the IME that the app just doesn't understand. Not a good look – it would be good to at least hide those features. But that needs another communication channel.
So that's my solution: to have the compositor – which knows who uses which protocol version – communicate a compatibility level to the IME. In addition, when the protocol contains optional features (like different *actions*), it directly communicates what the application can handle, so the IME can avoid presenting unsupported options to the user.
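The idea can be sketched in a few lines. The version numbers and action names below are made up for illustration, not taken from any real protocol:

```rust
// A protocol version as negotiated by one client.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Version(u32);

// The compositor knows which version each side negotiated, so it can
// tell the IME the level it may actually rely on: whichever side is
// less capable caps the other.
fn compatibility_level(app: Version, ime: Version) -> Version {
    app.min(ime)
}

// Optional features work the same way: the IME only presents to the
// user the actions the application declared it can handle.
fn usable_actions<'a>(ime: &[&'a str], app: &[&str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for &action in ime {
        if app.iter().any(|&supported| supported == action) {
            out.push(action);
        }
    }
    out
}

fn main() {
    // The app negotiated v4 and the IME v6: the IME must behave as v4.
    assert_eq!(compatibility_level(Version(4), Version(6)), Version(4));
    // The IME supports three actions, but the app only understands one,
    // so only that one gets shown to the user.
    assert_eq!(usable_actions(&["done", "search", "send"], &["done"]), vec!["done"]);
}
```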
But this is not a perfect solution: it only addresses the case where the application is less capable or older than the input method. After a little brainstorming with other Wayland developers about bidirectional capability negotiation, the conclusion was that it's too much trouble to implement: every application library would need to implement an unusual, complicated version negotiation mechanism.
Instead, we cut the Gordian knot by requiring that the IME is always up to date.
This is a trade-off, and like all trade-offs, it comes with downsides. Users can't keep using unmaintained IMEs. Also, when shipping a distribution, all shipped IMEs must be updated or removed before any compositor starts supporting newer protocol versions.
**Addendum:** key forwarding.
Forwarding keys through the IME actually shares many of the above considerations, regardless of whether we're talking about event filtering or about grabbing all key events and re-generating some of them. The IME needs to understand the events to make a forwarding decision, and the application needs to understand what the IME sends, so they must agree on a common set of functionality.
Thankfully, keyboards are a lot better understood than input methods. The protocols change a lot less, and it's clear what the base functionality is. Still, the protocols sometimes change, like the addition of server-side key repeat.
With grab and re-send, the common features were defined in the input method protocol directly. That's simple, but you need to redefine everything keyboard-related.
The filtering protocol lets the IME and the application use the keyboard protocol directly. The only alteration is that key events may be withheld. That should leave only a few situations where the two sides can't just communicate. As a downside, the compositor must figure out how to make things work when they can't.
While designing the protocol, I was aware of this tradeoff, so it's kept separate from the input method protocol and can be rejected on its own if the consensus is that the tradeoff sucks.
## Text input version 3.2
What do we gain if we don't go experimental yet, but do take in the v3.2 improvements?
With it, the application takes part in setting the **visibility of the on-screen keyboard panel**, if there is one. I think it's something Android applications like to do.
There's a small improvement in **synchronizing** the displayed windows that fixes edge cases.
It allows the IME to take over **displaying pre-edit** for applications that really don't want to do it.
Also, it gives the application more say in what **kind of text** the input method engine should provide. While it's useful to send information like "numbers only", I expect this feature to be used primarily in ways that make things worse for the user. But hey, we don't have any input method user testing, so there's literally zero empirical information to help make the right call.
It also completely neglects the IME counterpart protocol, where things like popups live.
## Future
Of course, there's more. There *are* a few users who made their wishes known.
What to do when the **text field loses focus** while pre-edit is ongoing? Korean writers expect the pre-edit to be turned into permanent text. Chinese writers don't.
Some Chinese writers even suggested that the IME could come back to the same state when focusing the same field again.
That leads to wider persistence considerations. If I always chat with Дима in Russian, I'd like to always have my preferred Russian layout and the Russian language dictionary activated. The application could supply **stable text input field identifiers** for the IME, so the IME could save my settings permanently.
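A sketch of what that persistence could look like, assuming a hypothetical stable identifier string that the application supplies for each text field (the identifier format here is invented):

```rust
use std::collections::HashMap;

// The IME state worth remembering per field.
#[derive(Clone, Debug, PartialEq)]
struct ImeState {
    layout: String,
    dictionary: String,
}

// Settings keyed by the application-supplied stable field identifier.
#[derive(Default)]
struct ImeSettings {
    by_field: HashMap<String, ImeState>,
}

impl ImeSettings {
    // Save whatever the user configured while this field was focused.
    fn remember(&mut self, field_id: &str, state: ImeState) {
        self.by_field.insert(field_id.to_string(), state);
    }
    // Restore it the next time the same field gains focus.
    fn recall(&self, field_id: &str) -> Option<&ImeState> {
        self.by_field.get(field_id)
    }
}

fn main() {
    let mut settings = ImeSettings::default();
    settings.remember(
        "chat-app/contact:dima", // hypothetical stable identifier
        ImeState { layout: "ru(phonetic)".into(), dictionary: "ru".into() },
    );
    // Focusing the same chat again brings back the Russian setup.
    assert_eq!(
        settings.recall("chat-app/contact:dima").map(|s| s.layout.as_str()),
        Some("ru(phonetic)"),
    );
    // An unknown field falls back to the defaults.
    assert!(settings.recall("browser/url-bar").is_none());
}
```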
The keyboard Pandora's box could be closed back again with more work around **shortcuts** and actions. Think copy, paste, undo. There's a bunch of actions already defined in XKB in the form of keysyms (XKB_KEY_Undo, XKB_KEY_Redo, XKB_KEY_Menu, and so on), but it would be nice to have a protocol that's separate from keyboards entirely so that we don't try to extend it back into the problem we're just trying to avoid.
Further along that way, using global shortcuts or something similar, we could eventually build a system that allows **generating arbitrary actions**, while also preventing shortcut clashes (because all shortcuts would be registered centrally).
I saw **type-to-search** functionality being requested: the application doesn't focus a text field, but the field gets focused as soon as something is typed. Thinking about it more deeply, text-input-v3.2's ability for the application to hide the panel seems like it could solve this: for the purposes of the IME, the field is focused, but no panel is shown.
## Guesses
Things that no one asked for, but for which I think the input method is the right place, are:
- **speech to text**. I think all is in place for that already.
- **handwriting recognition**. On a train, I saw a lady write on a tablet with a stylus. The text turned into typed text automatically. I think we'd need to define another popup type, which covers the whole text field, and which the IME can make transparent to receive stylus events. This definitely needs more thought.
- **gamepad input**, and other unusual input devices. I want my computer to be a proper game console! So does Valve, and they already do it. But Valve uses their own input method protocol. Hey Valve, why not join the standards track?
## Non-goals
The umbrella of user input is broad. Some things look like they share concerns with input methods, but the commonality might not be that big.
One important out-of-scope case is **keyboard emulation**. If you look at an on-screen keyboard from an input method and one used to emulate an array of switches, you may not immediately see the difference. Looking deeper, one may have buttons like "copy", while the other will have "Alt" and "Control".
Apart from needing a panel, though, those are entirely different things. My input method effort completely ignores keyboard emulation. It would be just too much to handle.
But I can offer you a hint if that's what you need: fix the virtual-keyboard protocol. Rip out key map switching, and make it always follow the system key map. Switching key maps is a headache, and the reason v1 of the protocol was not accepted into wayland-protocols. Go ahead and try to implement it, and you'll see why.
**Typing in terminal windows** with an IME is also out of scope. We're back to the problem where key presses have no standardized consequences. When you type ":q!", will you get this text or will you lose all your data? Nobody knows, because there's no protocol to inform the input method whether it's in text edit mode.
If you wanted to correct this, you'd basically need to take all I did in Wayland and redesign it for VT100. Or whatever protocol terminals use these days.
Sorry, fellow hackers, I'm not motivated enough to do that. I code in Kate. Implement dumb keyboard emulation instead.
## Upstreaming
This effort will live or die depending on adoption, and adoption survives on upstreaming.
All along, I kept in touch with KDE through their input goal. They aren't yet implementing the experiments in their on-screen keyboard or in KWin, so I hope this writeup helps clarify the benefits.
My own experimental IME is, sadly, useless. It's just too slow for practical use. If you're familiar with integrating egui and you want to fix reinitializing the GPU on every redraw, you can be my friend.
When upstreaming things to toolkits, I had a generally good experience. Thanks, winit maintainers! Thanks, floem folks! But I only managed to add some things missing from text-input-v3.1 before running out of steam, so the experimental improvements haven't been tried yet.
I made the least progress on the compositor side, where even MRs bringing it up to date with base Wayland protocols have been languishing for months.
Perhaps the lukewarm response to my work means that it comes before its time. Maybe what we need is a breakaway success of Linux Mobile, which would push input methods onto the main stage. Maybe we need a large business *cough* like Valve *cough* to do their own thing while contributing to the upstream.
Or maybe we need to wait until X11 stops being supported and the previously working ways to input CJK text stop working, causing riots that are too hard to ignore by the developer community.
## What next?
Myself, I'm taking a back seat now. I plan to push this forward only when I feel particularly inspired – after all, I have other things I'm burning to do now.
So, will input methods become a pleasure to use on Wayland? Now it's up to you – the community.