# State of input method

*This is an edited version of the "Input method on Wayland is broken and it's my fault" talk which I presented at FOSDEM 2024.*

My encounter with input methods started with the GNU/Linux-based Librem 5 phone, when I got hired to make sure people can type text on it.

Librem 5

As you can see, the phone has no keyboard. That would make entering text difficult on a typical Linux computer. Thankfully, we're not the first who needed to enter text without depending on a physical keyboard. The solutions to that are collectively called *input methods*. They can cover handwriting recognition, or entering scripts like Chinese where the characters on the keyboard are by themselves insufficient, or – and this is what we're looking for – selecting characters on on-screen "keyboards".

A screenshot of typing banglay

But how do we place an input method within the context of a GNU/Linux operating system?

Consider Wayland, which has been chosen as the centerpiece of the user interface on the Librem 5.

Wayland is a set of protocols connecting applications to the user, by means of outputs, like the graphical display, and inputs: the mouse and keyboard. (It doesn't cover anything relating to sound, though, nor joysticks, for some reason.) My task was to find an alternative to typing, so it would cover some of the tasks that a keyboard typically does. That suggests that Wayland is indeed the right place to put an input method – in place of the keyboard.

It turns out that I wasn't the only one having that thought. Wayland has seen attempts to create protocols for this: `zwp_input_method_unstable_v1` and `zwp_text_input_unstable_v2`.

I took those protocols and tried to implement them, but I used them wrong, resulting in breakage which I'm about to warn you against.

Haha, just kidding. My transgressions against Wayland were much worse.

The two existing protocols were very basic, and didn't really cover the actual needs of nontrivial input methods. I decided to update them and create improved versions. This is where it gets much worse: I embedded mistakes within the protocols, so no matter how hard you try, you can't use them correctly.

## Mistake 1: being too conservative

A user of a modern, keyboard-less mobile phone expects the on-screen keyboard to do more than just input text. Actions like moving the cursor within a text field, moving to the next one, or submitting the form are not related to typing – they don't produce text. Yet, because of the expectation, the tech stack on the Librem 5 had to support this.

I decided to not introduce too many changes at the same time, and re-use an experimental keyboard emulation protocol that we supported for other reasons. That protocol would be used to submit just the non-text actions.

It looked great, until it became clear that it's a dead end.

To understand why, you have to take a look at how keyboard protocols work. What the user wants is to submit text. With a text-input protocol, it's easy. Entering the text "Błąd" would look something like this:

  1. Use text "B"
  2. Use text "Bł"
  3. Use text "Błą"
  4. Use text "Błąd"
  5. Finish
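
The steps above can be sketched as a toy model. This is not the real Wayland text-input interface: the `TextField` class and its `use_text` and `finish` methods are hypothetical names, chosen only to mirror the list. The point is that every event carries text directly, so no translation step is needed.

```python
class TextField:
    """Hypothetical text field receiving text-input style events."""

    def __init__(self):
        self.committed = ""  # text that is already part of the field
        self.pending = ""    # text still being composed

    def use_text(self, text):
        # Each event carries the whole pending text and replaces the previous one.
        self.pending = text

    def finish(self):
        # The pending text becomes part of the field's contents.
        self.committed += self.pending
        self.pending = ""


text_field = TextField()
for step in ["B", "Bł", "Błą", "Błąd"]:
    text_field.use_text(step)
text_field.finish()
print(text_field.committed)  # Błąd
```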

But a keyboard is not at its core a device to enter text, but to press keys. It's concerned with buttons and whether they were pressed and released. Our simple example looks more like:

  1. KeyDown `SHIFT`
  2. KeyDown `B`
  3. KeyUp `B`
  4. KeyUp `SHIFT`
  5. KeyDown `AltGr`
  6. KeyDown `L`

and so on.

In reality, the buttons don't even have names, but numbers that need to get resolved to names. It looks closer to this:

  1. KeyMap *0xffffbced
  2. ModifierDown 0x1
  3. KeyDown 0x102
  4. KeyUp 0x102
  5. ModifierUp 0x1
  6. ModifierDown 0x4
  7. KeyDown 0x56

and so on.

The tables responsible for resolving button numbers to actual text are called keymaps, and they are the problem here.
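
As a rough illustration of what a keymap does, here is a minimal sketch that resolves the numeric events from the listing above into text. The numbers are the ones from the listing, but which letter each of them stands for is my assumption for the sake of the example, and real keymaps (such as XKB ones) are far richer than a lookup table.

```python
# Modifier bits and key codes as in the listing; their meaning is assumed.
SHIFT = 0x1
ALTGR = 0x4

KEYMAP = {
    # (key code, active modifiers) -> text
    (0x102, 0): "b",
    (0x102, SHIFT): "B",
    (0x56, 0): "l",
    (0x56, ALTGR): "ł",
}


def resolve(events, keymap):
    """Turn a stream of (kind, value) events into text using a keymap."""
    modifiers = 0
    text = []
    for kind, value in events:
        if kind == "ModifierDown":
            modifiers |= value
        elif kind == "ModifierUp":
            modifiers &= ~value
        elif kind == "KeyDown":
            text.append(keymap.get((value, modifiers), ""))
        # KeyUp produces no text in this simplified model.
    return "".join(text)


events = [
    ("ModifierDown", SHIFT), ("KeyDown", 0x102), ("KeyUp", 0x102), ("ModifierUp", SHIFT),
    ("ModifierDown", ALTGR), ("KeyDown", 0x56), ("KeyUp", 0x56), ("ModifierUp", ALTGR),
]
print(resolve(events, KEYMAP))  # Bł
```

The same events fed through a different keymap would give different text, which is exactly why the choice of keymap matters.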

There are keyboards that support multiple scripts. A button can be labelled with "Q" and "Й" at the same time, even though its number can't be changed. The user is responsible for telling the software to use the currently intended keymap.

Then, there are custom keyboards. Those can send arbitrary key numbers as well, but there's no guarantee that a button labeled "P" on one sends the same number as the button labeled "P" on another. Again, the user must indicate the intended keymap.

An emulated keyboard is a kind of custom keyboard, needing a custom keymap, independent of any physical keyboards that might be connected.

But Wayland combines all keyboard events into a single stream before giving them to an application.

Diagram showing two different keyboards with different letters for the same code, and how they are interpreted as a single keyboard by the application

That means the application must always have the correct keymap for every key press. This works until you consider that a user might want to smash and hold keys on multiple distinct keyboards at the same time. Long story short, Wayland maintainers told me that it's basically impossible to make this work due to corner cases, and they won't accept the keyboard emulation protocol.
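
To get a feeling for one such corner case, consider this toy model. Everything in it is an assumption made for illustration: the key codes, the two keymaps, and an application that naively tracks held keys by the symbol they resolve to. It is not necessarily one of the cases the maintainers had in mind.

```python
# Two hypothetical keymaps giving different letters to the same key codes.
LATIN = {0x10: "q", 0x11: "w"}
CYRILLIC = {0x10: "й", 0x11: "ц"}


class App:
    """Application seeing one merged keyboard with one current keymap."""

    def __init__(self):
        self.keymap = {}
        self.held = set()  # symbols the application believes are held down

    def handle(self, kind, value):
        if kind == "KeyMap":
            self.keymap = value
        elif kind == "KeyDown":
            self.held.add(self.keymap[value])
        elif kind == "KeyUp":
            self.held.discard(self.keymap[value])


merged_stream = [
    ("KeyMap", LATIN), ("KeyDown", 0x10),     # keyboard A: "q" pressed and held
    ("KeyMap", CYRILLIC), ("KeyDown", 0x11),  # keyboard B: "ц" pressed
    ("KeyUp", 0x11),                          # keyboard B: "ц" released
    ("KeyUp", 0x10),                          # keyboard A: release, read as "й"
]

app = App()
for kind, value in merged_stream:
    app.handle(kind, value)
print(app.held)  # {'q'}
```

The release from keyboard A gets interpreted with keyboard B's keymap, so the application never learns that "q" was let go.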

This is bad. A protocol without Wayland buy-in will not get widespread adoption. And we must support non-text events if we want to meet users' expectations about how a phone on-screen keyboard works.

This makes emulating keyboards a dead end, and this mistake results in having no viable solution for the phone use case.

### Actions

Not all is bad. There has been some talk about an "actions" protocol to cover those needs. Its shape is still unclear, but it could allow the user to do things like copy and paste, select all, submit the form, etc.

## Mistake 2: bad synchronization

What should happen if you're entering two words into a text field but the application lags for a moment?

If your answer is "the application should show both words", then I agree with you.

Except this is forbidden in my protocol. The text-input protocol I came up with will drop events a lot for that reason, and you'll end up with missing letters. That's bad.

I did it for a reason. An input method must be aware of the text inside the input field, if only to present the user with autocomplete suggestions. I came up with a protocol which carries the state identifier with every request.

Diagram showing input method sending "M" based on state 0, and input method in state 0 accepting it

Every event of an input method pertains to some state, so that the application can reject events if they apply to text which is no longer there.

Diagram showing input method sending "M" based on state 0, and then "Mo" based on state 0 again, and input method being in state 0 and accepting "M" and afterwards being in state 1 and rejecting "Mo" due to state mismatch

So when the input method sends two events at once, and both apply to the original state, then after the first event is applied, the original state is already gone and the second event is automatically invalid.
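
The rejection can be sketched in a few lines. This is a simplified model rather than the actual wire protocol: the `TextField` class, its `apply` method and the integer state counter are names I made up for this example.

```python
class TextField:
    """Hypothetical application-side text field with a state identifier."""

    def __init__(self):
        self.text = ""
        self.state = 0  # identifies the current contents

    def apply(self, based_on, new_text):
        """Apply an input method event only if it refers to the current state."""
        if based_on != self.state:
            return False  # the event was computed against text that is gone
        self.text = new_text
        self.state += 1  # any change produces a new state
        return True


field = TextField()
# The input method sends both events before hearing back from the
# application, so both are based on state 0.
results = [field.apply(0, "M"), field.apply(0, "Mo")]
print(results, field.text)  # [True, False] M
```

The second event is dropped even though nothing but the input method itself touched the text, and the user ends up with a missing letter.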

"That's wrong", I hear you say. "Why not just let the input method override state changes? It would only be a problem if the user is typing into the input field from another source, and this is such a contrived case that it's not worth caring about."

To this, I say: I don't want to mandate which use cases are contrived or not. Initially, that was my only reasoning behind this aspect, but before FOSDEM, I came up with an actually realistic use case: two people editing the same text field. We don't even have to get into Wayland seats. It's already a thing in collaboratively edited Web documents: the text may change here while you're typing there. Should your input method automatically override the other person's edits?

This is a hard problem, and I have no solution to it.

## Progress

Sadly, since I was moved to other work, there has been very little progress on those fronts. The last change to the new text-input protocol is 2 years old, and it adds my 4-year-old commit.

If you want to take part in the effort to change this, feel free to contact me.


