Introducing the first ever shitpost-based RNG

Using Twitter as an entropy source: the Internet is a chaotic mess, which happens to be a good source of entropy

x0rz
Just another infosec blog type of thing

--

Hey, what’s entropy?

Looking at Wikipedia:

entropy is the randomness collected by an operating system or application for use in cryptography or other uses that require random data.

Basically, when you’re generating keys (let’s say RSA 2048) you need true random numbers. You don’t want them to be deterministic (based on a timestamp or some vague ID): cryptography relies on good randomness.
For example, when a computer boots up, how can its randomness differ from that of any other computer? This is an especially big issue for IoT devices (routers, for instance) with very few sources of entropy, and yet they (sometimes) have to generate keys at boot time! The result is hosts sharing the same keys due to bad randomness:

Source: http://web.stanford.edu/class/ee380/Abstracts/150513-slides.pdf

On Linux, you have what is called an “entropy pool”, a pool of bits assumed to be truly random. The pool is fed from various sources considered somewhat random (mouse movements, keystrokes, network events, etc.). To see how many bits of entropy are available on Linux, you can issue this command:

cat /proc/sys/kernel/random/entropy_avail

Note that if you read from /dev/random (making your system generate random data), the available entropy will drop quite rapidly and the read will block until more entropy is available. If you need to generate lots of random data, you can use /dev/urandom, which is non-blocking.
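The same counter can also be read from a program. Here is a minimal Python sketch, assuming a Linux host; the function name read_entropy_avail and its path parameter are mine, for illustration only:

```python
from pathlib import Path

# Location of the kernel's entropy estimate on Linux.
ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_avail(path: str = ENTROPY_AVAIL) -> int:
    """Return the kernel's current entropy estimate, in bits."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    print(f"{read_entropy_avail()} bits of entropy available")
```

Calling it in a loop while reading from /dev/random in another terminal is an easy way to watch the estimate move.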

How can we add entropy on Linux?

One basic trick is to move your mouse and type random things on your keyboard. The Linux kernel will automatically add these inputs as entropy bits. But one cannot do that all day long, right?

Looking at the random man page:

Writing to /dev/random or /dev/urandom will update the entropy pool with the data written […]

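Writing alone does not raise the kernel’s entropy count, though. For that, we have to issue an ioctl() system call with the RNDADDENTROPY request on a file descriptor for /dev/random, which adds the data and tells the kernel to update its entropy count accordingly.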
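To make the ioctl() part concrete, here is a rough Python sketch. It assumes the struct rand_pool_info layout from <linux/random.h> (an int entropy count, an int buffer size, then the buffer itself) and the request number used on common architectures such as x86 and ARM. The helper names pack_rand_pool_info and add_entropy are hypothetical, and the call needs root:

```python
import fcntl
import os
import struct

# RNDADDENTROPY is _IOW('R', 0x03, int[2]) in <linux/random.h>.
# Assumption: the usual ioctl number encoding (x86, ARM); a few
# architectures encode ioctl numbers differently.
RNDADDENTROPY = 0x40085203


def pack_rand_pool_info(data: bytes, entropy_bits: int) -> bytes:
    """Build a struct rand_pool_info: entropy_count, buf_size, then the buffer."""
    return struct.pack("ii", entropy_bits, len(data)) + data


def add_entropy(data: bytes, entropy_bits: int) -> None:
    """Mix data into the pool and credit entropy_bits to it (requires root)."""
    # fcntl.ioctl() copies immutable arguments into a 1024-byte buffer,
    # so keep data under roughly 1000 bytes per call.
    fd = os.open("/dev/random", os.O_WRONLY)
    try:
        fcntl.ioctl(fd, RNDADDENTROPY, pack_rand_pool_info(data, entropy_bits))
    finally:
        os.close(fd)
```

The entropy_bits value is a claim you make to the kernel about how unpredictable the data is, so it should be a conservative estimate, well below 8 bits per byte for something like tweets.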

More about entropy on CloudFlare’s blog:

CloudFlare is known to use lava lamps as a source of entropy, so why not use some of that Internet noise too?

Twitter gibberish

We all know Twitter is a great noise source: Russian bots, Trump rants, human interactions, malware traffic and so on. This is a great entropy source, as its content is quite random and unpredictable. Unpredictability is an important security property: an adversary must not be able to guess the output of the random number generator.

But Twitter is public data, is it cryptographically secure?

You’re right! It isn’t secret data, but it is unpredictable. Using Twitter as your only source of entropy would be terribly bad (see footnote). But your system knows better: it already has a few other entropy sources! Entropy sources are mixed together, and properly mixing in one more source has this incredible property: it cannot leave you with less entropy than you already had. The more, the better!

Getting random Tweets

Twitter allows anyone to retrieve a small random sample of all public statuses being published in real-time using the GET statuses/sample API.
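As a rough sketch of the collection step, the snippet below turns a stream of newline-delimited JSON messages into one blob of concatenated UTF-8 tweet text. It assumes each status arrives as a JSON object with a "text" field, and that other messages (delete notices, blank keep-alive lines) should be skipped. Connecting and authenticating to the API is left out, and the function name tweets_to_bytes is hypothetical:

```python
import json
from typing import Iterable


def tweets_to_bytes(lines: Iterable[str]) -> bytes:
    """Concatenate the UTF-8 text of every status found in a stream of JSON lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive line
        message = json.loads(line)
        if not isinstance(message, dict):
            continue  # not a status object
        text = message.get("text")
        if isinstance(text, str):  # delete notices and similar carry no text
            chunks.append(text.encode("utf-8"))
    return b"".join(chunks)


if __name__ == "__main__":
    # Made-up sample messages, for illustration only.
    sample = ['{"text": "covfefe"}', "", '{"delete": {"status": {"id": 1}}}']
    print(tweets_to_bytes(sample))
```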

Twitter noise (concatenated UTF-8 encoded tweets)

Sampling 500KB of raw tweet data, entropy is around 6.5519 bits per byte (true random would reach 8) and the arithmetic mean value is 135.65 (random would be around 127.5). This is not perfectly random, but we’re getting there!
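For reference, both figures are easy to compute yourself. The sketch below uses the standard Shannon entropy over byte frequencies, H = -Σ p(b) · log2 p(b), and a plain average of the byte values. It is my own illustration, not necessarily the tool used for the measurements above:

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        raise ValueError("need at least one byte")
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def arithmetic_mean(data: bytes) -> float:
    """Average byte value; uniformly random bytes average around 127.5."""
    if not data:
        raise ValueError("need at least one byte")
    return sum(data) / len(data)
```

A buffer containing every byte value exactly once scores the maximum of 8 bits per byte, while a buffer of one repeated byte scores 0. Keep in mind this only measures byte frequencies, so it says nothing about patterns such as repeated hashtags or URLs.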

Now, we could either “compress” the bytes until they have high entropy (hashing) or mix them with another PRNG (inheriting the selected PRNG’s weaknesses, so not the most secure way of dealing with this issue; see footnotes).
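Both options fit in a few lines of Python. This is only a sketch of the idea: SHA-256 as the “compressor” and XOR as the mixing step are my choices for illustration, and the function names are hypothetical:

```python
import hashlib
import os


def condense(raw: bytes) -> bytes:
    """Option 1: hash the raw tweet bytes down to a 32-byte digest (SHA-256)."""
    return hashlib.sha256(raw).digest()


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def mixed_random(raw: bytes) -> bytes:
    """Option 2: XOR the condensed tweets with bytes from the system PRNG."""
    digest = condense(raw)
    return xor_bytes(digest, os.urandom(len(digest)))
```

As long as the two inputs are independent, the XOR output is at least as unpredictable as the better of the two, which is why mixing in a questionable source does no harm here.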

Mixing it with /dev/urandom, here is what our shitpost-based RNG looks like:

Twitter noise mixed with PRNG: Twitter-based random

I made a simple implementation of a Twitter-based entropy collector, called tweetentro.py.

Caveats

This entropy source isn’t truly random; it has some statistical biases: there will be lots of hashtags (#), links and emojis. We can safely assume there will be repeating occurrences in the data (trending hashtags, popular links, …). Not to mention that this stream can be manipulated (bots spamming), even though I highly doubt any attack of that sort is practical.

In addition, according to the developer documentation, if two different clients connect to the API they will get served the same Tweets from Twitter. So data won’t be truly “random” for these two… As a result, this entropy source should be considered a shared resource. I wouldn’t depend on it as the only additional source of entropy to generate sensitive keys. You should mix it with additional sources of entropy (a hardware RNG, for example).

⚠️ The purpose of this article is to illustrate a joke & recreational RNG. Not something to be taken seriously.

tl;dr: Do not use this for sensitive cryptographic operations.

Feel free to discuss cryptographically secure randomness or entropy with me on Twitter 😉

If you liked the article, you can also buy me a coffee ☕ anytime!
