martin carpenter


most popular
2012/05/05, updated 2012/12/15
ubuntu unity lens for vim

determining password dump origin using n-grams


tags: password


Password dumps sometimes appear in intelligence feeds but often there is no clue as to from what service or application the dump came. By exploiting a weakness in the way some users choose passwords we can sometimes work out the name of the originating application.


It turns out that users like to use the name of the application (or part of the name, or an acronym of the name) in their passwords. This, obviously, is not a recommended best practice but if you have enough users you can guarantee that at least some number of them will do this. However you still don't know the name of the application in question, how long it is, whether it's at the start of the password, or the end or somewhere in the middle...

This is where n-grams come in. An n-gram is simply a sequence of consecutive characters of length n. For example, the 4-grams (n-grams of length 4) of "password" are: "pass", "assw", "sswo", "swor" and "word". For some value of n, we can find all of the n-grams each password, count how many of each there are, order them by frequency, and see what pops out. Since computers are zoomy and application names and user-chosen passwords are typically short we can do this for all values of n between, say, 3 and 10.

Dump data is nearly always dirty, and we don't know what its character encoding might be. All we're really looking for is common sequences of bytes, (plain old char*) so I just ignored encoding. I also didn't do any other pre-processing of the data. (Case folding, for example, might be a good addition but I didn't need it).

This seemed like a nice little project so I thought I'd try a new language. Nim is friendly, with fast compilation and runtime and sensible documentation. I produced a legible 25-line solution in under an hour from first contact:

import os, posix, strutils, tables

proc main() =
  if paramCount() < 1:
    quit("synopsis: " & getAppFilename() & " n")
  let n = parseInt(paramStr(1))

  var counter = newCountTable[string]()
  var password = ""

  while stdin.readLine(password):
    for i in 0..password.len-n:
      let ngram = password[i..i+n-1]

  # If the user interrupts output after the top m entries this is
  # O(m.no_ngrams). So for large m this may be worse than the "naive"
  # solution of calling counter.sort().  But typically m << no_ngrams.
  signal(SIG_PIPE, SIG_DFL) # no error on truncated output
  while len(counter) > 0:
    var (ngram, freq) = counter.largest()
    echo ngram & "   " & $freq


Compile with:

$ nim c -o:ngram ngram.nim

Does it work?


In February 2018, health site MyFitnessPal was compromised and attackers stole the database containing 150 million user credentials. MyFitnessPal noticed the breach on the 25th of March, disclosed it on the 29th and mandated password changes for all users. Some passwords were hashed using bcrypt, which is a solid choice, but many were hashed using the older SHA-1 algorithm, which is far less solid.

We can guess that this discrepancy was due to a slow-running algorithm migration, upgrading users to the new hash algorithm as they logged in. MFP could instead have pre-emptively hashed the SHA-1 hashes with bcrypt until next login which would have made the attacker's life somewhat harder. Instead the old SHA-1 hashes simplified the job of the attacker and lists of de-hashed passwords started to circulate on the dark web. UK Tech News reported that those password dumps were being advertised for sale for $20,000 in 2019.

4 years on and this is old news. Those same dumps can be found on the open internet. (Even national crime agencies are sharing password dumps with third parties these days). There is some selection bias here since dumps only contains passwords from hashes that were successfully cracked but it's good enough to demonstrate the concept. I looked at an alleged partial dump containing 50 million passwords, removed the user information and tidied up line endings.

Below is the output of ngram n <passwords.txt | head, for n∊[3..14]. Each run took between one minute (n=14) and five minutes (n=6). Imagine that you don't know the dump is from MFP.


Notice anything?

In The Science of Guessing: Analyzing an Anonymized Corpus of 70 Million Passwords, Bonneau speculates on repeated passwords in obfuscated password data from Yahoo!:

However, there were eight passwords which occurred with probability at least 0.01% in every subpopulation. Without access to the raw passwords, we can only speculate that these are numeric passwords as these are popular and internationalize well.

Is it possible one of these repeated passwords was simply "yahoo" or "yahoo!"? In the partial dump that I analyzed, 0.11% of passwords contained the substring "myfitnesspal" and 0.05% (1 in 2000) were exactly "myfitnesspal".

And finally