martin carpenter / blog / Determining password dump origin using n-grams

contact

blog

2022/01/31
determining password dump origin using n-grams

2015/07/01
escape from ansible

2015/06/24
solaris elfsign vulnerability

2014/10/22
metasploiting nrpe

2014/01/24
readable setuid cores in desktop ubuntu

2014/01/20
enlightenment sysactions vulnerability

determining password dump origin using n-grams

2022/01/31

tags: password

Introduction

Password dumps sometimes appear in intelligence feeds but often there is no clue as to from what service or application the dump came. By exploiting a weakness in the way some users choose passwords we can sometimes work out the name of the originating application.

Method

It turns out that users like to use the name of the application (or part of the name, or an acronym of the name) in their passwords. This, obviously, is not a recommended best practice but if you have enough users you can guarantee that at least some number of them will do this. However you still don't know the name of the application in question, how long it is, whether it's at the start of the password, or the end or somewhere in the middle...

This is where n-grams come in. An n-gram is simply a sequence of consecutive characters of length n. For example, the 4-grams (n-grams of length 4) of "password" are: "pass", "assw", "sswo", "swor" and "word". For some value of n, we can find all of the n-grams each password, count how many of each there are, order them by frequency, and see what pops out. Since computers are zoomy and application names and user-chosen passwords are typically short we can do this for all values of n between, say, 3 and 10.

Dump data is nearly always dirty, and we don't know what its character encoding might be. All we're really looking for is common sequences of bytes, (plain old char*) so I just ignored encoding. I also didn't do any other pre-processing of the data. (Case folding, for example, might be a good addition but I didn't need it).

This seemed like a nice little project so I thought I'd try a new language. Nim is friendly, with fast compilation and runtime and sensible documentation. I produced a legible 25-line solution in under an hour from first contact:

import os, posix, strutils, tables

proc main() =
  if paramCount() < 1:
    quit("synopsis: " & getAppFilename() & " n")
  let n = parseInt(paramStr(1))

  var counter = newCountTable[string]()
  var password = ""

  while stdin.readLine(password):
    for i in 0..password.len-n:
      let ngram = password[i..i+n-1]
      counter.inc(ngram)

  # If the user interrupts output after the top m entries this is
  # O(m.no_ngrams). So for large m this may be worse than the "naive"
  # solution of calling counter.sort().  But typically m << no_ngrams.
  signal(SIG_PIPE, SIG_DFL) # no error on truncated output
  while len(counter) > 0:
    var (ngram, freq) = counter.largest()
    counter.del(ngram)
    echo ngram & "   " & $freq

main()

Compile with:

$ nim c -o:ngram ngram.nim

Does it work?

Example

In February 2018, health site MyFitnessPal was compromised and attackers stole the database containing 150 million user credentials. MyFitnessPal noticed the breach on the 25th of March, disclosed it on the 29th and mandated password changes for all users. Some passwords were hashed using bcrypt, which is a solid choice, but many were hashed using the older SHA-1 algorithm, which is far less solid.

We can guess that this discrepancy was due to a slow-running algorithm migration, upgrading users to the new hash algorithm as they logged in. MFP could instead have pre-emptively hashed the SHA-1 hashes with bcrypt until next login which would have made the attacker's life somewhat harder. Instead the old SHA-1 hashes simplified the job of the attacker and lists of de-hashed passwords started to circulate on the dark web. UK Tech News reported that those password dumps were being advertised for sale for $20,000 in 2019.

4 years on and this is old news. Those same dumps can be found on the open internet. (Even national crime agencies are sharing password dumps with third parties these days). There is some selection bias here since dumps only contains passwords from hashes that were successfully cracked but it's good enough to demonstrate the concept. I looked at an alleged partial dump containing 50 million passwords, removed the user information and tidied up line endings.

Below is the output of ngram n <passwords.txt | head, for n∊[3..14]. Each run took between one minute (n=14) and five minutes (n=6). Imagine that you don't know the dump is from MFP.

Rank	n=3	n=4	n=5	n=6	n=7	n=8	n=9	n=10	n=11	n=12	n=13	n=14
1	000	0000	10000	100000	fitness	password	fitnesspa	fitnesspal	myfitnesspa	myfitnesspal	myfitnesspal1	myfitnesspal12
2	123	1000	00000	fitnes	passwor	fitnessp	myfitness	myfitnessp	yfitnesspal	yfitnesspal1	myfitnesspal2	caloriecounter
3	100	love	ilove	itness	assword	yfitness	itnesspal	yfitnesspa	fitnesspal1	justinbieber	yfitnesspal12	yfitnesspal123
4	ove	1234	00001	100001	itnessp	itnesspa	yfitnessp	loseweight	christopher	onedirection	fitnesspal123	myfitnesspal20
5	and	ness	tness	ssword	yfitnes	myfitnes	password1	basketball	password123	yfitnesspal2	livelaughlove	physiciansplan
6	ell	ilov	12345	asswor	tnesspa	tnesspal	oseweight	weightloss	justinbiebe	fitnesspal12	aloriecounter	yfitnesspal201
7	lov	0001	itnes	passwo	myfitne	princess	chocolate	volleyball	ustinbieber	caloriecount	caloriecounte	$HEX[6a6f686e3
8	ter	itne	fitne	123456	nesspal	sunshine	loseweigh	tinkerbell	nedirection	peanutbutter	myfitnesspal4	weightwatchers
9	201	2345	passw	100002	charlie	iloveyou	butterfly	itnesspal1	looseweight	itnesspal123	myfitnesspal0	myfitnesspal19
10	ess	tnes	sword	weight	loveyou	assword1	elizabeth	hellokitty	onedirectio	livelaughlov	fitnesspal201	fitnesspal2013

Notice anything?

In The Science of Guessing: Analyzing an Anonymized Corpus of 70 Million Passwords, Bonneau speculates on repeated passwords in obfuscated password data from Yahoo!:

However, there were eight passwords which occurred with probability at least 0.01% in every subpopulation. Without access to the raw passwords, we can only speculate that these are numeric passwords as these are popular and internationalize well.

Is it possible one of these repeated passwords was simply "yahoo" or "yahoo!"? In the partial dump that I analyzed, 0.11% of passwords contained the substring "myfitnesspal" and 0.05% (1 in 2000) were exactly "myfitnesspal".

And finally

Theologians: you might like the passwords referencing Jeremiah 29:11 and Philippians 4:13.
Tortured souls with passwords "chocolate" and "peanutbutter": I feel you.
The joker using only Unicode BLACK CIRCLE (U+25CF) characters (●●●●...): are you really copy/pasting the password prompt obfuscation character back into the prompt so that you can log in with any string of the right length?! Cute, but so broken.

martin carpenter

contents

determining password dump origin using n-grams

Introduction

Method

Example

And finally