IndexedDict: A Data Structure for Vocabulary in NLP Models


In NLP models, we generally embed words from one-hot vector space, where each word represented by single-entry vector ([0,...,1,...,0]), to low dimensional space, which is continuous embedding space. This is basically done by multiplying one-hot vectors with a projection matrix. Albeit, we don’t materialize this vectors and the multiplication in computer, it is done by indexing the projection matrix. Therefore, we don’t need to store one-hot vector [0,...,1,...,0] at no time, instead we store index of the single-entry-1(e.g 874). You may check onehot.jl from Flux.jl to see how this is implemented.

However, the problem is in generating those vectors, and Flux.jl uses search(O(n)) to get indices from given labels(words). This is little problematic if you want to train your model in a large dataset. I think we need hashing instead of searching.

function onehot(l, labels)
  i = something(findfirst(isequal(l), labels), 0) # <= !!!search!!!
  i > 0 || error("Value $l is not in labels")
  OneHotVector(i, length(labels))
end

So, we were using a Dictionary structure which maps words to their index in the vocabulary. We, then, store only word indices(=OneHot structure) in the final data. Dictionary structure is also helpful in easily creating your vocabulary by single pass over data.

julia> data = "this is an example data with example words"
"this is an example data with example words"

julia> vocab = Dict{String,Int}()
Dict{String,Int64} with 0 entries

julia> for word in split(data)
           get!(vocab,word,length(vocab)+1)
       end

julia> vocab
Dict{String,Int64} with 7 entries:
  "this"    => 1
  "is"      => 2
  "example" => 4
  "data"    => 5
  "words"   => 7
  "with"    => 6
  "an"      => 3

julia> vocab["example"]
4

julia> encoded = map(x->vocab[x],split(data))
8-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6
 4
 7

However, you also may want to decode those indices while you are printing your data. The dictionary structure that we use maps word=>index not index=>word. Simply, an array of the words, in the order of their vocabulary index, will serve for latter mapping that we need. What we were doing was storing an extra word array separately, and use them when needed.

julia> function invert(vocab)
              int2tok = Array{String}(undef,length(vocab))
              for (k,v) in vocab; int2tok[v] = k; end
              return int2tok
       end

julia> words = invert(vocab)
7-element Array{String,1}:
 "this"
 "is"
 "an"
 "example"
 "data"
 "with"
 "words"

julia> words[4]
"example"

julia> vocab["example"]
4

This enable us to go in both direction in O(1) time by using 2x memory. However, this two separate data structure (words and vocab) was making my codes complicated. So, this is where IndexedDict comes to play. IndexedDict is an indexed dictionary which works in both ways. You can think it as a multiple inheritance of Dictionary and Array. You may use push! and append! methods to add new words to your vocabulary. It automatically checks new words and adds them to dictionary. You can index it with integers or you can also get the index of an element.

julia> data = "this is an example data with example words"
"this is an example data with example words"

julia> vocab = IndexedDict{String}()
IndexedDict{String}(Dict{String,Int64}(), String[])

julia> append!(vocab,split(data))
IndexedDict{String}(Dict("this"=>1,"is"=>2,"example"=>4,"data"=>5,"words"=>7,"with"=>6,"an"=>3), ["this", "is", "an", "example", "data", "with", "words"])

julia> vocab["example"]
4

julia> vocab[4]
"example"

julia> vocab[[1,2,3,4]]
4-element Array{String,1}:
 "this"
 "is"
 "an"
 "example"

julia> vocab[["example","this"]]
2-element Array{Int64,1}:
 4
 1

julia> push!(vocab,"a-new-word")
IndexedDict{String}(Dict("a-new-word"=>8,"this"=>1,"is"=>2,"example"=>4,"data"=>5,"words"=>7,"with"=>6,"an"=>3), ["this", "is", "an", "example", "data", "with", "words", "a-new-word"])

julia> vocab["a-new-word"]
8

julia> length(vocab)
8

You may also directly create an IndexedDict from an existing array or a dictionary, but you need to be careful you supply unique elements or a proper dictionary.

julia> data = ["this", "is", "an", "example", "data", "with", "unique","words"]
8-element Array{String,1}:
 "this"
 "is"
 "an"
 "example"
 "data"
 "with"
 "unique"
 "words"

julia> vocab = IndexedDict(data)
IndexedDict{String}(Dict("unique"=>7,"this"=>1,"is"=>2,"example"=>4,"data"=>5,"words"=>8,"with"=>6,"an"=>3), ["this", "is", "an", "example", "data", "with", "unique", "words"])

julia> vocab["unique"]
7

julia> vocab[7]
"unique"

julia> length(vocab)
8
julia> dict = Dict("word"=>1,"this"=>2)
Dict{String,Int64} with 2 entries:
  "this" => 2
  "word" => 1

julia> vocab = IndexedDict(dict)
IndexedDict{String}(Dict("this"=>2,"word"=>1), ["word", "this"])

julia> vocab["this"]
2

julia> vocab[2]
"this"

julia> length(vocab)
2

With IndexedDict my parsing and model codes became more readable, so I would like to share it. If you have better data structure for the purpose let me know!

The source code:

Leave a Reply

Your email address will not be published. Required fields are marked *

*