hardBacktrackingAI-applied~36 min
Tokenizer — All Valid Segmentations
A subword tokenizer must enumerate every way an un-spaced input string can be split into tokens from its vocabulary — useful when scoring alternative segmentations before picking the most probable one.
Problem
Given a string s and a dictionary of words wordDict, return all sentences where s is segmented into a space-separated sequence of dictionary words. Each dictionary word may be reused. Return every valid segmentation.
Input
A string s (1 ≤ length ≤ 20) and wordDict of up to 1000 distinct words, each up to length 10.
Output
A list of strings, each a space-joined valid segmentation of s (any order).
Constraints
- Words in the dictionary are reusable
- Memoise by suffix to avoid recomputing shared tails
- There may be exponentially many segmentations
Examples
Example 1
Input
s = "catsanddog", wordDict = ["cat","cats","and","sand","dog"]
Output
["cats and dog","cat sand dog"]
Two valid splits exist.
Example 2
Input
s = "pineapplepenapple", wordDict = ["apple","pen","applepen","pine","pineapple"]
Output
["pine apple pen apple","pineapple pen apple","pine applepen apple"]
Three distinct segmentations.