

1 Association rule mining

In this notebook, you’ll implement the basic pairwise association rule mining algorithm. To keep the implementation simple, you will apply it to a simplified dataset, namely, letters (“items”) in words (“receipts” or “baskets”). Having finished that code, you will then apply it to some grocery store market basket data. If you write the code well, it will not be difficult to reuse the building blocks from the letter case in the basket case.

1.1 Problem definition

Let’s say you have a fragment of text in some language. You wish to know whether there are association rules among the letters that appear in a word. In this problem:

• Words are “receipts”

• Letters within a word are “items”

You want to know whether there are association rules of the form a ⇒ b, where a and b are letters. You will write code to do that by calculating, for each rule, its confidence, conf(a ⇒ b). “Confidence” will be another name for an estimate of the conditional probability of b given a, or Pr[b | a].
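As a concrete sketch of this definition, confidence can be estimated by simple counting. The toy word list below is illustrative only (it is not part of the assignment data):

```python
from collections import defaultdict

# Illustrative "receipts": each word is a basket of unique letters.
words = ["error", "ore", "roe"]

item_counts = defaultdict(int)   # how many words contain letter a
pair_counts = defaultdict(int)   # how many words contain both a and b
for w in words:
    letters = set(w)             # deduplicate letters within a word
    for a in letters:
        item_counts[a] += 1
        for b in letters:
            if a != b:
                pair_counts[(a, b)] += 1

# conf(a => b) estimates Pr[b | a]: among baskets containing a,
# the fraction that also contain b.
conf = {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}
print(conf[('e', 'r')])  # every word containing 'e' also contains 'r' => 1.0
```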

1.2 Sample text input

Let’s carry out this analysis on a “dummy” text fragment, which graphic designers refer to as the lorem ipsum:

In [ ]: latin_text = """
Sed ut perspiciatis, unde omnis iste natus error sit
voluptatem accusantium doloremque laudantium, totam
rem aperiam eaque ipsa, quae ab illo inventore
veritatis et quasi architecto beatae vitae dicta
sunt, explicabo. Nemo enim ipsam voluptatem, quia
voluptas sit, aspernatur aut odit aut fugit, sed
quia consequuntur magni dolores eos, qui ratione
voluptatem sequi nesciunt, neque porro quisquam est,
qui dolorem ipsum, quia dolor sit amet consectetur
adipisci[ng] velit, sed quia non numquam [do] eius
modi tempora inci[di]dunt, ut labore et dolore
magnam aliquam quaerat voluptatem. Ut enim ad minima
veniam, quis nostrum exercitationem ullam corporis
suscipit laboriosam, nisi ut aliquid ex ea commodi
consequatur? Quis autem vel eum iure reprehenderit,
qui in ea voluptate velit esse, quam nihil molestiae
consequatur, vel illum, qui dolorem eum fugiat, quo
voluptas nulla pariatur?

At vero eos et accusamus et iusto odio dignissimos
ducimus, qui blanditiis praesentium voluptatum
deleniti atque corrupti, quos dolores et quas
molestias excepturi sint, obcaecati cupiditate non
provident, similique sunt in culpa, qui officia
deserunt mollitia animi, id est laborum et dolorum
fuga. Et harum quidem rerum facilis est et expedita
distinctio. Nam libero tempore, cum soluta nobis est
eligendi optio, cumque nihil impedit, quo minus id,
quod maxime placeat, facere possimus, omnis voluptas
assumenda est, omnis dolor repellendus. Temporibus
autem quibusdam et aut officiis debitis aut rerum
necessitatibus saepe eveniet, ut et voluptates
repudiandae sint et molestiae non recusandae. Itaque
earum rerum hic tenetur a sapiente delectus, ut aut
reiciendis voluptatibus maiores alias consequatur
aut perferendis doloribus asperiores repellat.
"""

        print("First 100 characters:\n {} ...".format(latin_text[:100]))

Exercise 0 (ungraded). Look up and read the translation of lorem ipsum!

Data cleaning. Like most data in the real world, this dataset is noisy: it has both uppercase and lowercase letters, words have repeated letters, and there are all sorts of non-alphabetic characters. For our analysis, we should keep all the letters and spaces (so we can identify distinct words), but we should ignore case and ignore repetition within a word.

For example, the eighth word of this text is “error.” As an itemset, it consists of the three unique letters, {e, o, r}. That is, treat the word as a set, meaning you only keep the unique letters. This itemset has three possible itempairs: {e, o}, {e, r}, and {o, r}.
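The enumeration above can be sketched with the standard library’s itertools.combinations, which yields exactly these unordered pairs:

```python
from itertools import combinations

word = "error"
itemset = set(word)                                 # unique letters only
itempairs = list(combinations(sorted(itemset), 2))  # all unordered pairs
print(itempairs)  # [('e', 'o'), ('e', 'r'), ('o', 'r')]
```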

Start by writing some code to help “clean up” the input.

Exercise 1 (normalize_string_test: 2 points). Complete the following function, normalize_string(s). The input s is a string (str object). The function should return a new string with (a) all characters converted to lowercase and (b) all non-alphabetic, non-whitespace characters removed.

Clarification. Scanning the sample text, latin_text, you may see things that look like special cases, for instance, inci[di]dunt and [do]. For these, simply remove the non-alphabetic characters and only separate words where there is explicit whitespace. For instance, inci[di]dunt would become incididunt (as a single word), and [do] would become do as a standalone word, because the original string has whitespace on either side. A period or comma without whitespace would, similarly, just be treated as a non-alphabetic character inside a word unless there is explicit whitespace. So e pluribus.unum basium would become e pluribusunum basium, even though your common-sense understanding might separate pluribus and unum.

Hint. Regard as a whitespace character anything “whitespace-like”: not just regular spaces, but also tabs, newlines, and perhaps others. To detect whitespace easily, look for a “high-level” function that can do this for you, rather than checking for literal space characters.

In [ ]: def normalize_string(s):
            assert type(s) is str
            ### YOUR CODE HERE ###

        # Demo:
        print(latin_text[:100], "...\n=>", normalize_string(latin_text[:100]), "...")

In [ ]: # `normalize_string_test`: Test cell
        norm_latin_text = normalize_string(latin_text)
        assert type(norm_latin_text) is str
        assert len(norm_latin_text) == 1694
        assert all([c.isalpha() or c.isspace() for c in norm_latin_text])
        assert norm_latin_text == norm_latin_text.lower()
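For reference, here is one way to pass a test like the one above. This is a sketch, not necessarily the intended solution; it relies on str.isalpha() and str.isspace(), per the hint:

```python
def normalize_string(s):
    assert type(s) is str
    # Keep alphabetic characters (lowercased) and anything whitespace-like;
    # drop everything else (punctuation, brackets, digits, ...).
    return ''.join(c.lower() for c in s if c.isalpha() or c.isspace())

print(normalize_string("Adipisci[ng] velit!"))     # adipiscing velit
print(normalize_string("e pluribus.unum basium"))  # e pluribusunum basium
```

Note that tabs and newlines survive unchanged, since str.isspace() is true for them; only explicit whitespace separates words.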


Exercise 2 (get_normalized_words_test: 1 point). Implement the following function, get_normalized_words(s). It takes as input a string s (i.e., a str object). It should return a list of the words in s, after normalization per the definition of normalize_string(). (That is, the input s may not be normalized yet.)

In [ ]: def get_normalized_words(s):
            assert type(s) is str
            ### YOUR CODE HERE ###

        # Demo:
        print("First five words:\n{}".format(get_normalized_words(latin_text)[:5]))

In [ ]: # `get_normalized_words_test`: Test cell
        norm_latin_words = get_normalized_words(norm_latin_text)
        assert len(norm_latin_words) == 250
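One possible sketch for this exercise, again for reference only: it reuses the normalize_string logic from Exercise 1 (restated here so the snippet is self-contained) and relies on str.split(), which with no arguments splits on any run of whitespace:

```python
def normalize_string(s):
    # Assumed helper matching Exercise 1's spec: lowercase, and keep only
    # alphabetic and whitespace characters.
    assert type(s) is str
    return ''.join(c.lower() for c in s if c.isalpha() or c.isspace())

def get_normalized_words(s):
    assert type(s) is str
    # split() with no arguments splits on arbitrary runs of whitespace
    # (spaces, tabs, newlines) and discards empty strings.
    return normalize_string(s).split()

print(get_normalized_words("Sed ut perspiciatis,\nunde omnis!"))
# ['sed', 'ut', 'perspiciatis', 'unde', 'omnis']
```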

