Substitution mining
Mine substitutions with various mining models.
This module defines several classes and mixins to mine substitutions in the MemeTracker dataset with a series of different models.
Time, Source, Past and Durl together define how a substitution Model behaves. Interval is a utility class used internally in Model. The ClusterMinerMixin mixin builds on this definition of a substitution model to provide ClusterMinerMixin.substitutions(), which iterates over all valid substitutions in a Cluster. Finally, mine_substitutions_with_model() brings ClusterMinerMixin and SubstitutionValidatorMixin (which checks for spam substitutions) together to mine for all substitutions in the dataset for a given Model.
-
class brainscopypaste.mine.ClusterMinerMixin
Bases: object
Mixin for Clusters that provides substitution mining functionality.
This mixin defines the substitutions() method (based on the private _substitutions() method) that iterates through all valid substitutions for a given Model.
-
classmethod _substitutions(source, durl, model)
Iterate through all substitutions from source to durl considered valid by model.
This method yields all the substitutions between source and durl when model allows for multiple substitutions.
Parameters:
source : Quote
    Source for the substitutions.
durl : Url
    Destination url for the substitutions.
model : Model
    Model that validates the substitutions between source and durl.
-
substitutions(model)
Iterate through all substitutions in this cluster considered valid by model.
Multiple occurrences of a sentence at the same url (url "frequency") are ignored, so as not to artificially inflate results.
Parameters:
model : Model
    Model for which to mine substitutions in this cluster.
Yields:
substitution : Substitution
    All the substitutions in this cluster considered valid by model. When model allows for multiple substitutions between a quote and a destination url, each substitution is yielded individually. Any substitution yielded is attached to this cluster, so if you use this in a session_scope(), substitutions will be saved automatically unless you explicitly roll back the session.
-
class brainscopypaste.mine.Durl
Bases: enum.Enum
Type of quotes accepted as substitution destinations.
-
all = <Durl.all: 1>
All quotes are potential destinations for substitutions.
-
-
class brainscopypaste.mine.Interval(start, end)
Bases: object
Time interval defined by start and end datetimes.
Parameters:
start : datetime.datetime
    The interval's start (or left) bound.
end : datetime.datetime
    The interval's end (or right) bound.
Raises: Exception
    If start is strictly after end in time.
Examples
Test if a datetime is in an interval:
>>> from datetime import datetime
>>> from brainscopypaste.mine import Interval
>>> itv = Interval(datetime(2016, 7, 5, 12, 15, 5),
...                datetime(2016, 7, 9, 13, 30, 0))
>>> datetime(2016, 7, 8) in itv
True
>>> datetime(2016, 8, 1) in itv
False
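For readers without the package at hand, the membership test above can be reproduced with a minimal stand-in. This is only a sketch: the class name IntervalSketch is hypothetical, and treating the interval as half-open (start included, end excluded) is an assumption, since the documentation does not say how the bounds are handled.

```python
from datetime import datetime


class IntervalSketch:
    """Hypothetical stand-in for Interval, not the library's implementation."""

    def __init__(self, start, end):
        # Documented contract: start must not be strictly after end.
        if start > end:
            raise Exception("start must not be strictly after end")
        self.start = start
        self.end = end

    def __contains__(self, moment):
        # Assumption: half-open interval [start, end).
        return self.start <= moment < self.end


itv = IntervalSketch(datetime(2016, 7, 5, 12, 15, 5),
                     datetime(2016, 7, 9, 13, 30, 0))
inside = datetime(2016, 7, 8) in itv
outside = datetime(2016, 8, 1) in itv
```

Implementing `__contains__` is what lets the `in` operator work directly on interval objects, as in the example above.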
-
class brainscopypaste.mine.Model(time, source, past, durl, max_distance)
Bases: object
Substitution mining model.
A mining model is defined by the combination of one parameter for each of Time, Source, Past, Durl, and a maximum hamming distance between source string (or substring) and destination string. This class represents such a model. It defines a couple of utility functions used in ClusterMinerMixin (find_start() and past_surls()), and a validate() method which determines if a given substitution conforms to the model. Other methods, prefixed with an underscore, are utilities for the methods cited above.
Parameters:
time : Time
    Type of time defining how occurrence bins of the model are positioned.
source : Source
    Type of quotes that the model accepts as substitution sources.
past : Past
    How far back the model looks for substitution sources.
durl : Durl
    Type of quotes that the model accepts as substitution destinations.
max_distance : int
    Maximum number of substitutions between a source string (or substring) and a destination string that the model will detect.
Raises: Exception
    If max_distance is more than half of MT_FILTER_MIN_TOKENS.
-
_Model__key()
Unique identifier for this model, used to compute e.g. equality between two Model instances.
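The constructor check on max_distance and the key-based equality can be sketched together. Everything here is illustrative: ModelSketch is a hypothetical class, the value given to MT_FILTER_MIN_TOKENS is a placeholder rather than the project's actual setting, and plain strings stand in for the Time, Source, Past and Durl enum members.

```python
# Placeholder value, assumed for illustration only.
MT_FILTER_MIN_TOKENS = 5


class ModelSketch:
    """Hypothetical sketch of a model with key-based equality."""

    def __init__(self, time, source, past, durl, max_distance):
        # Documented constraint: at most half of MT_FILTER_MIN_TOKENS.
        if max_distance > MT_FILTER_MIN_TOKENS / 2:
            raise Exception("max_distance is too large for this filter")
        self.time = time
        self.source = source
        self.past = past
        self.durl = durl
        self.max_distance = max_distance

    def _key(self):
        # All defining parameters, in a fixed order.
        return (self.time, self.source, self.past, self.durl,
                self.max_distance)

    def __eq__(self, other):
        if not isinstance(other, ModelSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Hash consistently with __eq__ so models can be dict keys.
        return hash(self._key())


m1 = ModelSketch("discrete", "majority", "last_bin", "all", 1)
m2 = ModelSketch("discrete", "majority", "last_bin", "all", 1)
m3 = ModelSketch("continuous", "all", "all", "all", 2)
```

Deriving both `__eq__` and `__hash__` from one key keeps the two consistent, which matters if models are used as cache keys for memoized methods.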
-
_distance_start(source, durl)
Get a (distance, start) tuple indicating the minimal distance between source and durl, and the position of source's substring that achieves that minimum.
This is in fact an alias for what the model considers to be valid transformations and how to define them, but provides proper encapsulation of concerns.
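One way to picture a (distance, start) computation is a sliding window: compare the destination against every same-length run of the source and keep the best match. The sketch below assumes the comparison is made on token lists and that ties go to the earliest position; both are assumptions, and the function names are hypothetical.

```python
def hamming(tokens1, tokens2):
    """Number of positions at which two equal-length sequences differ."""
    if len(tokens1) != len(tokens2):
        raise ValueError("sequences must have the same length")
    return sum(t1 != t2 for t1, t2 in zip(tokens1, tokens2))


def distance_start(source_tokens, durl_tokens):
    """Return (minimal distance, start of the best window in the source)."""
    width = len(durl_tokens)
    if width > len(source_tokens):
        raise ValueError("destination is longer than source")
    candidates = (
        (hamming(source_tokens[start:start + width], durl_tokens), start)
        for start in range(len(source_tokens) - width + 1)
    )
    # Tuples compare by distance first, then by start position.
    return min(candidates)


source = "the quick brown fox jumps over the lazy dog".split()
destination = "brown cat jumps".split()
result = distance_start(source, destination)
```

Here the window starting at token 2 ("brown fox jumps") differs from the destination in a single token, so the result is (1, 2).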
-
_past(cluster, durl)
Get an Interval representing what this model considers to be the past before durl.
See Time and Past to understand what this interval looks like. This method is memoized() for performance.
-
_validate_base(source, durl)
Check that source has at least one occurrence in what this model considers to be the past before durl.
-
_validate_distance(source, durl)
Check that source and durl differ by no more than self.max_distance.
-
_validate_durl(source, durl)
Check that durl is an acceptable substitution destination occurrence for this model.
This method proxies to the proper validation method, depending on the value of self.durl.
-
_validate_source(source, durl)
Check that source is an acceptable substitution source for this model.
This method proxies to the proper validation method, depending on the value of self.source.
-
bin_span = datetime.timedelta(1)
Span of occurrence bins the model makes.
-
drop_caches()
Drop the caches of all memoized() methods of the class.
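The interplay between memoized methods and a cache-dropping method can be illustrated with the standard library. This sketch uses functools.lru_cache rather than the project's own memoized() helper, whose implementation is not shown here; the Squarer class and its call counter are hypothetical.

```python
from functools import lru_cache


class Squarer:
    """Hypothetical class with one cached method and a cache reset."""

    calls = 0  # counts how often the real computation runs

    @lru_cache(maxsize=None)
    def square(self, x):
        Squarer.calls += 1
        return x * x

    @classmethod
    def drop_caches(cls):
        # Forget every cached result of the memoized method.
        cls.square.cache_clear()
```

Calling square(3) twice on the same instance runs the computation once; after drop_caches() the next call computes again. Dropping caches is useful when the underlying data changes, since cached answers may then be stale.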
-
find_start(source, durl)
Get the position of the substring of source that achieves minimal distance to durl.
-
past_surls(cluster, durl)
Get the list of all Urls that are in what this model considers to be the past before durl.
This method is memoized() for performance.
-
validate(source, durl)
Test if potential substitutions from source quote to durl destination url are valid for this model.
This method is memoized() for performance.
Parameters:
source : Quote
    Candidate source quote for substitutions; the substitutions can be from a substring of source.string.
durl : Url
    Candidate destination url for the substitutions.
Returns: bool
    True if the proposed source and destination url are considered valid by this model, False otherwise.
-
-
class brainscopypaste.mine.Past
Bases: enum.Enum
How far back in the past can a substitution find its source.
-
all = <Past.all: 1>
The past is everything: substitution sources can be in any bin preceding the destination occurrence (which is an interval that can end at midnight before the destination occurrence when using Time.discrete).
-
last_bin = <Past.last_bin: 2>
The past is the last bin: substitution sources must be in the bin preceding the destination occurrence (which can end at midnight before the destination occurrence when using Time.discrete).
-
-
class brainscopypaste.mine.Source
Bases: enum.Enum
Type of quotes accepted as substitution sources.
-
all = <Source.all: 1>
All quotes are potential sources for substitutions.
-
majority = <Source.majority: 2>
Majority rule: only quotes that are the most frequent in the considered past bin can be the source of substitutions (note that several quotes in a single bin can have the same maximal frequency).
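The majority rule, including the case where several quotes tie for the maximal frequency, can be sketched with a counter. The function name and the representation of a bin as a flat list of quote strings are assumptions made for illustration.

```python
from collections import Counter


def majority_quotes(bin_occurrences):
    """Return the set of quotes with maximal frequency in a bin."""
    counts = Counter(bin_occurrences)
    if not counts:
        return set()
    top = max(counts.values())
    # Several quotes may share the top frequency; keep all of them.
    return {quote for quote, count in counts.items() if count == top}


tied = majority_quotes(["a", "b", "a", "b", "c"])
single = majority_quotes(["a", "a", "b"])
```

In the first call both "a" and "b" appear twice, so both qualify as majority sources; in the second only "a" does.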
-
-
class brainscopypaste.mine.SubstitutionValidatorMixin
Bases: object
Mixin for Substitution that adds validation functionality.
A non-negligible part of the substitutions found by ClusterMinerMixin are spam or changes we're not interested in: minor spelling changes, abbreviations, changes of articles, symptoms of a deleted word that appear as substitutions, etc. This class defines the validate() method, which tests for all these cases and returns whether or not the substitution is worth keeping.
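To give a feel for this kind of filtering, here is a rough sketch of a word-pair filter. The specific rules (case-only changes, prefix-style abbreviations, article swaps, edit distance of one) and the names used are illustrative guesses, not the checks the mixin actually performs.

```python
# Hypothetical article list, for illustration only.
ARTICLES = {"a", "an", "the"}


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def keep_substitution(word1, word2):
    """Return False for pairs that look like uninteresting changes."""
    a, b = word1.lower(), word2.lower()
    if a == b:
        return False  # case-only change
    if a.startswith(b) or b.startswith(a):
        return False  # abbreviation-like or suffix-only change
    if a in ARTICLES and b in ARTICLES:
        return False  # change of article
    if levenshtein(a, b) <= 1:
        return False  # minor spelling change
    return True
```

With these rules, ("color", "colour") and ("the", "a") are rejected, while ("happy", "glad") is kept.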
-
class brainscopypaste.mine.Time
Bases: enum.Enum
Type of time that determines the positioning of occurrence bins.
-
continuous = <Time.continuous: 1>
Continuous time: bins are sliding, end at the destination occurrence, and start Model.bin_span before that.
-
discrete = <Time.discrete: 2>
Discrete time: bins are aligned at midnight, end at or before the destination occurrence, and start Model.bin_span before that.
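The way Time and Past combine to position the past interval can be sketched as follows. This is an interpretation of the descriptions above, not the library's code: the function name is hypothetical, strings stand in for the enum members, and starting the "all" past at the cluster's first occurrence is an assumption.

```python
from datetime import datetime, timedelta

BIN_SPAN = timedelta(days=1)  # mirrors the documented Model.bin_span


def past_bounds(cluster_start, destination, time, past):
    """Return (start, end) of the past before a destination occurrence."""
    if time == "continuous":
        # Sliding bin: ends exactly at the destination occurrence.
        end = destination
    elif time == "discrete":
        # Bin aligned at midnight: ends at the midnight at or before it.
        end = destination.replace(hour=0, minute=0, second=0, microsecond=0)
    else:
        raise ValueError("unknown time type")

    if past == "last_bin":
        start = end - BIN_SPAN
    elif past == "all":
        # Assumption: everything since the cluster's first occurrence.
        start = min(cluster_start, end)
    else:
        raise ValueError("unknown past type")
    return start, end


first = datetime(2016, 7, 5, 12, 0)
dest = datetime(2016, 7, 9, 13, 30)
sliding = past_bounds(first, dest, "continuous", "last_bin")
aligned = past_bounds(first, dest, "discrete", "last_bin")
everything = past_bounds(first, dest, "discrete", "all")
```

For a destination at 13:30 on July 9, the continuous last bin runs from 13:30 on July 8 to 13:30 on July 9, while the discrete last bin covers July 8 from midnight to midnight.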
-
-
brainscopypaste.mine._get_wordnet_words()
Get the set of all words known by WordNet.
This is the set of all lemma names for all synonym sets in WordNet.
-
brainscopypaste.mine.mine_substitutions_with_model(model, limit=None)
Mine all substitutions in the MemeTracker dataset conforming to model.
Iterates through the whole MemeTracker dataset to find all substitutions that are considered valid by model, and saves the results to the database. The MemeTracker dataset must have been loaded and filtered previously, or an exception will be raised (see Usage or cli for more about that). Mined substitutions are saved each time the function moves to a new cluster, and progress is printed to stdout. The number of substitutions seen and the number of substitutions kept (i.e. validated by SubstitutionValidatorMixin.validate()) are also printed to stdout.
Parameters:
model : Model
    The substitution model to use for mining.
limit : int, optional
    If not None (default), mining will stop after limit clusters have been examined.
Raises: Exception
    If no filtered clusters are found in the database, or if the database already contains substitutions mined with model.
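The overall shape of the mining loop (iterate over clusters, optionally stop after limit, count substitutions seen and kept) can be sketched without a database. The function below is a hypothetical simplification: clusters are plain iterables, and the two callables stand in for ClusterMinerMixin.substitutions() and SubstitutionValidatorMixin.validate(). Database checks, saving and progress output are left out.

```python
from itertools import islice


def mine(clusters, find_substitutions, is_valid, limit=None):
    """Return (kept substitutions, number seen, number kept)."""
    seen = 0
    kept = []
    # islice with a stop of None iterates over every cluster.
    for cluster in islice(clusters, limit):
        for substitution in find_substitutions(cluster):
            seen += 1
            if is_valid(substitution):
                kept.append(substitution)
    return kept, seen, len(kept)


# Toy data: each "cluster" is a list of integers, and a
# "substitution" is considered valid when it is even.
toy_clusters = [[1, 2], [3], [4, 5, 6]]
limited = mine(toy_clusters, lambda c: c, lambda s: s % 2 == 0, limit=2)
full = mine(toy_clusters, lambda c: c, lambda s: s % 2 == 0)
```

With limit=2 only the first two clusters are examined, so three substitutions are seen and one is kept; without a limit, six are seen and three are kept.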