Database models¶
Database models and related utilities.
This module defines the database structure underlying storage for the analysis. This consists in models that get turned into PostgreSQL tables by SQLAlchemy, along with a few utility classes, functions and exceptions around them.
Cluster
and Quote
represent respectively an individual
cluster or quote from the MemeTracker data set. Url
represents a quote
occurrence, and those are stored as attributes of Quote
s (as opposed
to in their own table). Substitution
represents an individual
substitution mined with a given substitution Model
.
Each model (except Url
, which doesn’t have its own table) inherits the
BaseMixin
, which defines the table name, id field, and provides a
common clone()
method.
On top of that, models define a few computed properties (using the
utils.cache
decorator) which provide useful information that doesn’t
need to be stored directly in the database (storing that in the database would
make first access faster, but introduces more possibilities of inconsistent
data if updates don’t align well). Cluster
and Substitution
also inherit functionality from the mine
, filter
and
features
modules, which you can inspect for more details.
Finally, this module defines save_by_copy()
, a useful function to
efficiently import clusters and quotes in bulk into the database.
-
class
brainscopypaste.db.
ArrayOfEnum
(item_type, as_tuple=False, dimensions=None, zero_indexes=False)[source]¶ Bases:
sqlalchemy.dialects.postgresql.base.ARRAY
ARRAY of ENUMs column type, which is not directly supported by DBAPIs.
This workaround is provided by SQLAchemy’s documentation.
-
class
brainscopypaste.db.
BaseMixin
[source]¶ Bases:
object
Common mixin for all models defining a table name, an id field, and a clone() method.
-
clone
(**fields)[source]¶ Clone a model instance, excluding the original id and optionally setting some fields to values provided as arguments.
Give the fields to override as keyword arguments, their values will be set on the cloned instance. Any field that is not a known table column is ignored.
-
id
= Column(None, Integer(), table=None, primary_key=True, nullable=False)¶ Primary key for the table.
-
-
class
brainscopypaste.db.
Cluster
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
,brainscopypaste.db.BaseMixin
,brainscopypaste.filter.ClusterFilterMixin
,brainscopypaste.mine.ClusterMinerMixin
Represent a MemeTracker cluster of quotes in the database.
Attributes below are defined as class attributes or
cache
d methods, but they appear as instance attributes when you have an actual cluster instance. For instance, if cluster is aCluster
instance, cluster.size will give you that instance’ssize
.-
filtered
¶ Boolean indicating whether this cluster is part of the filtered (and kept) set of clusters or not.
-
format_copy
()[source]¶ Create a string representing the cluster in a
cursor.copy_from()
or_copy()
call.
-
format_copy_columns
= ('id', 'sid', 'filtered', 'source')¶ Tuple of column names that are used by
format_copy()
.
-
frequency
¶ Complete number of occurrences of all the quotes in the cluster (i.e. counting url frequencies).
Look at
size_urls
for a count that ignores url frequencies.
-
quotes
¶ List of
Quote
s in this cluster (this is a dynamic relationship on which you can run queries).
-
sid
¶ Id of the cluster that originated this instance, i.e. the id as it appears in the MemeTracker data set.
-
size
¶ Number of quotes in the cluster.
-
size_urls
¶ Number of urls of all the quotes in the cluster (i.e. not counting url frequencies)
Look at
frequency
for a count that takes url frequencies into account.
-
source
¶ Source data set from which this cluster originated. Currently this is always memetracker.
-
-
class
brainscopypaste.db.
ModelType
(*args, **kwargs)[source]¶ Bases:
sqlalchemy.sql.type_api.TypeDecorator
Database type representing a substitution
Model
, used in the definition ofSubstitution
.-
impl
¶ alias of
String
-
-
class
brainscopypaste.db.
Quote
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
,brainscopypaste.db.BaseMixin
Represent a MemeTracker quote in the database.
Attributes below are defined as class attributes or
cache
d methods, but they appear as instance attributes when you have an actual quote instance. For instance, if quote is aQuote
instance, quote.size will give you that instance’ssize
.Note that children
Url
s are stored directly inside this model through lists of url attributes, where a given url is defined by items at the same index in the various lists. This is an internal detail, and you should use theurls
attribute to directly get a list ofUrl
objects.-
add_url
(url)[source]¶ Add a
Url
to the quote.The change is not automatically saved. If you want to persist this to the database, you should do it inside a session and commit afterwards (e.g. using
session_scope()
).Parameters: url :
Url
The url to add to the quote.
Raises: SealedException
-
add_urls
(urls)[source]¶ Add a list of
Url
s to the quote.As for
add_url()
, the changes are not automatically saved. If you want to persist this to the database, you should do it inside a session and commit afterwards (e.g. usingsession_scope()
).Parameters: urls : list of
Url
sThe urls to add to the quote.
Raises: SealedException
-
cluster_id
¶ Parent cluster id.
-
filtered
¶ Boolean indicating whether this quote is part of the filtered (and kept) set of quotes or not.
-
format_copy
()[source]¶ Create a string representing the quote and all its children urls in a
cursor.copy_from()
or_copy()
call.
-
format_copy_columns
= ('id', 'cluster_id', 'sid', 'filtered', 'string', 'url_timestamps', 'url_frequencies', 'url_url_types', 'url_urls')¶ Tuple of column names that are used by
format_copy()
.
-
frequency
¶ Complete number of occurrences of the quote (i.e. counting url frequencies).
Look at
size
for a count that ignores url frequencies.
-
sid
¶ Id of the quote that originated this instance, i.e. the id as it appears in the MemeTracker data set.
-
size
¶ Number of urls in the quote.
Look at
frequency
for a count that takes url frequencies into account.
-
span
¶ Span of the quote (as a
timedelta
), from first to last occurrence.Raises: ValueError
If no urls are defined on the quote.
-
string
¶ Text of the quote.
-
substitutions_destination
¶ List of
Substitution
s for which this quote is the destination (this is a dynamic relationship on which you can run queries).
-
substitutions_source
¶ List of
Substitution
s for which this quote is the source (this is a dynamic relationship on which you can run queries).
List of TreeTagger POS tags of the tokens in the quote’s
string
.Raises: ValueError
If the quote’s
string
is None.
-
url_frequencies
¶ List of ints representing the frequencies of children urls (i.e. how many times the quote string appears at each url).
-
url_urls
¶ List of strs representing the URIs of the children urls.
-
-
exception
brainscopypaste.db.
SealedException
[source]¶ Bases:
Exception
Exception raised when trying to edit a model on which
cache
d methods have already been accessed.
-
class
brainscopypaste.db.
Substitution
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
,brainscopypaste.db.BaseMixin
,brainscopypaste.mine.SubstitutionValidatorMixin
,brainscopypaste.features.SubstitutionFeaturesMixin
Represent a substitution in the database from one
Quote
to another.A substitution is the replacement of a word from one quote (or a substring of that quote) in another quote. It is defined by a
source quote
, anoccurrence
of adestination quote
, theposition of a substring
in the source quote string, theposition of the replaced word
in that substring, and thesubstitution model
that detected the substitution in the data set.Attributes below are defined as class attributes or
cache
d methods, but they appear as instance attributes when you have an actual substitution instance. For instance, if substitution is aSubstitution
instance, substitution.tags will give you that instance’stags
.-
destination_id
¶ Id of the destination quote for the substitution.
-
lemmas
¶ Tuple of lemmas of the replaced and replacing words.
-
position
¶ Position of the replaced word in the substring of the source quote (which is also the position in the destination quote).
-
source_id
¶ Id of the source quote for the substitution.
-
start
¶ Index of the beginning of the substring in the source quote.
Tuple of TreeTagger POS tags of the replaced and replacing words.
-
tokens
¶ Tuple of the replaced and replacing words (the tokens here are the exact replaced and replacing words).
-
-
class
brainscopypaste.db.
Url
(timestamp, frequency, url_type, url, quote=None)[source]¶ Bases:
object
Represent a MemeTracker url in a
Quote
in the database.The url
occurrence
is defined below as acache
d method, but it appears as an instance attribute when you have an actual url instance. For instance, if url is aUrl
instance, url.occurrence will give you that url’soccurrence
.Note that
Url
s are stored directly insideQuote
instances, and don’t have a dedicated database table.Attributes
quote ( Quote
) Parent quote.timestamp ( datetime
) Time at which the url occurred.frequency (int) Number of times the quote string appears at this url. url_type ( url_type
) Type of this url.url (str) URI of this url.
-
brainscopypaste.db.
_copy
(string, table, columns)[source]¶ Execute a PostgreSQL COPY command.
COPY is one of the fastest methods to import data in bulk into PostgreSQL. This function executes this operation through the raw psycopg2
cursor
object.Parameters: string : file-like object
Contents of the data to import into the database, formatted for the COPY command (see PostgreSQL’s documentation for more details). Can be an
io.StringIO
if you don’t want to use a real file in the filesystem.table : str
Name of the table into which the data is imported.
columns : list of str
List of the column names encoded in the string parameter. When string is produced using
Quote.format_copy()
orCluster.format_copy()
you can use the correspondingQuote.format_copy_columns
orCluster.format_copy_columns
for this parameter.See also
-
brainscopypaste.db.
save_by_copy
(clusters, quotes)[source]¶ Import a list of clusters and a list of quotes into the database.
This function uses PostgreSQL’s COPY command to bulk import clusters and quotes, and prints its progress to stdout.
Parameters: clusters : list of
Cluster
sList of clusters to import in the database.
quotes : list of
Quote
sList of quotes to import in the database. Any clusters they reference should be in the clusters parameter.
See also
-
brainscopypaste.db.
url_type
= Enum('B', 'M', name='url_type', metadata=MetaData(bind=None))¶ sqlalchemy.types.Enum
of possible types ofUrl
s from the MemeTracker data set.