SQL/MySQL Module
Work Q27 and explain aggregates in GROUP BY versus SELECT (i.e. in select if you use aggregate operation you can only use aggregate operators; elsewise you’ll need to nest see Q27, or use a GROUP BY).
Work 5.2.6, 5.2.9 (working examples in my notes).
Assign them to carefully read and work examples in chapter 5 of Ramakrishnan. Especially review section 5.5 (aggregate operators, group by). Work examples in class. Important points to review:
CONSTRAINTS [SQL/MySQL module]
Ask them to list the ways you can enforce integrity: (think back to how you did in Access for ideas)
Understanding Integrity Constraints exercise. four groups (domain/type constraints, table check constraints, assertions, triggers).
talk about choice of which to use (review 5.9).
domain constraints:
distinct types
table constraints:
Use the CHECK on attributes, domains, and tuples if it meets conditions above, as it’s most efficient, and implemented directly by DBMS system. If you can’t, and need more powerful expression then go to ASSERTION or TRIGGER (Elmasri p257).
assertions:
triggers:
MySQL and constraints:
Access: doesn't support any of these
DB Indexing and Tuning
module.
Indexing.
Cover from 256 notes.
Primary topics are
Tuning from chapter 20. Introduce by using example of Internet bookstore (Barns & Nibbles). Do design and implementation process to make indexes as they see fit for the internet bookstore. Use the ER diagram (tables) from figure 2.20 page 50 in Ramakrishnan. Work and discuss
Query optimization quick summary of heuristics commonly used:
Related to Looking up orders by customer ID query: Several months later customer called in and said “Clients complain that it’s too slow to respond to what is the status of their orders”. (solution: separate completed orders from live orders tables so you can respond quickly to the smaller “live/outstanding” orders table).
Tell them a good resource is MySQL manual 7.4.5 How MYSQL Uses Indexes, and the reader comments are informative too.
GUIDELINES
How do we go about figuring out what index choices to make???
Book provides great guidelines. However, it implicitly also assumes you’re taking into account query optimization. See Ramakrishnan guidelines in 20.1, 20.2. We’ll talk about additional assumptions, and show by working through examples and explainations. We’ll also cover some examples of how to do in MySQL.
Generate workload description:
For each query in the workload, we must identify
Similar for updates.
Choice of Index:
Choice of conceptual schema:
Index Choice
Guidelines
Whether to index? Only if some query or update would benefit. Choices indexes that speed up more than one query.
Choice of Search Key
Multi-attribute Search Keys Consider when
Whether to Cluster? At most one index on a given relation can be clustered.
Hash versus B+ tree B+Tree is generally better (supports range and equality). Use hash when:
Impact of Index Maintenance draw up wish list, consider impacts
If maintaining slows down updates, consider dropping.
Keep in mind that index can speed up updates
Example 1: page 656
SELECT E.ename, D.mgr
FROM Employees E. Dept D
WHERE D.dname = ‘Toy’ AND E.dno = D.dno
Walk through using the guidelines.
Relations (attributes) involved: Employees (dname, dno), Dept (dno, mgr). Dept.mgr only for reporting, rest for selection. So we have E.dname, E.dno, D.dno. Think about query processing; what does it tell us? Most restrictive are WHERE conditions (D.dname = ‘Toy’, then join conditions E.dno = D.dno). So most important would be index on D.dname. Since equality choose to make Hash index on D.dname. What about E.dno and D.dno? Nothing is gained by indexing D.dno since you already have indexed and retrieved matching records by D.dname index. You just need to match these against E.dno. So index on E.dno, again using hash index since equality.
Example 2: page 656
SELECT E.ename, D.mgr
FROM Employees E. Dept D
WHERE D.dname = ‘Toy’ AND E.dno = D.dno AND E.age =25
Multiple choices: We still do hash index on D.dname for same reasons. But now do we do join with E first, then selection on E.age or do we do selection on E.age=25 and then join? It would help to know how restrictive E.age=25 is. If very small then maybe do this first, then join. Also, if for another query we want to have index on E.age, then use it and do join last. If neither of these is true then maybe better to do same has index on E.dno to do join first, and the WHERE condition of E.age=25 on the fly (as you look at the join records).
Example 3: page 657
SELECT E.ename, D.dname
FROM Employees E. Dept D
WHERE E.sal BETWEEN 10000 AND 20000
AND E.dno = D.dno AND E.hobby=’STAMPS’
Hash D.dno (only D attribute). Which of E to do? Sal, hobby, dno. Best to choose to index by one of restrictive Sal or hobby, then will have E.dno as part of record, which you then compare to inner loop selection of D.dno which should be hashed for fast access since equality condition. May not be able to tell which condition is more restrictive, Salary or hobby. Use estimations, or better yet, database statistics after it’s been running for a while. If this query is very important, then maybe best to build both, so that database query optimizer can make best choice given the two access mechanisms. How could we build both (best individual choice for hobby would be hash because of equality; best for salary is likely clustered ordered index file because of range condition E.sal BETWEEN 10000 AND 20000. Can these two co-exist? No, they require two physically different structures (hash, ordered by sal). Instead could keep ordered by sal (physical file), and then have secondary index on hobby. So choose Hash Dept on dno, and clustered B+tree (physical ordering) on E.sal (range) and secondary index B+ tree on E.hobby.
Example 4: how to save index space while keeping same
(similar speed up) for when you can use just first part of longer character
string
Normal MySQL CREATE INDEX
CREATE INDEX index_x ON table_name (attribute);
CREATE INDEX INDEX_subject ON metadata(subject);
Using shortened index….
CREATE INDEX part_of_name ON table_name (attribute(20)); uses only first 20 char
CREATE INDEX shortname ON Employee (lastname(10)); uses only first 10 char
Example 5: Co-clustering
SELECT P.id, A.componentID
FROM Parts P, Assembly A
WHERE P.pid=Apartid AND P.cost=10
Suppose many parts have cost 10. Then best situation is to have index on Parts.cost, and be able to find the matching assembly records. This can be done by co-clustering in the database file itself the assembly records with the corresponding parts records having the same PartID. Then we can search out P by index on cost (in this problem), and once there we automatically have the assembly parts. This saves 2-3 index page I/O operations. So if this was critical action it might be worth it.
|
P.pid=5 |
P.pname=wheel |
P.cost=9 |
P.supplierid = 15 |
|
A.partid=5 |
A.componentid=18 |
A.quantity=2 |
|
|
A.partid=5 |
A.componentid=19 |
A.quantity=1 |
|
|
P.pid=6 |
P.pname=door_handle |
P.cost=18 |
P.supplierid = 19 |
|
A.partid=6 |
A.componentid=229 |
A.quantity=3 |
|
|
A.partid=6 |
A.componentid=10 |
A.quantity=12 |
|
|
A.partid=6 |
A.componentid=18 |
A.quantity=2 |
|
|
A.partid=6 |
A.componentid=11 |
A.quantity=1 |
|
|
P.pid=7 |
P.pname=light |
P.cost=180 |
P.supplierid = 10 |
|
A.partid=7 |
A.componentid=30 |
A.quantity=1 |
|
Example 6: Index only
The beauty of index only solutions are that they save you from going out to the actual database file, by being able to calculate the results (partial or final) from just the index itself.
Ask them to very efficiently report all the parts made by a given supplier X.
SELECT P.pid
FROM Parts P
WHERE P.supplierid = X
If we make an index on Parts of (supplierid,pid) then we can answer this very efficiently (binary scan to find supplierid, the just report the list of values from pid fields from the index itself!
|
SupplierID |
Pid |
Pointer_to_Memory |
|
…. |
|
272834928 |
|
7 |
10 |
293874248 |
|
7 |
16 |
293874248 |
|
8 |
12 |
224324248 |
|
8 |
16 |
224324248 |
|
8 |
18 |
224324248 |
|
9 |
5 |
224324248 |
|
…. |
|
26838389 |
The catch is that you need to think carefully about what needs to be in the index to do index only solutions. For instance, you can’t do with Hash (no physical index). You must have all the values (dense) to search over all values, or do aggregate calculations over the values.
A common index only type example is when you have two tables, where you do joins across two indexes.
List the dept number
and manger’s name for all departments
SELECT D.dno, E.ename
FROM Departments D, Employees E
WHERE D.managerid=E.eid
How do we process this query? The department table probably fits in one disk page so is easily completely retrieved. We just need to step through it, and then find matching employee record for the manager of the department. Can do with index only check of index on E.eid. So index Employees on eid, then can test this as inner loop join matching the D.mangerid value of records as you step through the dept records. But, we need to report ename as well. So we would need ename from the Employee complete table on disk. If we want to do this as fast as possible we could avoid going out to disk completely by having the index of (eid,ename) on E, and index (dno,mangerid) on D. We can then do the join comparison across the two index files (without going to the database itself), and report D.dno and E.ename from the matched index records. Since we’re looping through the index file records for both D and E, we have to have index files not hash tables.
List the maximum
salary of employees at each age (…,28, 29,30, ..)
SELECT MAX(E.salary)
FROM Employees E
GROUP BY E.age
Index dense (to get all values) by E(age,salary). Then index only (no disk access), and very efficient as just have to scan the records in a row at same age.
Note another related speed-up is because even if you cannot do index only solution, you may be able to reduce disk accesses because you can figure out which “unique” disk blocks are needed by looking at index only (and just retrieve a disk block once, even though you might have multiple records in single block).
Talk about Index wizards. Had because large set of possibilities. Hard to determine optimal mix. Index advisor are common on commercial products (DB2, Microsoft SCQL ones are mentioned in book). Estimate candidate indexes, then calculate efficiencies based on Database schema and queries.
For fulltext searching: MySQL index type “FULLTEXT”. Then use MATCH() function to generate relevance ranked full text matches.
Example
CREATE FULLTEXT INDEX FT_INDEX_subject ON metadata(subject);
Note: this will take a number of minutes!!!
Index Types (MySQL):
PRIMARY KEY, UNIQUE, FULLTEXT: B-trees
MEMORY: hash
SPATIAL column types : R-trees
Overview of Database
Tuning
Reasons to Tune: usage of the DB changes, have more information (statistics) available, or queries are slower than expected (request DB show plan for query optimization, and it shows not using index as expected).
Tuning Indexes:
Used to be separate optimizers from database engine (that optimized queries). This was problematic because the need to be tightly integrated. Recently, they have become integrated (new releases come out together, and Index Tuner uses information from DB engine query optimizer).
Tuners must choose “best” indexes from very large set of possible indexes. Relatively impossible (not cost efficient) to evaluate all of them. So use information from optimizer, knowledge of queries (logs showing most often used queries, etc). Generate candidate indexes (SQL code to generate them), which the DB admin may accept or manually select from. Examples in book are DB2 Index Advisor, and MicroSoft SQL Server 2000. There are some 3rd party solutions for MySQL (MyProfiler).
Tuning conceptual schema: Schema may not be optimal given usage, changes. Decomposition types:
Vertical Partitioning: reduce normalization, i.e. settle for 3NF instead of BCNF, or maybe not even 3NF. Say add fields to certain relations to speed up queries.
Horizontal Partitioning: two relations with same schema (say orders table from Barns and Nibbles, with “completed” and the “outstanding” orders).
Tuning queries: Rewrite query to speed up.
For additional practical details on optimization with MySQL databases, see the optimization chapter in the MySQL documentation.
Database Application
Development and Internet Applications module
Discuss working with genomic database system, which has many of the processes (nucleotide sequence matching, similarity comparisons) available in programming language, but you cannot accomplish in SQL. But the main databases (nucleotide sequences, annotations, etc) are in Oracle. How do you interface the two? Your boss asks you to choose between the options of embedded SQL, dynamic SQL, or this new thing JDBC? And will cursors be necessary??
Important to combine formal databases (optimized for storage, integrity, SQL queries with additional power of programming languages). How do you achieve this:
Embedded SQL
SQL commands are “embedded” into programming language (C, Java). A DBMS specific preprocessor transforms the SQL commands into the host language (followed by the language processor). Thus, the resulting executable code works only for that DBMS specific machine, although it could be recompiled for other machines. A more system independent solution is ODBC and JDBC (see below).
Example in C
EXEC SQL
INSERT INTO Sailor VALUES (:c_sname, :c_sid, :c_rating, c_age);
:variable name to refer to variable from host program (see p188 for how to declare them to be accessible to SQL commands.
Besides mechanisms for handling access to and definition of variables, there also needs to be handling of error conditions SQLCODE and SQLSTATE (newer). This variable allows the return of error conditions in standard format.
Dynamic SQL
Prepare a SQL command in programming language and then execute it. Example
Char command[] = {“DELETE FROM Sailors WHERE rating > 5”}; /* creates cmd */
EXEC SQL PREPARE readytogo FROM :c_sqlstring; /* parses and compiles */
EXEC SQL EXECUTE readytogo; /* executes */
/* more realistic example, where we use dynamic values to create dynamic command */
sprintf(command_string, “DELETE FROM Sailors WHERE rating > %d”, rating);
Disadvantage: Dynamic SQL has runtime overhead versus Embedded SQL which is prepared (compiled).
Cursors
Needed because of impedance mismatch: SQL operates on sets of records, while languages like C do not cleanly support mechanisms for operating on sets of records. (open a set, fetch next record, move to specific record position, close a set). A cursor can be thought of as ‘pointing’ to a row in the collection of answers to the query associated with it.
DECLARE sinfo CURSOR FOR
SELECT S.sname, S.age
FROM Sailors S
WHERE S.rating > :c_minrating;
OPEN sinfo;
While (1) {
FETCH sinfo INTO :c_sname, :c_age;
/* check SQLSTATE = NO_DATA exit loop when true (i.e. no more data) */
}
ODBC Open Database Connectivity
JDBC Java Database Connectivity
Both expose generic database capabilities through an Application Programming Interface (API). This allows a single executable program to connect to different DBMS and systems, and to have multiple connections to the same or different DBMS active at once. This is in contrast to embedded SQL which is tied to specific programming language, and recompiling would be required to connect to different DBMS.
SQLJ (SQL JAVA)
Similar to JDBC, but more like embedded SQL. SQL J code is easier to read than JDBC. Important difference: vendors provide their own proprietary versions of SQL. To be independent of vendors it’s important to follow SQL-92 or SQL-99 standards. Using Embedded SQL allows the programmer to use vendor specific constructs (which means less independent of platform), while using SQLJ and JDBC for adherence to the standards, resulting in the code being more portable across platforms.
Stored Procedures
Defining a named procedure (programming code module that accomplishes some specific task, for instance comparing two nucleotide sequences and returning a similarity score), that resides on the database server and can be executed there. Advantages: more efficient than moving large dataset from DB server to client to do processing, especially when the rest of the interactions are on the server, and the server is probably handles large datasets more efficiently. Also, once the procedure is defined, other users can take advantage of it. Leads to development of specialized classes (datatypes, operations) for things like video, genetics, etc.
Chapter 7 Internet Applications
HTTP
HTML/XML/SGML
1 tier vs 2 tier vs 3 tier
Advantages of 3 tier system:
Heterogeneous systems:
utilize strengths of different packages, easy to replace modify
Thin clients: only
need minimal computation power (typical client is web browser)
Integrated data
access: access data from many
different sources transparently at middle tier.
Scalability to many clients: thin clients = low overhead, middle tier used shared/pooled resources
Software development: modularizes the code development, uses standard interfaces, APIs.
Presentation Layer:
Technologies: HTML forms, JavaScript, Style Sheets
HTML forms:
Communication between presentation and application tiers: GET (form contents goes in query URI, and is visible to user, and bookmarkable) or POST (form contents transferred privately in separate data block).
action?name=value1&name2=value2?name3=value3 /* if no action default to name */
page.jsp?username=John+Doe&password=secret
Javascript:
Scripting language at client tier with which we can add programs to webpages that run directly on the client.
Browser detection: can customize actions based on browser type
Form Validation: can perform consistency checks on the data
Browser control: ability to open pages in customized windows (i.e. annoying pop-ups)
Style Sheets: ways to adapt the same content for different presentations formats (think PDA, versus laptop, versus wall projector, B&W versus color display).
CSS: Cascading Style Sheets
XSL: (XML based version, more powerful), contains
XSLT (XSL Transformation)
XPATH (XML Path)
Important IDEA: separation of Content from Presentation (XML vs HTML, CSS, XSL)
Style sheets also separate the transformative aspects of the page from the rendering.
Middle Tier:
Initially general purpose programs in C, C++, Perl. Evolved to “application servers” to provide generic powerful resources, which scale well.
CGI: (Common Gateway Interface) defines how data gets from client form to middle level.
Application servers:
Draw picture from 7.16 and 7.17 page 254, to illustrate difference. (overhead with multiple CGI instances versus pool of servlets in application server).
Servlets: Java servlets are code (Java) in middle tier, ie.er webservers or application servers.
JavaServerPages (JSP): JavaServer Pages are written in HTML with servlet-like code embedded with special HTML tags. JSPs are better suited for quick customizations, interfaces, as compared with servlets which can handle more complex application logic.
Management of multiple sessions at tier levels to avoid overhead
Many languages used for web programming including
Maintaining State:
Middle/Lower tier: for data that needs to persist over multiple sessions; can do in memory (volatile), or better in most cases files at middle tier, or database records at database server tier. Examples: past customer orders (DB), click stream data recording of users movements (middle), personalized layout, etc).
Presentation tier: HTTP is stateless, so use Cookies for maintaining information (name,value pairs). Cookies are non-essential, persist across sessions because of cache, and are transferred to middle tier every session.
Internet book shop (exercise 5). Work through in class. Draft an implementation using a three tier architecture to support searching books, adding selections to a shopping basket, purchasing items in shopping basket, and giving recommendations when they bring up a selection in the search form.
XML:
Introduce XML and various concepts. Review a little of the online quiz exercise,
in particular the examples of XML code and CSS.
Show them
Define X* including…
XML
XSL
XSLT
XSL-FO
XPath
XLink
XPointer
Colliding worlds concept from Ramakrishnan (Databases, Information Retrieval, with XML helping bridge the gap). Why are they colliding now? The Web! Historically databases have been structured data managed by large application in controlled environment. Information retrieval has been similar to this in that it has involved searched a defined corpus (set of documents, perhaps of business information system). These two were mostly separate. Then along comes the web and… information searching expands to encompass diverse, separate collections searched as one (i.e. search engines on the web). Databases go online, and we see federation of many independent ones, and connections from the information retrieval systems and search engines.
Searching in IR system was usually by librarian types. Database systems also by experienced users (or more novice users through simplified interfaces). With the Web, searches are usually through very simple interfaces, and apply to wide variety of resources.
What are main
differences between Databases and IR?
Searches vs Queries: searches in IR, formal Queries in Databases. IR search results based on search terms, may be ranked; DB results are exact answer to query, and no ranking.
Updates and Transactions: in DB only.
How does XML fit in? XML came about because of the need to marry Web based interactions with data in databases. HTML is sufficient to define displayable documents, but lacks the structure necessary to represent and exchange information with a Database. So XML bridges the gap. XML is derived from SGML (by librarian types for describing document collections). XML is simple, yet powerful version for the masses, i.e. the Web. Databases always had metadata describing their content. Most IR systems did. Now the web is getting this too through XML. Thus documents may contain information about the author, descriptions of content, and relationships to other content. XML also separate content from presentation better than HTML, which mixes this.
HTML example: show how it defines display of content, but not type of content.
XML example: show how it defines the type of content; separately is a mapping of content type to presentation attributes. This allows quick and convenient changing of presentation (changing colors and bolding), and also transformation changes (to display on cellphone instead of desktop monitor).
Structured data versus semi-structured. In a structured environment (database), there is a predefined schema which the data objects must conform to. Schema and data content are in separate data structures. In a semi-structured environment, schema and data content can be mixed, and the schema information thus may be continually revised as more is read in (i.e. not fully defined before hand). This difference equates to knowing beforehand, or making the effort to define a model that represents all the content you’ll be describing. So if you’ll have a collection of documents, you’ll say before hand what the document types are and their attributes, and which are mandatory, etc. for a structured environment. For a semi-structured environment, you would define them as you go, say adding new types (web pages), and different instances of the same type (say book citations, could contain subsets of attributes of others).
XML DTD, Schema contrast simple definitions of DTD and Schema with ER diagram notation. Use company example from pages 848-854 of Elmasri.
Is XML a Database?
Discuss. Summarize results on board.
Bourett’s XML & Databases web pages has a great discussion. Answer is….
Formally yes (collection of data).
Informally, sort of.
XML DOES HAVE
XML DOES NOT HAVE
Ask them to list
examples where they would use formal DB system versus use XML.
Choice: which to use? Data centric vs Document Centric
Data centric:
Examples: all the things you’ve been using databases for J.
Document centric:
Examples: books, email, advertisements, most any handwritten XHTML document
The division between the two is increasingly blurring as DBs commonly include support for XML output, or native storage of XML documents. Similarly, XML supports storage of document fragments in external (relational) DBs.
Approaches to Storing XML documents (Elmasri p855)
XML Schemas, DTDs versus DBMS schema. First, let’s understand XML documents and XML Schemas better. Review XML Schema on w3cshools.com website and examples I put online in our class 157 xml directory, in particular the shiporder example (done in three versions) at the end of the XML schema section (in w3school and on 157 xml web directory), and the university order (primary key and foreign key definitions).
How do we decide how to construct the schema (XSD) file? From ER diagram, relational database schema. Then cover Elmarsi p850-852 to look at different representations possible for same ER diagram.
What XSD constructs do we use for
cardinality? Multivalued attributes (minOccurs, maxOccurs), Elmasri p 850-854
PK: key (Elmasri p 850-854, xml/university2 example)
FK: keyref (Elmasri p 850-854, xml/university2 example)
Union: either or (different ways, choice, or in regular expression)
Constraints on values/domains: Elmasri p 850-854, w3schools
Value Types: date, datetime, time, gDay, positiveInteger, decimal, boolean, binary
Restrictions: limit value range, specific format, enumerated values (w3school: restrictions)
Draw ER diagram for
course, section, student on board.




Now let’s look at a practical example of how the same thing
could be represented in XML Schema versus DBMS by looking at tree/ER
diagrams. Work through example in
Elmasri p856-862, for #4 below, to show conversion from relational database to
XML for simple (graph from ER diagram can be represented as tree based) and
complex (cycles in graph, which must be broken out).
Storage: When would we want to store our information in a native XML database?
<Let class answer>
Note that these are more use only if….conditions. In general I recommend relational database for most stuff that is datacentric or close to it; and using tools to directly support XML from the relational database.
As the lines between the two blur, how you convert between the two becomes increasingly important. There are two ways to map between XML and relational DB: (summarized from Bourret online readings)
Table-Based Mapping
Models the XML documents as single tables or set of tables. Used by middleware products. Can be simply generated from relational database. Requires the XML documents to have the form
<table>
<row>
<column1> </column1>
….
<columnN> </columnN>
</row>
<row> … </row>
</table>
<table> … </table>
Does this look familiar? Yes! Because it is exactly what mysqldump –X produces (Exercise 7). This is simple and works well to dump out a regular structured object from a relational database. It does add artificial construct of “rows” as containers for the records in each table. The main drawback though, is that the XML documents have to have this structured format. If two relational DB systems are sharing documents this way it works fine (although maybe we could have used more formalized mechanism like ODBC then). The problem is for handling XML based documents to store in a relational database. It is unlikely that the existing XML documents have this structure, or are easily transformed to it. Thus, there is a need to have more general solution to handle document-centric XML documents to be exchanged with relational database.
Object-Relational Mapping:
Used by all XML enabled relational databases, and some middleware products. Conceptually this allows things to map a little more appropriately (less artificially regularized than the table based maping). In general, “entity like” complex objects (XML element types with attributes, element content, or mixed content) are modeled as classes and mapped to relational database “tables”. XML “simple” things (element types with PCDATA only content (simple element types), attributes, and PCDATA) are modeled as scalar properties, which map to Relational database “attributes”. Overall
|
XML |
Object Oriented |
Relational Database |
|
Complex Element Types |
Classes |
Tables |
|
Simple Types |
Scalar properties |
attributes (columns) |
|
Key definitions |
value properties |
PK/FK pairs |
(things on the same row map to each other)
For example in the figure 26.9 above, the complex elements course, section, student would each map to object “classes” and then to relational database “tables”. (Mapping is generally XML Schema/DTD to object classes then to relational database tables; although many real-world applications compress this into a single mapping). We will NOT cover the details of this mapping because of time constraints. How this is supported by vendors varies (many different ways to implement the mappings). These details are well covered in the Bourett readings (section 3 of Mapping DTDs to Databases”). Because of several factors (sequence order not preserved in relational DB, different choices in how to implement these mappings, etc. there it not a one-to-one mapping from XML document to relational database objects. So if I send my XML documents into a relational DB, and then retrieve them, will they be the same? No, not necessarily. This isn’t generally a problem for datacentric documents, but may be for document-centric ones.
Object Model is NOT the same as DOM (Document Object Model).
DOM models the document itself, and is the same for all XML documents. (general)
Object Relational Model models data in the document and is different for each set of XML documents that conform to a given XML schema. (specific to instance).
XSLT
Use example from w3schools. Cover XSL Languages through XSLT for-each (including filtering). Have them play with modifying to make their own CD catalog version during class (if time).
What we cover here with the CD examples gives them the necessary background to do Exercise 7 (XSL file to produce output from XML dump of their mysql files from assignment 1).
Emphasize the role of XSL (XSLT to transform). We are showing the transformation mainly for the purpose of producing output; we could equally well be transforming for consumption by another agent or ingestion into a different database system. Note there is some overlap here with querying. I.e. one could use XSLT to select, filter, sort from database (like our db1_10 mysql dump). However, better way (more robust, powerful) would be to do this as queries, either in relational database system itself, or via XML query languages as we discuss in next section.
(Additional examples (no better) from Nutshell book (p140-145) but generally using w3schools examples now since freely available to students on web).
XSL-FO: formatting objects.
Allows generalized formatting for presentation. Basically, give template, and content. Template describes layout in terms of rectangular boxes on screens, with margins, sizing, flow between them.
<Draw Fig 13-1, p219 in XML Schema book>
Types of objects supported: block areas, inline areas, line areas, glyph areas.
CSS vs XSL-FO: read summary from p232 in XML Schema book.
On Assignment 5, point out potential good practice of breaking phone number into
<phone>
<areaCode>919</areacode>
<phonePrefix>555</phonePrefix>
<phoneSuffix>1234</phoneSuffix>
</phone>
Reason is that this separates content from presentation. No reason to store “(“, “)”, “-“ in XML if we are truly just representing content. Let XSLT do transformation to present properly, and UI to input properly.
Query Languages:
Relational Database Query Languages:
Proprietary languages: initial implementations, expected to be subsumed by
SQL/XML (2004 standard) (For Students-- find out current status, and products that support this).
SELECT
c.CustId,
XMLELEMENT(NAME "customer",
XMLATTRIBUTES(c.CustId AS "id"),
XMLELEMENT(NAME "name",c.Name),
XMLELEMENT(NAME "city",c.City)) AS "CustInfo"
FROM Customers c
XML query languages: can be used over any XML document. To be used with relational DB, relational DB must be modeled as XML (i.e. mapped via table based or object-relational).
· XQuery. Can use either a table-based mapping or object-relational mapping.
Table-based mapping: each table is treated as a separate document and joins between tables (documents) are specified in the query itself, as in SQL.
Object based: hierarchies of tables are treated as a single document and joins are specified in the mapping.
· XPath: an object-relational mapping must be used to do queries across more than one table because XPath does not support joins across documents. (else it would be possible to query only one table at a time).
Data Mining:
Exploratory data analysis
Knowledge discovery
Machine learning
Trying to automate the analysis of large sets of possibly correlated data.
Output can be exported in standard XML format call Predicative Markup Language (PMML).
Knowledge Discover Process:
Counting Co-Occurences
Market basket: beer diapers story.
Itemset: set of items
“support”: traction of transactions that contain all items in item set.
(70% support of {beer, diapers}).
Iceberg Queries:
Way to speed up queries with HAVING clauses by evaluating subparts. Example:
SELECT P.custid, P.item, SUM(P.qty)
FROM Purchases P
GORUP BY P.custid, P.item
HAVING SUM (P.qty) > 5
SELECT R.a, R.b, …, aggregate(R.n)
FROM Purchases R
GORUP BY R.i, R.j
HAVING AGGREGRATE (R.n) …..
Evaluate SUM(P.qty) on P.custid first, and SUM(P.qty) on P.item. Useful when you cannot fit entire relation (P) in memory to do running count. Then can be faster to limit by subset conditions, i.e. only parts requested > 5, or purchases with at least 5 items.
Association Rules:
Diapers => Beer (probability of Beer given Diapers).
ISA hierarchy, category hierarchy:
Suppose we can to test diapers => beverage. Then want to replace Beer by general category (beverage). I.e. beer ISA beverage.
Sequences: Patterns over time. Cereal, Cereal, Milk.
Sequence matching: can compute distance of records in the database from a given record. More efficient to to use indexes to speed up, where you index on attributes of interest.
Classification Trees: group by trees into hierarchial sets.
Clustering: Partioning a set into groups such that records in a group are similar to each more than records between groups (by some measure).
Data Mining Algorithms:
Commercial DataMining Systems:
SAS Enterprise Miner, SPS Clementine, IBM’s Intelligent Miner, MicroSoft’s SQL Server 200 Analysis Sever. See full list on Elmarsi p893.
SQL/MM: Data Mining extension of the SQL:1999 standard supports four kinds of data mining models
New datatypes supported included
Model: DM_RegressionModel, DM_CLusteringModel
Input parameters for DM algorithms: (DM_RegTask, DM_CLusTask),
Describe the input data (DM_LogicalDataSpec, DM_MinniData)
Represent the result: (DM_RegResults, CM_ClusResult)
Data Warehousing:
Subject oriented
Integrated
Non-volatile
Time-variant
Different from transaction based DBs:Motivation is for high level organizational level decision making (DW) versus detail oriented transaction processing (relational DB).
Reasons to have separate system (shortcomings of relational DBs) (Ramankrishnan p848)
Three broad classes of tools support this
Functionally data warehouses have the following
distinctive characterstics
Multi-dimensional conceptual view
Generic dimensionality
Unlimited dimensions and aggregation levels
Unrestricted cross-dimensional operations
Dynamic sparse matrix handling
Client server architecture
Multi-user support
Accessibility
Transparency
Intuitive data manipulation
Consistent reporting performance
Flexible reporting
Although most are centralized, there is an increasing number of Distributed Warehouses (Federated Warehouses).
Multi-dimensional models (p 903 Elmasri) product X region X Quarter
Pivoting (also called rotation)
Interactive querying is assumed.
Roll-up display: roll up on products of certain type, months to quarters to years
Drill-down display: reverse
Dimension Table: types of attributes of that dimension; coverage of range of dimension, for instance times (days, holidays, weeks, months, years, decades…).
Fact table: tuples, one per fact. Contains raw data, and pointer to dimension table
Multi-dimensional schemas: star, snowflake (draw diagrams from 905 Elmasri). Star is more common because not much reason to normalize the dimension tables, while it is big advantage to keep data together for efficiency in query response times.
Building a data warehouse:
Additional properties in SQL99:
DataWarehouses are similar in ways to views in relational databases. How are they different?
Materialized Views:
In between view and table
Materialized means stored and can create index on it; but have to update to be accurate when changes in underlying table occur.
Indexing:
Different indexes are often used for data warehouses. In particular bitmap indexes so that WHERE conditions can be done very efficiently. For instance gender, rating are attributes for person. Store as bitmap, two bits for M/F, four bits for four rating levels. Then can do bit wise comparisons.
Current research areas in Data Warehousing is mostly aimed at automated areas where lots of manual intervention and expertise is required:
Data acquisition
Data quality management
Selection and construction of appropriate access paths and structures
Self-maintainability
Functionality
Performance optimization
Data warehousing support:
Mostly done with big production environments (Oracle,
Microsoft Server 2000). However,
increasing number of MySQL servers used for warehousing, and increasing number
of packages supporting this. Google
“MySQL data warehousing” and look at
links (OReilly, Golden, Radwin, listservs).
Most recommend Star configuration; use views, batch loading, rollup
support.
Non Text Databases:
“Non-text” databases is misnomer. What are currently defined types? Let class answer
Text: char strings of different lengths; very long character objects (Access “memos”)
Numeric: Integer, Float, …
DateTime:
So we really mean Databases that can store, search, retrieve types (and possibly process) objects types not normally included as default types in current databases.
What types might we want to have definitions for? Let class answer
Why would we want to have these as defined types in our database? Why not just store as BLOBs in database, or as files in file system with references to them in database?
Challenge: what datatypes should additionally be defined, and what operations should be defined on them? Think about images for instance? What format do we store? What operations are defined (greater than? Less than? Equal to? Brighter/Darker? Contains? Contains what? (subimage, feature).
Database vendors must weigh whether market demand requires
Do exercise 8 to think about datatypes and operations.
What levels are operators are defined on (entire image, subimage, feature, pixel)?
What operators?
Cover an area not chosen by the students for their presentations, and use it to more fully cover issues with non-text databases.
Genomics:
What makes genomics different? (From Elmasri p938-939)
Where is this going?
Oracle and others are putting lots of effort into supporting non-text databases. Adding new datatypes, or mechanisms for user defined datatypes and operations on them. Example is BLAST searches. Have datatypes of nucleotide sequence. Add ability to do “matches”, which implies substring match, but with “close” matching where there are errors or missing parts in sequences (with ranking based on biological likelihood). Support now exists for multimedia, flexible text objects. For instance read article (google Oracle genome support, Cracking the code of life, http://www.oracle.com/technology/oramag/oracle/03-jan/o13science.html).
Gene Ontologies: done in MySQL and open source tools; exported in XML or SQL.
Short history review:
1970s: hierachial and network databases are the rule. Databases are large company supported resources, accessed via fixed point terminal connections, and structured interfaces.
1980s: emergence of relational databases. Birth and rapid growth of workstations, personal computers, beginnings of individual databases.
1990s: realization that relational databases didn’t meet all needs, extensions to relational databases, object oriented databases, data mining, etc.
2000s: the web is the default for most everything. Databases are connected to the web and to each other, and accessible by the masses. Most all information in contained in some sort of “database”, interaction is through web forms (yellow pages, travel reservations, stores, blogs, classes). Access to a database is essentially free for most people (Microsoft Access, MySQL). Three tier systems: intelligence at client (visualization), processing done mostly at middle tier (application level), data storage is on server tier. Relational databases still dominate, in part because vendors increasing support additional capabilities (object oriented, XML, data mining, data warehousing, beyond text (audio,video, etc)). Interfaces extending from standard computers to PDAs, cellphones, etc.