Wolfgang Pauli Institute (WPI)

	Wolfgang Pauli Institute (WPI) Vienna
Home	WPI in a nutshell	Practical Information	Events	People	WPI Projects
Login	Thematic Programs	Pauli Fellows	Talks	Research Groups	Math AI/ML @ WPI

Semi-Structured Data (2004)

Organizers: Georg Gottlob (Vienna)

Talks

Gyorgy Turan	Inst. of Informations Systems, TU Wien, Favoritenstr.9-11, Seminarroom 1842, 3rd floor out of the elevator turn left, through the corridor, then turn right)	Wed, 20. Oct 04, 16:15
"Learnability and definability in trees and similar structures"
he relationship between the expressiveness of a logic and the complexity of the associated computational problems is a much studied topic in logic and computer science. We discuss a related issue: the learnability aspect of the different logic formalisms, motivated by the fact that logic is a basic framework for machine learning (e.g., inductive logic programming). We study combinatorial parameters of the concept classes of definable sets in finite structures, corresponding to PAC and query learning: the Vapnik-Chervonenkis (VC) dimension and the strong consistency dimension. For example, positive results are given for the VC-dimension for monadic second order (MSO) logic over structures of bounded clique-width, resp., for the strong consistency dimension for MSO over trees. The proofs are based on bounds for related definability problems for tree automata.
Note: This is joint work with Martin Grohe.
Thematic program: Semi-Structured Data (2004)

Gyorgy Turan	Inst. of Informations Systems, TU Wien, Favoritenstr.9-11, Seminarroom 1842, 3rd floor out of the elevator turn left, through the corridor, then turn right)	Thu, 21. Oct 04, 11:30
"On complexity problems for disjunctive normal forms"
We discuss some recent results on DNF. 1) The DNF exception problem: How much can the DNF size go up if it is `revised\' by changing some true points to false? 2) Extremal functions for the number of prime implicants: It is known that a k-term DNF can have at most 2^k - 1 prime implicants and this bound is sharp. We give an explicit description of all functions having this many prime implicants (as functions represented by a kind of decision tree). 3) Superpolynomial separation between disjoint (CNF,DNF) size and decision tree size: A DNF is disjoint if every two terms contain a complementary pair of literals. We also mention some related results which contribute to a comprehensive picture of the relationships between the different complexities of disjoint and general DNF and decision trees in the search tree, resp., decision tree models.
Note: Joint work with Dhruv Mubayi, Bob Sloan, Balazs Szorenyi and Yi Zhao.
Thematic program: Semi-Structured Data (2004)

Gio Wiederhold, Stanford University and MITRE Corporation	Inst. of Informations Systems, TU Wien, Favoritenstr.9-11, Seminarroom 1842, 3rd floor out of the elevator turn left, through the corridor, then turn right)	Fri, 22. Oct 04, 9:15
"What is Your Software Worth?"
I present a method for valuing the intellectual property inherent in software, While we, as software creators, believe that what we produce is valuable, we are rarely called upon to quantify its benefits. When benefits for commerce must be quantified it is left to lawyers, economists, software vendors, or promoters to assign value to our products. The results are often inconsistent. The approach applies well-known principles of intellectual property (IP) valuation, namely that its value is the income that use of that software is expected to generate in the future. Included are sales expectations, discounting to present value, and the like, always focusing on the specific issues that arise when the benefits of software are to be analyzed. An important issue, not dealt with in the literature of valuing intangibles, is that software is continually being upgraded. By focusing on ongoing maintenance we can replace simplistic depreciation rules. Some software engineering rules can be applied. I\'ll briefly present all steps of the process and integrate them via a simple quantitative example. Some conclusions are drawn that reflect on lacunae in academic and business practice. This work is fairly novel, and motivated by the increasing amount of cross-border transfer of intangible, not to say slithery, goods, which represent an ever-increasing portion of our economic world.
Thematic program: Semi-Structured Data (2004)

Tamir Hassan (University of Warwick)	TU Vienna, Institute of Informationssystems, Seminarroom 184/2 (DBAI), Favoritenstraße 9-11 / 1842, 3rd floor	Fri, 12. Nov 04, 13:30
Data Extraction from poorly structured formats -- PDF to HTML Conversion
This talk presents my project on PDF to HMTL Conversion, which was undertaken during my third year of study at Warwick University. Unlike similar converters on the market, it does not attempt to reconstruct the original layout and appearance of the page. Instead, it generates a correctly structured HTML documant from which the content can easily be lifted and re-used.; The resulting program can cope with various page structures, including columns, and can therefore generate good results with fairly complicated pages. Certain structures, such as tables, are not yet understood by the converter, and suggestions will be given for further work bboth to increase the range of understood layouts and to improve the reliability of the extraction process.; The project also provides a starting point for converting to a more descriptive XML-based format, which can be integrated into Lixto to provide a wrapper generator for PDF documents.
Thematic program: Semi-Structured Data (2004)

Weigel, Felix (LMU München)	TU Vienna, Institute of Informationssystems, Seminarroom 184/2 (DBAI), Favoritenstraße 9-11 / 1842, 3rd floor	Mon, 13. Dec 04, 17:00
Effizientes XML-Retrieval mit Knotenidentifikationsschemata
In digitalen Bibliotheken, Intranets und auch im Web finden sich immer mehr und immer größere XML-Datenbestände, oft bereits im Gigabyte-Bereich. Als Beispiele seien Literaturdatenbanken, elektronische Lexika, technische Dokumentationen, annotierte linguistische Corpora und Sammlungen medizinischer, geographischer oder astronomischer Daten genannt. Spezielle Indexverfahren für XML-Baumanfragen unterstützen die effiziente Suche nach Struktur undText in diesen Ressourcen. Jüngere Ansätze verwenden jedoch zusätzlich bestimmte Indikatoren (IDs) für XML-Elemente, mit deren Hilfe sich die Baumrelationen in der Anfrage höchst effizient entscheiden lassen. Die aus der Literatur bekannten ID-Schemata unterscheiden sich hinsichtlich ihres Platzbedarfs (ID-Größe) und Zeitbedarfs (Entscheidungsaufwand), aber auch hinsichtlich ihrer Expressivität (Menge der entscheidbaren Relationen). Mitunter lassen sich sogar Teile der Nachbarschaft eines Knotens (z.B. seine Vorfahren oder Geschwister) allein aus der Knoten-ID rekonstruieren.; Der Vortrag gibt einen kurzen Überblick über die aktuelle Forschung zu ID-Schemata und stellt einen neuen Ansatz vor, der unter dem Namen BIRD an der LMU München entwickelt worden ist und in Kombination mit verschiedenen Indexstrukturen eingesetzt werden kann. Es wird gezeigt, dass ddas BIRD-Shema ein Obermange der XPath Axes entscheiden und viele Relationen rekonstruieren kann. Im experimentellen Vergleich mit anderen Schemata belegt BIRD Spitzenplätze in punkto Zeiteffizienz bei gleichzeitiger moderater ID-Größe und nimmt somit eine interessante Stellung im Trade-off zwischen Expressivität, Platz- und Zeitbedarf ein. Offen ist bisher, inwieweit das Schema von bestimmten Eigenschaften der Indexstruktur abhängig ist oder für Optimierungen profitieren kann.
Thematic program: Semi-Structured Data (2004)

webmaster [Printable version]