Abstract:
Model training has transformed how published content is used, dismantling the publishing sector's traditional tripartite framework of creators, disseminators, and users and replacing it with a quadripartite structure of creators, disseminators, data processors, and users. Regulating the use of published content data for model training within the existing copyright law framework tends to weaken the protection of publishers' interests. This approach faces several challenges: the ambiguous legal status of publishers, the conflation of copyrighted works with published content data as objects of protection, the inadequacy of existing transactional models, and the ineffectiveness of copyright infringement rules. To address these issues, the digital transformation of published content should be advanced by building comprehensive published content data corpora. Copyright and data protection should be integrated into a coordinated regime that explicitly recognizes the rights and interests of publishers as data processors. An "opt-out" mechanism for model training should be introduced to balance data protection with data utilization. Promoting transparency in the use of content for model training and restricting the unauthorized scraping of published content data are further necessary steps.