---
tags: cofacts
---

# Cofacts crowd-sourced transcript design doc

## Requirements

- The latest transcript is written to the `text` field of the article, to support
    - Searching by keyword / full-text search
    - Similar articles
- Supports versioning and records contributors
- ~~Supports crowd-sourced review~~ No review is needed if we have history; just roll back when vandalized, like our meeting notes on HackMD. [name=mrorz]
    - ~~We see "review" as positive feedback~~
    - ~~if the user is not satisfied with the latest transcript, they can submit new versions~~
- Supports [reverting to specific version](https://zh.wikipedia.org/zh-tw/Wikipedia:%E5%9B%9E%E9%80%80%E5%8A%9F%E8%83%BD) (Not implemented yet)
    - Should help fighting vandalism

## Integration

This section documents how the Cofacts system incorporates crowd-sourced transcripts.

1. When a new media article (video, audio, image) is created, API (rumors-api) would:
    - write the transcript in crowd-sourced transcript format (Y doc)
    - populate the `text` field of the article for indexing
    - Detail: [OCR and AI transcripts](https://g0v.hackmd.io/wkx286lmTDaFUpgRhnUawQ)
2. When the user starts editing the transcript, UI (rumors-site) would:
    - load the prosemirror editor
    - connect to the hocuspocus server (collab-server) to synchronize prosemirror state (including content changes and other users' cursors) upon edit
3. When the prosemirror state changes, collab-server would:
    - store the latest prosemirror state (in Y doc format, through `hocuspocus-extension-elasticsearch`)
    - store snapshots in the `ydocs` index in DB
    - store `text` in the `articles` index in DB
4. When the user views transcript history, UI would:
    - retrieve snapshots from API
    - visualize changes on a read-only prosemirror editor

## UI (rumors-site)

> https://github.com/cofacts/rumors-site

rumors-site provides the following functionalities:

- Shows `text`
    - The actual text being indexed by Elasticsearch and used in full-text search
- Edit button --> Activate real-time sync editor
    - Editor is implemented by prosemirror
    - The prosemirror state is just a paragraph with multi-line text nodes (contains `\n`)
- History view: similar to HackMD's
    - Also instantiates a prosemirror editor, but with `ychange` [marks](https://prosemirror.net/docs/ref/version/0.11.0.html) this time
    - Loads snapshots from the `versions` field of the `GetYdocs` API

:::spoiler Outdated design
Transcript UI design
![](https://s3-ap-northeast-1.amazonaws.com/g0v-hackmd-images/uploads/upload_f62a325508175c0e9843f2496c0eb91b.png =x400)

- Figma: https://www.figma.com/file/DvmAQjMJCncuPORWKnljM1/Cofacts-LIFF-and-new-designs?node-id=4514%3A923
- [Create first transcript flow](https://www.figma.com/proto/DvmAQjMJCncuPORWKnljM1/Cofacts-LIFF-and-new-designs?page-id=4514%3A923&node-id=4514%3A1404&viewport=565%2C378%2C0.31&scaling=min-zoom&starting-point-node-id=4514%3A1404&show-proto-sidebar=1)
    - Can create the first transcript
    - Encounters conflict (someone submits transcript after opening transcript editor)
    - Saves transcript after conflict
- [Edit transcript / show history flow](https://www.figma.com/proto/DvmAQjMJCncuPORWKnljM1/Cofacts-LIFF-and-new-designs?page-id=4514%3A923&node-id=4514%3A924&viewport=692%2C464%2C0.31&scaling=min-zoom&starting-point-node-id=4514%3A924&show-proto-sidebar=1)
    - Can view latest transcript
    - Can view transcription history
    - Can see "revert" option and "report" option
:::

## DB

> https://github.com/cofacts/rumors-db

### `ydocs` (Elasticsearch)

The yjs DB. Stores yjs documents, including snapshots and users.

![](https://i.imgur.com/o1QVVsC.png)

- `_id` as document name
- `data`: ydoc binary
    - users
    - latest text
- `versions`: version list
    - `snapshot`
    - `createdAt`

### `articles`

- `transcribedAt`: max `contributors.lastUpdatedAt`
- `contributors`
    - `contributors.userId`, `contributors.appId`
    - `contributors.lastUpdatedAt`: the last time this user contributed

## API (rumors-api)

> https://github.com/cofacts/rumors-api

### `CreateMediaArticle`

This API also creates the media transcript generated by AI. The AI generation is detailed in [OCR and AI transcripts](https://g0v.hackmd.io/wkx286lmTDaFUpgRhnUawQ).

When storing the AI transcript, the [API server](https://github.com/cofacts/rumors-api/blob/3a215c47079cbd88fff77d3e008d64a06a70430a/src/graphql/mutations/CreateMediaArticle.js#L176) would

1. Store the transcript in a prosemirror state with the `prosemirror-schema-basic` schema
    - This schema is essentially a doc with multiple paragraphs
    - Besides paragraphs and text, other nodes are not used
2. Encapsulate the prosemirror state in a Y doc
3. Calculate the initial snapshot and mark the snapshot as created by an AI transcriber
4. Store the Y doc and snapshot into the Elasticsearch `ydocs` index.

### `GetYdocs`

- Retrieves the Y doc from the `ydocs` index
- Can be used to load history

### `ListArticles`

- Can filter by contributors

:::spoiler Outdated design
### `UpdateArticleTranscript`

This will:
- Fill `text`, `transcribedAt` and `contributors` in the target article

Authentication is required to call this mutation.

#### Argument

- `articleId`: target article of this transcript
- `text`: the transcription text.
- `contributors`: [contributor]

#### Output

Type: `TranscriptionError`

- `TranscriptionError`: New object type that returns the latest transcript in a field
:::

## Collab-server

> https://github.com/cofacts/collab-server

- [`hocuspocus-extension-elasticsearch`](https://github.com/cofacts/collab-server/tree/master/hocuspocus-extension-elasticsearch)
    - A hocuspocus extension that synchronizes Y doc changes to the specified Elasticsearch instance (DB in Cofacts' case), using the specified index name.
    - The Y doc will be stored in [base64-encoded binary format](https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html) under the `ydoc` field.
- [Snapshot extension](https://github.com/cofacts/collab-server/blob/master/src/snapshot.ts)
    - Whenever the UI disconnects (editing finished), creates a snapshot and writes it to DB
        - Stores to the `versions` field of the `ydocs` index
    - Synchronizes the latest transcript to the `text` field of the `articles` index, to support full-text search of the `ListArticles` API.

## Previous survey

See https://g0v.hackmd.io/aJqHn8f5QGuBDLSMH_EinA#Transcriptions

## Other ideas

### Use transcription for hyperlinks

Useful for:
- YouTube links
    - YouTube links can support more info (duration, etc) via [microdata](https://g0v.hackmd.io/aJqHn8f5QGuBDLSMH_EinA#schemaorg-microdata)
    - However, YouTube videos tend to be long, and not all of them have subtitles.
    - Crowd-sourced transcripts can be helpful
- Facebook posts & other websites that cannot be fetched
    - consider providing a refetch function first
    - as a last resort, maybe provide crowd-sourced transcripts as well, so that users can copy-paste content from the website manually

Expected difficulties:
- `article.hyperlinks` are cached fields
- Compared to attachments, one link can be included in multiple articles
    - thus supporting transcribing hyperlinks means that we must update all articles' `hyperlinks` fields at the same time
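
The fan-out described in the last difficulty can be sketched as follows. This is a minimal illustration and not the rumors-api implementation: the `Article` and `Hyperlink` shapes, the `transcript` field, and the `applyHyperlinkTranscript` helper are all hypothetical. Only the idea that each article carries a cached `hyperlinks` field comes from the text above. The sketch works on in-memory objects and returns updated copies of every article that caches the transcribed URL.

```typescript
// Hypothetical shapes for illustration; only the cached `hyperlinks`
// field on articles is taken from the design doc.
interface Hyperlink {
  url: string;
  transcript?: string; // assumed field, not part of the current schema
}

interface Article {
  id: string;
  hyperlinks: Hyperlink[];
}

// Returns updated copies of every article whose cached `hyperlinks`
// contain `url`. Articles that do not cite the URL are left out, so the
// result doubles as the list of documents that need to be written back.
function applyHyperlinkTranscript(
  articles: Article[],
  url: string,
  transcript: string
): Article[] {
  return articles
    .filter((article) => article.hyperlinks.some((link) => link.url === url))
    .map((article) => ({
      ...article,
      hyperlinks: article.hyperlinks.map((link) =>
        link.url === url ? { ...link, transcript } : link
      ),
    }));
}

// Example data: a1 and a3 both cite the same link, a2 does not.
const articles: Article[] = [
  { id: "a1", hyperlinks: [{ url: "https://example.com/post" }] },
  { id: "a2", hyperlinks: [{ url: "https://example.com/other" }] },
  {
    id: "a3",
    hyperlinks: [
      { url: "https://example.com/other" },
      { url: "https://example.com/post" },
    ],
  },
];

const updated = applyHyperlinkTranscript(
  articles,
  "https://example.com/post",
  "Text typed in by a contributor"
);

// Lists the IDs of the articles that would need to be rewritten.
console.log(updated.map((article) => article.id));
```

One transcript edit here touches two articles, which is the cost the cached field introduces. Against a real index the same selection might be expressed as a single bulk or update-by-query request rather than a loop in application code, but that choice is outside what this doc specifies.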
