---
tags: cofacts
---
# Cofacts crowd-sourced transcript design doc
## Requirements
- The latest transcript is written to the `text` field of the article, to support:
    - Searching by keyword / full-text search
    - Similar articles
- Supports versioning and records contributors
- ~~Supports crowd-sourced review~~ No review needed if we have history -- just roll back when vandalized, like our meeting notes on HackMD. [name=mrorz]
    ~~- We see "review" as positive feedback~~
    ~~- If the user is not satisfied with the latest transcript, they can submit new versions~~
- Supports [reverting to a specific version](https://zh.wikipedia.org/zh-tw/Wikipedia:%E5%9B%9E%E9%80%80%E5%8A%9F%E8%83%BD) (Not implemented yet)
    - Should help fight vandalism
## Integration
This section documents how the Cofacts system incorporates crowd-sourced transcripts.
1. When a new media article (video, audio, image) is created, the API (rumors-api) would:
    - Write the transcript in the crowd-sourced transcript format (Y doc)
    - Populate the `text` field of the article for indexing
    - Detail: [OCR and AI transcripts](https://g0v.hackmd.io/wkx286lmTDaFUpgRhnUawQ)
2. When the user starts editing the transcript, the UI (rumors-site) would:
    - Load the prosemirror editor
    - Connect to the hocuspocus server (collab-server) to synchronize the prosemirror state (including content changes and other users' cursors) upon edit
3. When the prosemirror state changes, collab-server would:
    - Store the latest prosemirror state (in Y doc format, through `hocuspocus-extension-elasticsearch`)
    - Store snapshots in the `ydocs` index in DB
    - Store `text` in the `articles` index in DB
4. When the user views the transcript history, the UI would:
    - Retrieve snapshots from the API
    - Visualize changes in a read-only prosemirror editor
## UI (rumors-site)
> https://github.com/cofacts/rumors-site
rumors-site provides the following functionality:
- Shows `text`
    - The actual text being indexed by Elasticsearch and used in full-text search
- Edit button --> Activate real-time sync editor
    - The editor is implemented with prosemirror
    - The prosemirror state is just a paragraph with multi-line text nodes (contains `\n`)
- History view: similar to HackMD's
    - Also instantiates a prosemirror editor, but with `ychange` [marks](https://prosemirror.net/docs/ref/version/0.11.0.html) this time
    - Loads snapshots from the `versions` field of the `GetYdocs` API
:::spoiler Outdated design
Transcript UI design
![](https://s3-ap-northeast-1.amazonaws.com/g0v-hackmd-images/uploads/upload_f62a325508175c0e9843f2496c0eb91b.png =x400)
- Figma: https://www.figma.com/file/DvmAQjMJCncuPORWKnljM1/Cofacts-LIFF-and-new-designs?node-id=4514%3A923
- [Create first transcript flow](https://www.figma.com/proto/DvmAQjMJCncuPORWKnljM1/Cofacts-LIFF-and-new-designs?page-id=4514%3A923&node-id=4514%3A1404&viewport=565%2C378%2C0.31&scaling=min-zoom&starting-point-node-id=4514%3A1404&show-proto-sidebar=1)
    - Can create the first transcript
    - Encounters conflict (someone submits transcript after opening transcript editor)
    - Saves transcript after conflict
- [Edit transcript / show history flow](https://www.figma.com/proto/DvmAQjMJCncuPORWKnljM1/Cofacts-LIFF-and-new-designs?page-id=4514%3A923&node-id=4514%3A924&viewport=692%2C464%2C0.31&scaling=min-zoom&starting-point-node-id=4514%3A924&show-proto-sidebar=1)
    - Can view latest transcript
    - Can view transcription history
    - Can see "revert" option and "report" option
:::
## DB
> https://github.com/cofacts/rumors-db
### `ydocs` (elasticsearch)
yjs db
Stores yjs documents such as snapshots and users
![](https://i.imgur.com/o1QVVsC.png)
- `_id` as document name
- `data`: ydoc binary
    - users
    - latest text
- `versions`: version list
    - `snapshot`
    - `createdAt`
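
As an illustration of how the `versions` list could be consumed (for example by a history view), here is a minimal sketch. It assumes each entry stores `snapshot` as a base64 string and `createdAt` as an ISO 8601 timestamp. The `YdocVersion` type and `sortVersionsNewestFirst` helper are hypothetical names, not part of rumors-db.

```typescript
// Hypothetical shape of one entry in the `versions` list of a `ydocs` document.
// Assumption: `snapshot` is a base64 string and `createdAt` an ISO 8601 timestamp.
interface YdocVersion {
  snapshot: string;
  createdAt: string;
}

// Returns a new array ordered newest-first; the input array is left untouched.
function sortVersionsNewestFirst(
  versions: readonly YdocVersion[]
): YdocVersion[] {
  return [...versions].sort(
    (a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt)
  );
}

// Made-up sample data, only to show the call.
const exampleVersions: YdocVersion[] = [
  { snapshot: 'AAA=', createdAt: '2023-01-01T00:00:00.000Z' },
  { snapshot: 'AQE=', createdAt: '2023-03-01T00:00:00.000Z' },
  { snapshot: 'AgI=', createdAt: '2023-02-01T00:00:00.000Z' },
];

console.log(sortVersionsNewestFirst(exampleVersions).map((v) => v.createdAt));
```

Sorting a copy keeps the stored order intact, which matters if the list is later written back to the index unchanged.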
### `articles`
- `transcribedAt`: max `contributors.lastUpdatedAt`
- `contributors`
    - `contributors.userId`, `contributors.appId`
    - `contributors.lastUpdatedAt`: last contribution time of the user
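
The relationship between `transcribedAt` and `contributors` can be sketched as follows. This is only an illustration: the `Contributor` type and `computeTranscribedAt` helper are hypothetical, and timestamps are assumed to be ISO 8601 strings.

```typescript
// Hypothetical contributor entry; field names follow the `articles` notes above.
interface Contributor {
  userId: string;
  appId: string;
  lastUpdatedAt: string; // assumed to be an ISO 8601 timestamp
}

// transcribedAt is the maximum contributors.lastUpdatedAt,
// or null when there are no contributors yet.
function computeTranscribedAt(
  contributors: readonly Contributor[]
): string | null {
  let latest: string | null = null;
  for (const { lastUpdatedAt } of contributors) {
    if (latest === null || Date.parse(lastUpdatedAt) > Date.parse(latest)) {
      latest = lastUpdatedAt;
    }
  }
  return latest;
}

// Made-up sample data, only to show the call.
const exampleContributors: Contributor[] = [
  { userId: 'user-a', appId: 'WEBSITE', lastUpdatedAt: '2023-01-05T08:00:00.000Z' },
  { userId: 'user-b', appId: 'WEBSITE', lastUpdatedAt: '2023-02-10T12:30:00.000Z' },
];

console.log(computeTranscribedAt(exampleContributors));
```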
## API (rumors-api)
> https://github.com/cofacts/rumors-api
### `CreateMediaArticle`
This API also creates the AI-generated media transcript.
AI generation is described in [OCR and AI transcripts](/wkx286lmTDaFUpgRhnUawQ).
When storing the AI transcript, the [API server](https://github.com/cofacts/rumors-api/blob/3a215c47079cbd88fff77d3e008d64a06a70430a/src/graphql/mutations/CreateMediaArticle.js#L176) would:
1. Store the transcript in a prosemirror state with `prosemirror-schema-basic` schema
    - This schema is essentially a doc with multiple paragraphs
    - Besides paragraphs and text, other nodes are not used
2. Encapsulate the prosemirror state in a Y doc
3. Calculate initial snapshot and set the snapshot as created by an AI transcriber
4. Store the Y doc and snapshot into Elasticsearch `ydocs` index.
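
Step 1 can be illustrated with a small sketch that turns a plain-text transcript into the JSON form of a prosemirror document using only `doc`, `paragraph` and `text` nodes. It assumes one paragraph per line, which may differ from what rumors-api actually does. The `transcriptToDocJSON` helper and the type names are hypothetical. Steps 2 to 4 need Yjs and are not shown.

```typescript
// JSON shapes for the subset of prosemirror-schema-basic used here.
interface TextNodeJSON {
  type: 'text';
  text: string;
}
interface ParagraphJSON {
  type: 'paragraph';
  content?: TextNodeJSON[];
}
interface DocJSON {
  type: 'doc';
  content: ParagraphJSON[];
}

// Assumption: each line of the transcript becomes one paragraph.
function transcriptToDocJSON(transcript: string): DocJSON {
  const paragraphs = transcript.split(/\r?\n/).map(
    (line): ParagraphJSON =>
      // ProseMirror does not allow empty text nodes,
      // so a blank line becomes a paragraph without content.
      line.length > 0
        ? { type: 'paragraph', content: [{ type: 'text', text: line }] }
        : { type: 'paragraph' }
  );
  return { type: 'doc', content: paragraphs };
}

console.log(JSON.stringify(transcriptToDocJSON('first line\n\nthird line'), null, 2));
```

An empty transcript still yields one empty paragraph, so the resulting doc always has at least one block node.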
### `GetYdocs`
- Retrieves Y doc from `ydocs` index
- Can be used to load history
### `ListArticles`
- Can filter by contributors
:::spoiler Outdated design
### `UpdateArticleTranscript`
This will:
- Fill `text`, `transcribedAt` and `contributors` in the target article
Authentication is required to call this mutation.
#### Argument
- `articleId`: target article of this transcript
- `text`: the transcription text.
- `contributors`: [contributor]
#### Output
Type: `TranscriptionError`
- `TranscriptionError`: New object type that returns the latest transcript in a field
:::
## Collab-server
> https://github.com/cofacts/collab-server
- [`hocuspocus-extension-elasticsearch`](https://github.com/cofacts/collab-server/tree/master/hocuspocus-extension-elasticsearch)
    - A hocuspocus extension that synchronizes Y doc changes to a specified Elasticsearch instance (the DB in Cofacts' case), using a specified index name.
    - The Y doc is stored in [base64-encoded binary format](https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html) under the `ydoc` field.
- [Snapshot extension](https://github.com/cofacts/collab-server/blob/master/src/snapshot.ts)
    - Whenever the UI disconnects (the user finishes editing), creates a snapshot and writes it to the DB
    - Stores it in the `versions` field of the `ydocs` index
    - Synchronizes the latest transcript to the `text` field of the `articles` index, to support full-text search in the `ListArticles` API.
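
To illustrate the last point, the sketch below flattens the JSON form of a prosemirror document into the plain string that would go into `text`. It is an assumption-laden illustration: the real snapshot extension works on the Y doc, while this version works on plain JSON. The names `NodeJSON`, `nodeText` and `docToPlainText` are hypothetical.

```typescript
// Generic JSON shape of a prosemirror node (only the fields used here).
interface NodeJSON {
  type: string;
  text?: string;
  content?: NodeJSON[];
}

// Concatenates the text of a node and all of its descendants.
function nodeText(node: NodeJSON): string {
  if (node.type === 'text') return node.text ?? '';
  return (node.content ?? []).map(nodeText).join('');
}

// Assumption: top-level blocks (paragraphs) are joined with a newline.
function docToPlainText(doc: NodeJSON): string {
  return (doc.content ?? []).map(nodeText).join('\n');
}

// Made-up sample document, only to show the call.
const exampleDoc: NodeJSON = {
  type: 'doc',
  content: [
    { type: 'paragraph', content: [{ type: 'text', text: 'first line' }] },
    { type: 'paragraph' },
    { type: 'paragraph', content: [{ type: 'text', text: 'third line' }] },
  ],
};

console.log(docToPlainText(exampleDoc));
```

A single paragraph whose text node already contains `\n` passes through unchanged, so the sketch covers both the multi-paragraph and the single-paragraph shapes mentioned in this document.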
## Previous survey
See https://g0v.hackmd.io/aJqHn8f5QGuBDLSMH_EinA#Transcriptions
## Other ideas
### Use transcription for hyperlinks
Useful for:
- YouTube links
    - YouTube links can support more info (duration, etc.) via [microdata](https://g0v.hackmd.io/aJqHn8f5QGuBDLSMH_EinA#schemaorg-microdata)
    - However, YouTube videos tend to be long, and not all of them have subtitles.
    - Crowd-sourced transcripts can be helpful
- Facebook posts & other websites that cannot be fetched
    - Consider providing a refetch function first
    - As a last resort, maybe provide crowd-sourced transcripts as well, so that users can copy-paste content from the website manually
Expected difficulties:
- `article.hyperlinks` are cached fields
- Unlike an attachment, one link can be included in multiple articles
    - Thus, supporting transcription of hyperlinks means that we must update the `hyperlinks` field of all affected articles at the same time