bronxbomber92
A question for the author(s) since they seem to be very responsive to this thread :).

1. How fine-grained is each task? In a traditional matrix multiplication kernel, for example, each thread block is responsible for a small output tile of the resulting matrix. In Mirage's mega kernel, would there correspondingly be a task for each small output tile?

2. How does the Mirage compiler form the task graph? Does it have domain knowledge of every operator's data flow at the granularity of individual elements? Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself the output of a prior matmul (+ nonlinearity), would the dependencies be all of the output tile tasks corresponding to those M_BLOCK rows of the operator that produced A?


zhihaojia
1. In MPK, each task is mapped to an individual SM. The amount of work handled by a task is similar to that of a thread block in the traditional kernel-per-operator approach.
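For intuition, here is a rough Python sketch of that granularity (the Task descriptor, tile sizes, and names are made up for illustration, not Mirage's actual API): each output tile of a matmul becomes one task, covering about the work a thread block would do in the kernel-per-operator world.

    from dataclasses import dataclass

    # Hypothetical task descriptor -- not Mirage's actual data structure.
    @dataclass(frozen=True)
    class Task:
        op: str        # which operator the task belongs to
        row_tile: int  # index of the M_BLOCK-row slab of the output
        col_tile: int  # index of the N_BLOCK-column slab of the output

    def matmul_tasks(op, M, N, M_BLOCK=64, N_BLOCK=64):
        """Partition an (M x N) matmul output into one task per output tile.
        Each task runs on a single SM and handles roughly the work a thread
        block would handle in the kernel-per-operator approach."""
        return [
            Task(op, i, j)
            for i in range((M + M_BLOCK - 1) // M_BLOCK)
            for j in range((N + N_BLOCK - 1) // N_BLOCK)
        ]

    print(len(matmul_tasks("matmul_B", M=1024, N=1024)))  # 256 tasks for 64x64 tiles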

2. TL;DR: MPK automatically analyzes inter-task dependencies by tracking the input and output tensors associated with each task. Longer version: MPK uses imap, omap, and fmap (see Section 2 of the Mirage paper) to determine each task's input and output tensors. A dependency is introduced between task A and task B if A produces any tensor elements that B consumes, i.e., if A's outputs overlap with B's inputs.
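As a rough illustration of that overlap test (not MPK's implementation; real tasks touch 2-D regions of several tensors, but 1-D row ranges keep the sketch short, and the task records here are invented for the example):

    from collections import defaultdict

    def build_dependencies(tasks):
        """tasks: dicts with 'id', 'writes', and 'reads', where 'writes'/'reads'
        map a tensor name to a half-open row range (lo, hi) touched by the task.
        Emits an edge (producer, consumer) whenever a producer's written range
        overlaps a consumer's read range on the same tensor."""
        writers = defaultdict(list)                    # tensor -> [(range, task id)]
        for t in tasks:
            for tensor, rng in t["writes"].items():
                writers[tensor].append((rng, t["id"]))

        edges = set()
        for t in tasks:
            for tensor, (rlo, rhi) in t["reads"].items():
                for (wlo, whi), producer in writers[tensor]:
                    if producer != t["id"] and wlo < rhi and rlo < whi:
                        edges.add((producer, t["id"]))  # ranges overlap -> dependency
        return edges

    # Two producer tasks each writing 64 rows of A, plus a consumer reading rows 0..128:
    tasks = [
        {"id": "A_rows_0_64",   "writes": {"A": (0, 64)},   "reads": {}},
        {"id": "A_rows_64_128", "writes": {"A": (64, 128)}, "reads": {}},
        {"id": "C_tile_0",      "writes": {"C": (0, 64)},   "reads": {"A": (0, 128)}},
    ]
    print(sorted(build_dependencies(tasks)))
    # [('A_rows_0_64', 'C_tile_0'), ('A_rows_64_128', 'C_tile_0')]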

> Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself the output of a prior matmul (+ nonlinearity), would the dependencies be all of the output tile tasks corresponding to those M_BLOCK rows of the operator that produced A?

Exactly. In this case, all output tile tasks that consume those M_BLOCK rows of A will depend on all tasks responsible for producing the corresponding parts of A in the previous operator.
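Concretely, assuming hypothetical 64x64 tiles and a 512-column A (these numbers are illustrative, not Mirage's configuration), the second matmul's tasks in row slab i depend on every task of the first operator that writes a tile inside that same row slab:

    # Chained ops: A = relu(matmul1(X, W1)) with A of shape (M, K); C = matmul2(A, W2).
    # Illustrative tile sizes only.
    M_BLOCK, K, K_BLOCK = 64, 512, 64

    def producers_for_matmul2_row_slab(i):
        """matmul2 tasks in row slab i read A[i*M_BLOCK:(i+1)*M_BLOCK, :], so they
        depend on every matmul1 task writing a tile in that row slab of A."""
        return [("matmul1", i, k) for k in range(K // K_BLOCK)]

    print(producers_for_matmul2_row_slab(3))
    # [('matmul1', 3, 0), ('matmul1', 3, 1), ..., ('matmul1', 3, 7)]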
