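Interrupt Posting builds on Interrupt Remapping to further improve interrupt handling efficiency for passthrough devices. In posted mode, a vcpu can handle interrupts directly in non-root mode without a vm-exit to the host.
1. To support interrupt posting, each vcpu data structure gains a Posted-Interrupt Descriptor structure, which holds the state of the interrupts being posted.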
struct pi_desc {
    // One bit per vector, 256 vectors in total; posting a vector sets its bit to 1
    u32 pir[8];     /* Posted interrupt requested */
    union {
        struct {
            /* bit 256 - Outstanding Notification */
            // Set when a posted interrupt is outstanding
            u16 on : 1,
            /* bit 257 - Suppress Notification */
            // When set, non-urgent interrupts do not trigger an immediate notification
                sn : 1,
            /* bit 271:258 - Reserved */
                rsvd_1 : 14;
            /* bit 279:272 - Notification Vector */
            // Takes one of two values:
            // 1) POSTED_INTR_VECTOR
            //    Set when the vcpu is first started, or when a blocked vcpu is scheduled to run again;
            //    the hardware notification is then delivered straight to the vcpu
            // 2) POSTED_INTR_WAKEUP_VECTOR
            //    Set while the vcpu is blocked; the hardware notification then goes to the physical cpu
            //    the vcpu belongs to, which wakes the vcpu so that it can handle the posted interrupt
            u8 nv;
            /* bit 287:280 - Reserved */
            u8 rsvd_2;
            /* bit 319:288 - Notification Destination */
            u32 ndst;
        };
        u64 control;
    };
    u32 rsvd[6];
} __aligned(64);
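The way the pir bitmap and the ON/SN bits interact can be illustrated with a small userspace model. This is only a sketch: `pi_desc_model`, `pi_post_vector` and `pi_test_vector` are hypothetical names, the model is single-threaded (real hardware and kvm update these fields atomically), and it covers non-urgent interrupts only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Userspace model of the descriptor layout, NOT the kernel definition. */
struct pi_desc_model {
    uint32_t pir[8];    /* bits 255:0, one bit per vector */
    uint64_t control;   /* bit 0 = ON, bit 1 = SN, bits 23:16 = NV, bits 63:32 = NDST */
    uint32_t rsvd[6];
} __attribute__((aligned(64)));

_Static_assert(sizeof(struct pi_desc_model) == 64, "descriptor must be 64 bytes");

#define PI_ON (1ULL << 0)
#define PI_SN (1ULL << 1)

/*
 * Record a non-urgent interrupt in the descriptor.
 * Returns true when a notification event should be sent, i.e. when no
 * notification is already outstanding (ON) and none is suppressed (SN).
 */
static bool pi_post_vector(struct pi_desc_model *pi, uint8_t vector)
{
    assert(pi != NULL);
    pi->pir[vector / 32] |= 1u << (vector % 32);
    if (pi->control & (PI_ON | PI_SN))
        return false;
    pi->control |= PI_ON;
    return true;
}

/* Check whether a vector is currently pending in the pir bitmap. */
static bool pi_test_vector(const struct pi_desc_model *pi, uint8_t vector)
{
    return (pi->pir[vector / 32] >> (vector % 32)) & 1u;
}
```

Posting a second vector while ON is still set only updates pir; the pending notification already covers it.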
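2. Two new fields are added to the vmcs: POSTED_INTR_NV and POSTED_INTR_DESC_ADDR. When a vcpu is created, kvm checks whether APIC virtualization is enabled (controlled by the kvm_intel module parameter enable_apicv). If it is, kvm writes the physical address of pi_desc to POSTED_INTR_DESC_ADDR in the vmcs and sets POSTED_INTR_NV to POSTED_INTR_VECTOR. During vcpu scheduling, if the vcpu is blocked, kvm sets pi_desc.nv to POSTED_INTR_WAKEUP_VECTOR, so that when the IOMMU has an interrupt to post, its notification vector goes to the physical cpu the vcpu belongs to, and that physical cpu then wakes the vcpu to handle it. When the vcpu is running, pi_desc.nv is set back to POSTED_INTR_VECTOR, so the IOMMU notification is handled by the vcpu directly, without a vm-exit.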
if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
    vmcs_write64(EOI_EXIT_BITMAP0, 0);
    vmcs_write64(EOI_EXIT_BITMAP1, 0);
    vmcs_write64(EOI_EXIT_BITMAP2, 0);
    vmcs_write64(EOI_EXIT_BITMAP3, 0);
    vmcs_write16(GUEST_INTR_STATUS, 0);
    vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
    vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
}
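The switching of pi_desc.nv during scheduling described in step 2 can be sketched as follows. This is a simplified model rather than kvm's code: `pi_notify`, `vcpu_state` and `pi_update_on_sched` are hypothetical names, and the two vector values are illustrative placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative placeholder values; the real ones come from the kernel headers. */
#define POSTED_INTR_VECTOR        0xf2
#define POSTED_INTR_WAKEUP_VECTOR 0xf1

enum vcpu_state { VCPU_RUNNING, VCPU_BLOCKED };

/* Only the notification fields of the descriptor are modelled here. */
struct pi_notify {
    uint8_t  nv;    /* notification vector */
    uint32_t ndst;  /* APIC ID of the physical cpu to notify */
};

/*
 * Pick the notification vector for the vcpu's new scheduling state:
 * a blocked vcpu needs a host-side wakeup, a running vcpu can take the
 * notification directly in non-root mode.
 */
static void pi_update_on_sched(struct pi_notify *pi, enum vcpu_state state,
                               uint32_t pcpu_apic_id)
{
    assert(pi != NULL);
    pi->nv = (state == VCPU_BLOCKED) ? POSTED_INTR_WAKEUP_VECTOR
                                     : POSTED_INTR_VECTOR;
    pi->ndst = pcpu_apic_id;
}
```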
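3. IOMMU irte entries use remapped mode by default. To use posted mode, kvm has to switch the irte's mode to posted and pass the pi_desc address to the IOMMU hardware. kvm uses a producer/consumer model to carry out this change of interrupt mode.
Qemu first creates an eventfd and registers it with vfio for VFIO_PCI_MSIX_IRQ_INDEX (through the VFIO_DEVICE_SET_IRQS ioctl), then passes the same eventfd file descriptor to kvm through KVM_IRQFD.
1) vfio registers the producer and initializes its token and irq, where token is the eventfd context obtained from the file descriptor and irq is the host irq;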
vfio_msi_set_vector_signal
vfio_msi_set_block
vfio_msi_set_vector_signal
irq_bypass_register_producer (registers the producer)
trigger = eventfd_ctx_fdget(fd);
vdev->ctx[vector].producer.token = trigger;
vdev->ctx[vector].producer.irq = irq;
2) kvm registers a consumer using the eventfd that vfio was given, then calls vmx_update_pi_irte to decide whether to switch the hardware to posted mode;
kvm_irqfd_assign
irq_bypass_register_consumer
__connect
add_producer
kvm_arch_irq_bypass_add_producer
vmx_update_pi_irte
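The pairing of a producer and a consumer through a shared token can be illustrated with the following sketch. It only models the matching idea; `bypass_manager`, `register_producer`, `register_consumer` and `bypass_connect` are hypothetical names, not the kernel's irq bypass API, and locking and unregistration are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENDPOINTS 8

struct producer { const void *token; int host_irq; };
struct consumer { const void *token; int guest_irq; int connected_host_irq; };

struct bypass_manager {
    struct producer *producers[MAX_ENDPOINTS];
    size_t nprod;
    struct consumer *consumers[MAX_ENDPOINTS];
    size_t ncons;
};

/* Stand-in for the add_producer callback: remember which host irq feeds us. */
static void bypass_connect(struct producer *p, struct consumer *c)
{
    c->connected_host_irq = p->host_irq;
}

/* Register a producer and connect it to any consumer with the same token. */
static int register_producer(struct bypass_manager *m, struct producer *p)
{
    assert(m != NULL && p != NULL);
    if (m->nprod == MAX_ENDPOINTS)
        return -1;
    m->producers[m->nprod++] = p;
    for (size_t i = 0; i < m->ncons; i++)
        if (m->consumers[i]->token == p->token)
            bypass_connect(p, m->consumers[i]);
    return 0;
}

/* Register a consumer and connect it to any producer with the same token. */
static int register_consumer(struct bypass_manager *m, struct consumer *c)
{
    assert(m != NULL && c != NULL);
    if (m->ncons == MAX_ENDPOINTS)
        return -1;
    m->consumers[m->ncons++] = c;
    for (size_t i = 0; i < m->nprod; i++)
        if (m->producers[i]->token == c->token)
            bypass_connect(m->producers[i], c);
    return 0;
}
```

Because matching happens on both registration paths, the connection is made regardless of whether vfio or kvm registers first.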
static int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
                              uint32_t guest_irq, bool set)
{
    // 1. Check that there is a passthrough device and that the iommu hardware supports posting
    if (!kvm_arch_has_assigned_device(kvm) ||
        !irq_remapping_cap(IRQ_POSTING_CAP))
        return 0;
    hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
        if (e->type != KVM_IRQ_ROUTING_MSI)
            continue;
        /*
         * VT-d PI cannot support posting multicast/broadcast
         * interrupts to a vCPU, we still use interrupt remapping
         * for these kind of interrupts.
         *
         * For lowest-priority interrupts, we only support
         * those with single CPU as the destination, e.g. user
         * configures the interrupts via /proc/irq or uses
         * irqbalance to make the interrupts single-CPU.
         *
         * We will support full lowest-priority interrupt later.
         */
        kvm_set_msi_irq(kvm, e, &irq);
        // Check whether the virtual interrupt routes to exactly one vcpu; if not, only remapped mode can be used
        if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
            /*
             * Make sure the IRTE is in remapped mode if
             * we don't handle it in posted mode.
             */
            ret = irq_set_vcpu_affinity(host_irq, NULL);
            if (ret < 0) {
                printk(KERN_INFO
                       "failed to back to remapped mode, irq: %u\n",
                       host_irq);
                goto out;
            }
            continue;
        }
        // Fill in the pi_desc address and the vector
        vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
        vcpu_info.vector = irq.vector;
        // Ask the iommu to update the irte entry
        irq_set_vcpu_affinity(host_irq, &vcpu_info);
    }
}
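The decision made in the loop above (posted mode only when the interrupt targets exactly one vcpu, remapped mode otherwise) can be sketched in isolation. The destination encoding here is an assumption made for illustration: `dest_mask` is a hypothetical bitmap with one bit per vcpu, whereas kvm derives the destination from the MSI routing entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum irte_mode { IRTE_REMAPPED, IRTE_POSTED };

/*
 * dest_mask: bit i set means vcpu i is a destination (hypothetical encoding).
 * Returns true and stores the vcpu index only when exactly one bit is set.
 */
static bool is_single_vcpu(uint64_t dest_mask, int *vcpu_id)
{
    if (dest_mask == 0 || (dest_mask & (dest_mask - 1)) != 0)
        return false;
    int id = 0;
    while (!(dest_mask & 1)) {
        dest_mask >>= 1;
        id++;
    }
    *vcpu_id = id;
    return true;
}

/* Multicast and broadcast interrupts fall back to remapped mode. */
static enum irte_mode choose_irte_mode(uint64_t dest_mask, int *vcpu_id)
{
    assert(vcpu_id != NULL);
    return is_single_vcpu(dest_mask, vcpu_id) ? IRTE_POSTED : IRTE_REMAPPED;
}
```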
irq_set_vcpu_affinity eventually calls intel_ir_set_vcpu_affinity, which is where the irte is actually modified:
static int intel_ir_set_vcpu_affinity(int irq, void *info)
{
    struct irq_2_iommu *ir_data = irq_2_iommu(irq);
    struct vcpu_data *vcpu_pi_info = info;
    /* stop posting interrupts, back to remapping mode */
    if (!vcpu_pi_info) {
        modify_irte(irq, get_irte(ir_data));
    } else {
        struct irte irte_pi;
        /*
         * We are not caching the posted interrupt entry. We
         * copy the data from the remapped entry and modify
         * the fields which are relevant for posted mode. The
         * cached remapped entry is used for switching back to
         * remapped mode.
         */
        memset(&irte_pi, 0, sizeof(irte_pi));
        dmar_copy_shared_irte(&irte_pi, get_irte(ir_data));
        /* Update the posted mode fields */
        // Put the irte into posted mode
        irte_pi.p_pst = 1;
        irte_pi.p_urgent = 0;
        // Set the vector
        irte_pi.p_vector = vcpu_pi_info->vector;
        // Set the pi_desc address
        irte_pi.pda_l = (vcpu_pi_info->pi_desc_addr >>
                         (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
        irte_pi.pda_h = (vcpu_pi_info->pi_desc_addr >> 32) &
                        ~(-1UL << PDA_HIGH_BIT);
        // Write the irte
        modify_irte(irq, &irte_pi);
    }
    return 0;
}
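5. Once posted mode has been initialized as described above, the IOMMU checks the mode flag in the irte when an interrupt arrives and delivers it in posted mode. The interrupt remapping table entry used in posted mode is shown below. Its initial values are the ones written by intel_ir_set_vcpu_affinity above, and the notification fields of the pi_desc it points to change as the vcpu is scheduled (see step 2).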
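How the pi_desc physical address is packed into the pda_l and pda_h fields of the posted irte can be sketched as below. The sketch assumes PDA_LOW_BIT is 26 and PDA_HIGH_BIT is 32, which matches a 64-byte aligned descriptor whose low 6 address bits are always zero; `pda_fields`, `pda_split` and `pda_join` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field widths: 26 low address bits (bits 31:6) and 32 high bits. */
#define PDA_LOW_BIT  26
#define PDA_HIGH_BIT 32

struct pda_fields {
    uint32_t pda_l;  /* address bits 31:6 */
    uint32_t pda_h;  /* address bits 63:32 */
};

/* Split a 64-byte aligned descriptor address into the two irte fields. */
static struct pda_fields pda_split(uint64_t pi_desc_addr)
{
    struct pda_fields f;

    assert((pi_desc_addr & 0x3f) == 0);  /* pi_desc is __aligned(64) */
    f.pda_l = (uint32_t)((pi_desc_addr >> (32 - PDA_LOW_BIT)) &
                         ~(~0ULL << PDA_LOW_BIT));
    f.pda_h = (uint32_t)((pi_desc_addr >> 32) & ~(~0ULL << PDA_HIGH_BIT));
    return f;
}

/* Rebuild the address the way the hardware would interpret the fields. */
static uint64_t pda_join(struct pda_fields f)
{
    return ((uint64_t)f.pda_h << 32) |
           ((uint64_t)f.pda_l << (32 - PDA_LOW_BIT));
}
```

Dropping the low 6 bits loses nothing because of the alignment, so splitting and joining round-trips the address.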
6. To use Interrupt Posting, the cpu must also support apicv. apicv emulates a set of virtual apic registers in the virtual-APIC page referenced by the vmcs, for example:
VTPR (virtual task-priority register);
VPPR (virtual processor-priority register);
VEOI (virtual end-of-interrupt register);
VISR (virtual interrupt-service register);
VIRR (virtual interrupt-request register);
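When the IOMMU decides to deliver an interrupt in posted mode, it records the interrupt in pi_desc and then sends a notification event to tell the vcpu that an external interrupt is pending. (This assumes the vcpu is running, so POSTED_INTR_VECTOR is sent to it. If POSTED_INTR_WAKEUP_VECTOR is sent instead, kvm performs the injection, which also relies on apicv.) The cpu sets the VIRR bits that correspond to the pending interrupts and then processes them through virtual interrupt delivery. Once handling is complete, the EOI operation is performed. The virtual interrupt delivery logic is as follows, where RVI and SVI are stored in the Guest interrupt status field of the vmcs (RVI: requesting virtual interrupt; SVI: servicing virtual interrupt). In this way, the device interrupt is handled entirely in non-root mode.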
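The sequence described above (merge pir into VIRR, update RVI, then deliver the highest-priority virtual interrupt) can be modelled with a short sketch. It is a simplified software model of the hardware behaviour, with hypothetical names (`vapic_model`, `process_posted_interrupts`, `deliver_virtual_interrupt`); it ignores interrupt-window exiting and EOI virtualization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vapic_model {
    uint32_t pir[8];   /* posted-interrupt requests from pi_desc */
    bool     on;       /* outstanding notification bit */
    uint32_t virr[8];  /* virtual interrupt-request register */
    uint32_t visr[8];  /* virtual interrupt-service register */
    uint8_t  vppr;     /* virtual processor-priority register */
    uint8_t  rvi;      /* requesting virtual interrupt */
    uint8_t  svi;      /* servicing virtual interrupt */
};

/* Index of the highest set bit in a 256-bit bitmap, or -1 if it is empty. */
static int highest_set_bit(const uint32_t bm[8])
{
    for (int w = 7; w >= 0; w--)
        for (int b = 31; b >= 0; b--)
            if (bm[w] & (1u << b))
                return w * 32 + b;
    return -1;
}

/* On a notification event: clear ON, OR pir into VIRR, raise RVI if needed. */
static void process_posted_interrupts(struct vapic_model *v)
{
    assert(v != NULL);
    v->on = false;
    int top = highest_set_bit(v->pir);
    for (int w = 0; w < 8; w++) {
        v->virr[w] |= v->pir[w];
        v->pir[w] = 0;
    }
    if (top > v->rvi)
        v->rvi = (uint8_t)top;
}

/* A virtual interrupt is pending when RVI's priority class exceeds VPPR's. */
static bool virtual_interrupt_pending(const struct vapic_model *v)
{
    return (v->rvi >> 4) > (v->vppr >> 4);
}

/* Deliver the pending virtual interrupt; returns its vector, or -1 if none. */
static int deliver_virtual_interrupt(struct vapic_model *v)
{
    if (!virtual_interrupt_pending(v))
        return -1;
    uint8_t vec = v->rvi;
    v->visr[vec / 32] |= 1u << (vec % 32);
    v->svi = vec;
    v->vppr = vec & 0xf0;
    v->virr[vec / 32] &= ~(1u << (vec % 32));
    int next = highest_set_bit(v->virr);
    v->rvi = (next < 0) ? 0 : (uint8_t)next;
    return vec;
}
```

After a delivery, a lower-priority vector that is still set in VIRR stays pending until the guest's EOI lowers VPPR again, which this model does not implement.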